Resources

Code, data, and citation

Everything you need to reproduce the results and extend the benchmark.

Benchmark & pipeline

A uv-managed Python package covering Stage 1 (query), Stage 2 (LLM-as-judge), and Stage 3 (aggregate), plus the mechanistic-analysis pipeline for Study 1.

GitHub repository ↗

Data

~500 instances across 15 H × C cells + 30 controls. Each instance is a self-contained JSON object with a minimal pair and three explicitness variants.

Dataset (JSON) ↗
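Since each instance is self-contained JSON, it can be round-tripped and validated in isolation. The sketch below shows one plausible instance shape; the field names and values here are illustrative assumptions, not the documented schema (see data/README.md in the repository for the authoritative one).

```python
import json

# Illustrative shape of one benchmark instance. Field names and values
# are assumptions for this sketch; consult data/README.md for the real schema.
instance = {
    "id": "hob-0001",
    "cell": {"H": "...", "C": "..."},            # one of the 15 H × C cells
    "question": "...",
    "minimal_pair": {"question": "..."},          # paired variant with the constraint flipped
    "explicitness_variants": ["implicit", "hinted", "explicit"],
    "gold_answer": "...",
}

# Self-contained instances round-trip through JSON independently.
blob = json.dumps(instance)
assert json.loads(blob) == instance
```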

Quickstart

# 1. Install
uv sync
cp .env.example .env              # fill in API keys you plan to use

# 2. Evaluate a model (collect + judge end-to-end)
PYTHONPATH=src python -m hob.run_benchmark --models gpt-5.4 --trials 10

# 3. Or the staged pipeline (judge once for many models)
PYTHONPATH=src python -m hob.collect  --model gpt-5.4 --trials 10
PYTHONPATH=src python -m hob.judge    --input-dir results/
PYTHONPATH=src python -m hob.analysis --input-dir results/

Supported model keys: gpt-5.2, gpt-5.4, claude-opus-4.6, claude-sonnet-4.5, deepseek-r1, gemini-3-pro, grok-4.2, kimi-k2.5, gpt-oss-120b, llama-4, qwen3-32b, qwen3-14b, qwen3.5-27b, gpt-oss-20b.

Evaluation protocol

  1. Present question zero-shot (no system prompt).
  2. Collect N = 10 trials per instance per model.
  3. Judge each response with Qwen3-32B → correct / incorrect / ambiguous.
  4. Strict accuracy: an instance counts as correct only if all N trials are correct.
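The strict-accuracy rule in step 4 can be sketched in a few lines of Python. The data below is illustrative, not from the benchmark, and the function name is ours; the three-way labels follow the judge's output in step 3.

```python
# Strict accuracy over per-trial judge labels: an instance counts as
# correct only if ALL of its N trials are judged "correct".
# Labels are the judge's three-way output: "correct" / "incorrect" / "ambiguous".
results = {
    "inst-001": ["correct"] * 10,                    # all trials correct -> counts
    "inst-002": ["correct"] * 9 + ["ambiguous"],     # one non-correct trial -> fails
    "inst-003": ["incorrect"] * 10,                  # fails
}

def strict_accuracy(results):
    """Fraction of instances whose every trial was judged correct."""
    solved = sum(all(label == "correct" for label in trials)
                 for trials in results.values())
    return solved / len(results)

print(strict_accuracy(results))  # → 0.3333333333333333 (1 of 3 instances)
```

Note that a single ambiguous trial fails the instance under this rule, which is what makes the metric strict.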

BibTeX

@article{li2026model,
  title={The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning},
  author={Li, Yubo and Zhang, Lu and Jiang, Tianchong and Krishnan, Ramayya and Padman, Rema},
  journal={arXiv preprint arXiv:2603.29025},
  year={2026}
}

Questions, issues, contributions

Open an issue or discussion on the GitHub repository, or email the corresponding author: yubol@andrew.cmu.edu.

To add a new instance, add a JSON object under instances in data/hob_benchmark_full.json; the schema is documented in data/README.md.
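Programmatically, that amounts to a load–append–dump round trip. The sketch below writes to a temporary stand-in file rather than the real dataset, and the fields of the new instance are placeholders; only the top-level instances key is taken from the description above.

```python
import json
import os
import tempfile

# Stand-in for data/hob_benchmark_full.json so this sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "hob_benchmark_full.json")
with open(path, "w") as f:
    json.dump({"instances": []}, f)

new_instance = {"id": "hob-9999", "question": "..."}  # placeholder fields

# Load, append under "instances", and write back.
with open(path) as f:
    data = json.load(f)
data["instances"].append(new_instance)
with open(path, "w") as f:
    json.dump(data, f, indent=2)
```

Validate the edited file against data/README.md before opening a pull request.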