Everything you need to reproduce the results and extend the benchmark.
A uv-managed Python package covering Stage 1 (query), Stage 2 (LLM-as-judge), and Stage 3 (aggregate), plus the mechanistic-analysis pipeline for Study 1.
GitHub repository ↗

~500 instances across 15 H × C cells plus 30 controls. Each instance is a self-contained JSON object with a minimal pair and three explicitness variants.
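The per-instance layout described above can be sketched in Python. Note that the field names below (`cell`, `minimal_pair`, `explicitness_variants`) and the example prompt text are illustrative assumptions, not the authoritative schema, which is documented in data/README.md.

```python
import json

# Hypothetical instance mirroring the description above: self-contained JSON
# with a minimal pair and three explicitness variants. All field names and
# prompt text here are illustrative placeholders, not the real schema.
instance = {
    "id": "hob-0001",
    "cell": {"H": "distance", "C": "weather"},  # one of the 15 H x C cells
    "minimal_pair": {
        "with_constraint": "It is pouring rain. Should I walk or drive the 0.5 miles?",
        "without_constraint": "Should I walk or drive the 0.5 miles?",
    },
    "explicitness_variants": ["implicit", "hinted", "explicit"],
}

def validate(inst):
    """Minimal structural check for a benchmark instance."""
    assert {"id", "cell", "minimal_pair", "explicitness_variants"} <= inst.keys()
    assert len(inst["explicitness_variants"]) == 3
    json.dumps(inst)  # must round-trip as self-contained JSON
    return True

print(validate(instance))  # → True
```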
Dataset (JSON) ↗

# 1. Install
uv sync
cp .env.example .env   # fill in the API keys you plan to use

# 2. Evaluate a model (collect + judge end-to-end)
PYTHONPATH=src python -m hob.run_benchmark --models gpt-5.4 --trials 10

# 3. Or run the staged pipeline (judge once for many models)
PYTHONPATH=src python -m hob.collect --model gpt-5.4 --trials 10
PYTHONPATH=src python -m hob.judge --input-dir results/
PYTHONPATH=src python -m hob.analysis --input-dir results/
Supported model keys: gpt-5.2, gpt-5.4, claude-opus-4.6, claude-sonnet-4.5, deepseek-r1, gemini-3-pro, grok-4.2, kimi-k2.5, gpt-oss-120b, llama-4, qwen3-32b, qwen3-14b, qwen3.5-27b, gpt-oss-20b.
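A common pattern with the staged pipeline is to run hob.collect once per supported model key and then run the judge and analysis stages a single time over results/. The sketch below only builds the command lines (it does not execute anything); the flag names are taken from the quickstart above.

```python
# All supported model keys, copied from the list above.
MODEL_KEYS = [
    "gpt-5.2", "gpt-5.4", "claude-opus-4.6", "claude-sonnet-4.5",
    "deepseek-r1", "gemini-3-pro", "grok-4.2", "kimi-k2.5",
    "gpt-oss-120b", "llama-4", "qwen3-32b", "qwen3-14b",
    "qwen3.5-27b", "gpt-oss-20b",
]

def staged_commands(models, trials=10):
    """One hob.collect invocation per model, then a single judge + analysis pass."""
    cmds = [["python", "-m", "hob.collect", "--model", m, "--trials", str(trials)]
            for m in models]
    cmds.append(["python", "-m", "hob.judge", "--input-dir", "results/"])
    cmds.append(["python", "-m", "hob.analysis", "--input-dir", "results/"])
    return cmds

cmds = staged_commands(MODEL_KEYS)
print(len(cmds))  # → 16 (14 collect runs + judge + analysis)
```

Each entry in `cmds` could then be passed to subprocess.run (with PYTHONPATH=src in the environment, as in the quickstart).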
@article{li2026model,
title={The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning},
author={Li, Yubo and Zhang, Lu and Jiang, Tianchong and Krishnan, Ramayya and Padman, Rema},
journal={arXiv preprint arXiv:2603.29025},
year={2026}
}
Open an issue or discussion on the GitHub repository, or email the corresponding author: yubol@andrew.cmu.edu.
To add a new instance, append a JSON object to the instances array in data/hob_benchmark_full.json; the schema is documented in data/README.md.
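Appending an instance programmatically might look like the sketch below. The instances key comes from the sentence above; everything inside the new object is a placeholder, since the real schema lives in data/README.md. The demo writes to a throwaway copy, not the real data file.

```python
import json
import tempfile
from pathlib import Path

def add_instance(dataset_path, new_instance):
    """Append a new instance under the 'instances' key and rewrite the file."""
    path = Path(dataset_path)
    data = json.loads(path.read_text())
    data["instances"].append(new_instance)
    path.write_text(json.dumps(data, indent=2))
    return len(data["instances"])

# Demo against a temporary copy so the real benchmark file is untouched.
tmp = Path(tempfile.mkdtemp()) / "hob_benchmark_full.json"
tmp.write_text(json.dumps({"instances": []}))
n = add_instance(tmp, {"id": "hob-new-001"})  # placeholder fields only
print(n)  # → 1
```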