HOB contains ~500 instances spanning 15 heuristic × constraint cells across 7 everyday domains. Every conflict instance ships with a minimal pair, and a large share also carry strength and explicitness variants — enabling fine-grained diagnosis of when a model overrides a shortcut.
HOB is designed so that a correct answer requires overriding a salient surface heuristic in favour of an implicit feasibility constraint. Four design decisions isolate that override behaviour from surface comprehension and memorised solutions.
Each instance lives in exactly one heuristic × constraint cell. That forces us to ask a comparative question — does a model fail more on proximity shortcuts than on cost shortcuts? More on presence constraints than on capability constraints? — rather than just reporting an overall accuracy.
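A per-cell breakdown of this kind can be sketched as follows. This is a minimal illustration, not the benchmark's evaluation script: `cell` is a dataset column, but the `correct` flag, the `accuracy_by_cell` helper, and the `B2` cell id are assumptions made for the example.

```python
from collections import defaultdict


def accuracy_by_cell(rows):
    """Return {cell: fraction correct} for an iterable of scored rows."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in rows:
        totals[row["cell"]] += 1
        hits[row["cell"]] += int(row["correct"])
    return {cell: hits[cell] / totals[cell] for cell in totals}


# Hypothetical scored rows; "correct" is an assumed field added by a scorer,
# and "B2" is a placeholder cell id used only for illustration.
scored = [
    {"cell": "A1", "correct": True},
    {"cell": "A1", "correct": False},
    {"cell": "B2", "correct": True},
]
per_cell = accuracy_by_cell(scored)
print(per_cell)
```

Comparing the resulting per-cell numbers, rather than their average, is what answers the comparative question above.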
For every conflict instance we ship a near-identical pair in which the constraint is removed while the surface heuristic stays the same. In the pair, the shortcut answer is the correct answer, so comparing the two separates failures of constraint reasoning from failures of surface comprehension: a model that solves the pair but misses the conflict instance understood the scenario and failed only on the constraint.
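One way to read a conflict instance together with its pair is sketched below. The label names, the `diagnose` helper, and the outcome tuples are assumptions for illustration; the dataset does not prescribe this scheme.

```python
from collections import Counter


def diagnose(base_correct, pair_correct):
    """Label one conflict/pair result (hypothetical labelling scheme)."""
    if base_correct and pair_correct:
        return "pass"
    if pair_correct:
        # The scenario was understood, but the constraint was missed.
        return "constraint_failure"
    # The pair itself was missed, so the conflict result is hard to interpret.
    return "comprehension_failure"


# Hypothetical (base_correct, pair_correct) outcomes for four instances.
outcomes = [(True, True), (False, True), (False, False), (False, True)]
summary = Counter(diagnose(b, p) for b, p in outcomes)
print(summary)
```

Note that the last label also covers a model that answers the conflict instance correctly but misses the pair, which may indicate it over-applies the constraint rather than misreads the scenario.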
Many instances carry strong / medium / weak / inverted variants that turn the heuristic up or down (e.g. distance from 50 m to 20 miles). The inverted variant aligns heuristic with constraint, providing an easy sanity check. Together these trace a model's heuristic-sensitivity curve.
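A sensitivity curve of this kind could be tabulated roughly as below. The `strength` and `correct` field names, the level ordering, and the sample rows are assumptions; check the dataset card for the actual schema.

```python
from collections import defaultdict

# Assumed ordering: the inverted sanity check first, then increasing heuristic pull.
LEVELS = ["inverted", "weak", "medium", "strong"]


def sensitivity_curve(rows):
    """Return [(level, accuracy)] in LEVELS order, skipping absent levels."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in rows:
        totals[row["strength"]] += 1
        hits[row["strength"]] += int(row["correct"])
    return [(lvl, hits[lvl] / totals[lvl]) for lvl in LEVELS if lvl in totals]


# Hypothetical scored variants of conflict instances.
rows = [
    {"strength": "inverted", "correct": True},
    {"strength": "inverted", "correct": True},
    {"strength": "weak", "correct": True},
    {"strength": "weak", "correct": True},
    {"strength": "medium", "correct": True},
    {"strength": "medium", "correct": False},
    {"strength": "strong", "correct": False},
    {"strength": "strong", "correct": False},
]
curve = sensitivity_curve(rows)
print(curve)
```

In this made-up data, accuracy falls as the heuristic gets stronger; a low score on the inverted level would instead point to a problem unrelated to the override itself.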
A subset of instances carries implicit / hint / explicit variants that vary how obvious the constraint is, from fully unstated (the default) up to nearly spelled out. The gap between implicit and hint is one of HOB's sharpest diagnostics: when a model fails the implicit version but succeeds once given a hint, the knowledge is there and the bottleneck is inference.
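The implicit-to-hint gap can be measured per instance, for example as the share of implicit failures that a hint recovers. The sketch below assumes a hypothetical results mapping keyed by instance id; the ids, the layout, and the `hint_recovery_rate` helper are not part of the dataset.

```python
def hint_recovery_rate(results):
    """Fraction of implicit-variant failures that succeed on the hint variant.

    `results` maps an instance id to {"implicit": bool, "hint": bool}.
    Returns None when no instance failed the implicit variant.
    """
    failed_implicit = [r for r in results.values() if not r["implicit"]]
    if not failed_implicit:
        return None
    recovered = sum(1 for r in failed_implicit if r["hint"])
    return recovered / len(failed_implicit)


# Hypothetical per-instance outcomes.
results = {
    "ex-1": {"implicit": False, "hint": True},
    "ex-2": {"implicit": False, "hint": False},
    "ex-3": {"implicit": True, "hint": True},
    "ex-4": {"implicit": False, "hint": True},
}
rate = hint_recovery_rate(results)
print(rate)
```

A high rate suggests the model knows the constraint but does not infer it unprompted; a low rate suggests the knowledge itself may be missing.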
Green = populated in HOB · grey = not included (low naturalness).
Five cells are intentionally omitted because no natural scenario instantiates the pairing (e.g., a pure "cheap > presence" conflict). See the paper's appendix for the full naturalness rationale.
Each case below is a real HOB instance with its minimal pair and variants, verbatim from the dataset. Switch tabs to see how we dial the heuristic up or down, or make the hidden constraint progressively more explicit.
Explicitness variants are not available for this specific instance; in HOB, explicitness variants are sampled across a representative subset of the ~142 base instances (97 variants total).
The full benchmark ships as a single test split on Hugging Face. No special loader — just filter by column to recover sub-views.
# pip install datasets
from datasets import load_dataset

ds = load_dataset("yubol/Heuristic_Override_Benchmark", split="test")
print(ds)  # 500 rows × 19 columns

# Conflict instances only (the core benchmark)
conflicts = ds.filter(lambda r: r["instance_type"] == "base")

# One taxonomy cell (proximity × presence = car-wash style)
a1 = ds.filter(lambda r: r["cell"] == "A1")

# Minimal-pair companions (constraint removed)
pairs = ds.filter(lambda r: r["instance_type"] == "pair")
The full dataset card, field-by-field schema, and statistics are on 🤗 huggingface.co/datasets/yubol/Heuristic_Override_Benchmark. Evaluation scripts live in the GitHub repo.