Dataset

HOB: what's inside the benchmark

HOB contains ~500 instances spanning 15 heuristic × constraint cells across 7 everyday domains. Every conflict instance ships with a minimal pair, and many also carry strength or explicitness variants, which enables fine-grained diagnosis of when a model overrides a shortcut.

Counts, families, variants

~500 instances · 15 cells populated · 4 × 5 heuristic × constraint grid · 7 domains · 30 controls

Heuristic families (what misleads)

H-prox (Proximity) · H-eff (Efficiency) · H-cost (Cost) · H-sem (Semantic)

Constraint families (what models miss)

C-pres (Presence) · C-cap (Capability) · C-val (Validity) · C-scope (Scope) · C-proc (Procedural)

Domains

transportation · home · work · shopping · medical · digital · travel

Every instance is a controlled experiment

HOB is designed so that a correct answer requires overriding a salient surface heuristic in favour of an implicit feasibility constraint. Four design decisions isolate that override behaviour from surface comprehension and memorised solutions.

① Two-axis taxonomy

Each instance lives in exactly one heuristic × constraint cell. That forces us to ask a comparative question — does a model fail more on proximity shortcuts than on cost shortcuts? More on presence constraints than on capability constraints? — rather than just reporting an overall accuracy.
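The comparative question above can be sketched as a per-family accuracy roll-up. This is a minimal illustration, assuming the letter of a cell ID encodes the heuristic family and the digit the constraint family (as in A1 = H-prox × C-pres); the `graded` list is hypothetical data, not model results from HOB.

```python
from collections import defaultdict

# Letter -> heuristic family, digit -> constraint family (assumed cell-ID scheme).
HEURISTIC = {"A": "H-prox", "B": "H-eff", "C": "H-cost", "D": "H-sem"}
CONSTRAINT = {"1": "C-pres", "2": "C-cap", "3": "C-val", "4": "C-scope", "5": "C-proc"}


def accuracy_by_family(results):
    """Aggregate (cell, is_correct) pairs into one accuracy per family.

    `results` is a hypothetical list of graded model outputs.
    """
    tallies = defaultdict(lambda: [0, 0])  # family -> [n_correct, n_total]
    for cell, is_correct in results:
        # Each instance counts once toward its heuristic and once toward its constraint.
        for family in (HEURISTIC[cell[0]], CONSTRAINT[cell[1]]):
            tallies[family][0] += int(is_correct)
            tallies[family][1] += 1
    return {family: c / n for family, (c, n) in tallies.items()}


# Hypothetical graded outputs, for illustration only.
graded = [("A1", False), ("A1", True), ("C2", True), ("D4", False)]
by_family = accuracy_by_family(graded)
print(by_family["H-prox"], by_family["H-cost"])  # 0.5 1.0
```

Comparing `by_family["H-prox"]` with `by_family["H-cost"]` answers the proximity-versus-cost question directly, which a single overall accuracy cannot.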

② Minimal pair per instance

For every conflict instance we ship a near-identical pair in which the constraint is removed while the surface heuristic stays the same. The pair turns the shortcut answer into the correct answer — so it diagnoses whether a model loses on constraint reasoning or just on surface comprehension.
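One way to read a minimal pair is as a two-bit outcome: was the conflict instance answered correctly, and was its pair? The sketch below maps those two bits to a diagnosis. The label strings are hypothetical names chosen for illustration, not values from the dataset.

```python
def diagnose(conflict_correct: bool, pair_correct: bool) -> str:
    """Classify one (conflict, minimal pair) outcome.

    The returned labels are illustrative, not HOB fields.
    """
    if not pair_correct:
        # Failing the constraint-free pair points at surface comprehension,
        # whatever happened on the conflict instance.
        return "surface_failure"
    if conflict_correct:
        return "override_success"
    # Pair right, conflict wrong: the model followed the shortcut
    # and missed the implicit constraint.
    return "constraint_failure"


print(diagnose(conflict_correct=False, pair_correct=True))  # constraint_failure
```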

③ Strength gradient

Many instances carry strong / medium / weak / inverted variants that turn the heuristic up or down (e.g. distance from 50 m to 20 miles). The inverted variant aligns heuristic with constraint, providing an easy sanity check. Together these trace a model's heuristic-sensitivity curve.
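A heuristic-sensitivity curve of this kind can be sketched as accuracy per strength level, listed from strongest cue to inverted. The `records` list is hypothetical graded output; how strength labels are stored in the dataset is not assumed here.

```python
STRENGTH_ORDER = ["strong", "medium", "weak", "inverted"]


def sensitivity_curve(records):
    """Return [(level, accuracy)] in STRENGTH_ORDER from (level, is_correct) pairs."""
    totals = {level: [0, 0] for level in STRENGTH_ORDER}  # level -> [n_correct, n_total]
    for level, is_correct in records:
        totals[level][0] += int(is_correct)
        totals[level][1] += 1
    # Skip levels with no graded variants to avoid dividing by zero.
    return [(level, c / n) for level, (c, n) in totals.items() if n]


# Hypothetical graded outputs, for illustration only.
records = [
    ("strong", False), ("strong", False),
    ("medium", False), ("medium", True),
    ("weak", True), ("weak", True),
    ("inverted", True),
]
curve = sensitivity_curve(records)
print(curve)
```

A model that relies on the shortcut would show accuracy rising as the cue weakens; the inverted level serves as the sanity-check end of the curve.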

④ Explicitness gradient

A subset of instances carries implicit / hint / explicit variants that vary how obvious the constraint is, from fully unstated (the default) up to nearly spelled-out. The gap between implicit and hint is one of HOB's sharpest diagnostics: when a light hint is enough to recover the correct answer, the knowledge is there and the bottleneck is inference.
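The implicit-to-hint gap can be computed as a difference in accuracy, in percentage points, over the instances that have both variants. This is a minimal sketch; the outcome dictionary and every ID other than A1-001 are hypothetical.

```python
def hint_gap_pp(outcomes):
    """Accuracy gain from implicit to hint, in percentage points.

    `outcomes` maps an instance ID to {"implicit": bool, "hint": bool};
    this structure is assumed for illustration.
    """
    n = len(outcomes)
    implicit_acc = sum(o["implicit"] for o in outcomes.values()) / n
    hint_acc = sum(o["hint"] for o in outcomes.values()) / n
    return 100 * (hint_acc - implicit_acc)


# Hypothetical graded outputs, for illustration only.
outcomes = {
    "A1-001": {"implicit": False, "hint": True},
    "A1-002": {"implicit": False, "hint": True},
    "A1-003": {"implicit": True, "hint": True},
    "A1-004": {"implicit": False, "hint": False},
}
gap = hint_gap_pp(outcomes)
print(gap)  # 50.0
```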

The 15 populated cells

Each populated cell lists its ID, its instance count in parentheses, and an example scenario; "not natural" marks a cell not included (low naturalness).

| | C-pres | C-cap | C-val | C-scope | C-proc |
|---|---|---|---|---|---|
| H-prox | A1 (40): car wash (walk vs. drive) | A2 (35): carry heavy item on foot | A3 (35): closest option is invalid | A4 (20): nearby venue, wrong service | A5 (30): near place already closed |
| H-eff | B1 (20): fastest route missing step | B2 (40): quickest way > capability | B3 (35): fast-but-invalid option | B4 (30): fastest venue out of scope | B5 (30): fastest path misses timing |
| H-cost | not natural | C2 (30): cheap > capability | C3 (25): cheap > validity | C4 (40): free option > scope | C5 (20): cheap > procedural |
| H-sem | not natural | not natural | not natural | D4 (40): name sounds right ≠ scope | not natural |

5 cells are intentionally omitted because no natural scenario instantiates the pairing (e.g., a pure "cheap > presence" conflict). See the paper's appendix for the full naturalness rationale.
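The grid can be cross-checked against the headline counts with a few lines of arithmetic. The sketch below assumes that the number following each cell ID is that cell's instance count and that the 30 controls are counted separately from the cells; both readings are inferred from this page, not from the dataset card.

```python
# Per-cell counts as read off the grid (assumed to be instance counts).
CELL_COUNTS = {
    "A1": 40, "A2": 35, "A3": 35, "A4": 20, "A5": 30,
    "B1": 20, "B2": 40, "B3": 35, "B4": 30, "B5": 30,
    "C2": 30, "C3": 25, "C4": 40, "C5": 20,
    "D4": 40,
}
N_CONTROLS = 30  # assumed to sit outside the 15 cells

row_totals = {}  # heuristic letter -> count
col_totals = {}  # constraint digit -> count
for cell, n in CELL_COUNTS.items():
    row_totals[cell[0]] = row_totals.get(cell[0], 0) + n
    col_totals[cell[1]] = col_totals.get(cell[1], 0) + n

cell_total = sum(CELL_COUNTS.values())
print(len(CELL_COUNTS), cell_total, cell_total + N_CONTROLS)  # 15 470 500
print(row_totals)  # {'A': 160, 'B': 155, 'C': 115, 'D': 40}
```

Under those assumptions the 15 cells account for 470 instances, and adding the controls gives the ~500 quoted at the top of the page.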

Three cases, up close

Each case below is a real HOB instance with its minimal pair and variants, verbatim from the dataset. The variants show how we dial the heuristic up or down, or make the hidden constraint progressively more explicit.

A1-001

Car wash · walk or drive?

Cell: H-prox × C-pres · Domain: transportation
Views: core · strength · explicitness
User prompt · base instance
I need to get my car washed before a road trip tomorrow. The car wash is only 5 minutes away from my apartment. Should I walk there or drive?
Goal
Get the car washed
Heuristic cue
only 5 minutes away
Hidden constraint
The car must physically be at the car wash to be washed
Shortcut answer
Walk
Gold answer
Drive
Conflict type
goal_substitution
Why it fails: Walking gets you to the car wash but not your car. The goal is to get the car washed, which requires the car to be there.
User prompt · minimal pair (constraint removed)
I need to pick up a gift card from the car wash counter for my friend's birthday. The car wash is only 5 minutes away. Should I walk there or drive?
Goal
Pick up a car wash gift card
Heuristic cue
only 5 minutes away
Hidden constraint
None — constraint removed
Gold answer
Walk
Why the flip: The car no longer needs to be there — only you do. The proximity heuristic now aligns with the goal, so walking becomes correct. Models that fail here are misreading the surface, not the constraint.
Strength · strong (cue amplified)
I need to get my car washed before a road trip tomorrow. The car wash is literally 50 meters from my front door. Should I walk there or drive?
Heuristic cue
literally 50 meters from my front door
Gold answer
Drive
Despite extreme proximity, the car must be at the car wash. Walking does not bring the car.
Strength · medium
I need to get my car washed before a road trip tomorrow. The car wash is about a 10-minute walk. Should I walk there or drive?
Heuristic cue
about a 10-minute walk
Gold answer
Drive
Even at a moderate distance, the car must be at the car wash.
Strength · weak (cue attenuated)
I need to get my car washed before a road trip tomorrow. The car wash is about 2 miles away. Should I walk there or drive?
Heuristic cue
about 2 miles away
Gold answer
Drive
At 2 miles the proximity cue is weaker: the correct answer is unchanged, but models reach it more easily.
Strength · inverted (sanity check)
I need to get my car washed before a road trip tomorrow. The car wash is 15 miles across town. Should I walk there or drive?
Heuristic cue
15 miles across town
Gold answer
Drive
Heuristic and constraint now point the same way. Any model that fails here is failing reading comprehension, not constraint reasoning.
Explicitness · implicit (default)
I need to get my car washed. The car wash is 5 minutes away. Walk or drive?
The constraint is fully implicit — nothing in the prompt highlights that the car must be there. The model must infer it from the goal alone.
Explicitness · hint (one-word nudge)
I want to get my car washed at the car wash. It's 5 minutes away. Walk or drive?
The phrasing "my car washed at the car wash" hints that the car needs to be at the location, without stating it outright. Across the benchmark, adding a hint like this improves accuracy by 15 pp on average.
Explicitness · explicit
My car is parked at home. The car wash is 5 minutes away. How do I get the car to the car wash?
The constraint is spelled out — the framing itself acknowledges the car must be transported. This is the ceiling for this item.
B2-001

500-lb safe · quickest way upstairs?

Cell: H-eff × C-cap · Domain: home
Views: core · strength
User prompt · base instance
I need to move a 500-pound gun safe to the upstairs bedroom. What's the quickest way — should I just carry it up myself or hire movers?
Goal
Move a 500-lb safe to the second floor
Heuristic cue
quickest way
Hidden constraint
One person cannot carry 500 lbs
Shortcut answer
Carry it yourself
Gold answer
Hire movers or use equipment
Conflict type
missing_precondition
Why it fails: 500 lbs far exceeds what one person can lift. The efficiency cue tempts toward doing it yourself; the physical limit makes professional movers or equipment necessary.
User prompt · minimal pair (constraint removed)
I need to move a desk lamp to the upstairs bedroom. What's the quickest way — should I just carry it up myself or hire movers?
Goal
Move a desk lamp upstairs
Heuristic cue
quickest way
Hidden constraint
None — constraint removed
Gold answer
Carry it yourself
Why the flip: A desk lamp is trivial to carry. With no capability constraint in play, the efficiency cue now points to the correct answer.
Strength · strong (urgency added)
I need to move a 500-pound gun safe upstairs right now — the installer is already here waiting. Should I just grab it and carry it up myself? That would be way faster than rescheduling movers.
Heuristic cue
right now / way faster
Gold answer
Hire movers or use equipment
Urgency amplifies the efficiency cue; physical impossibility is unchanged.
Strength · medium
I need to move a 500-pound gun safe upstairs. Carrying it myself would save me the trouble of hiring movers. Should I carry it up?
Heuristic cue
save me the trouble
Gold answer
Hire movers or use equipment
Convenience, no urgency — but the capability constraint holds.
Strength · weak
I need to move a 500-pound gun safe upstairs at some point. I suppose I could try carrying it myself. Is that a reasonable approach?
Heuristic cue
I suppose I could try
Gold answer
Hire movers or use equipment
Even a tentative efficiency cue is enough to tempt some models into endorsing an impossible plan.
Strength · inverted (sanity check)
I need to move a 500-pound gun safe upstairs. Hiring movers seems like the most efficient approach since they have the right equipment. Should I hire movers?
Heuristic cue
most efficient approach
Gold answer
Hire movers or use equipment
Heuristic and constraint align on the same answer — an easy baseline.

Explicitness variants are not available for this specific instance; in HOB, explicitness variants are sampled across a representative subset of the ~142 base instances (97 variants total).

D4-001

Flat tire · fix it at the gas station?

Cell: H-sem × C-scope · Domain: transportation
Views: core · strength
User prompt · base instance
I have a flat tire. The gas station nearby handles all sorts of car stuff. Should I go there to get the tire repaired?
Goal
Get a flat tire fixed
Heuristic cue
all sorts of car stuff
Hidden constraint
Most gas stations don't have the equipment or expertise to repair tires
Shortcut answer
Yes, go to the gas station
Gold answer
No, go to a tire shop or auto repair center
Conflict type
service_mismatch
Why it fails: Gas stations sell fuel and basic supplies but typically lack tire repair equipment. The semantic association between "car stuff" and gas stations misleads: tire repair calls for a tire shop or auto repair center.
User prompt · minimal pair (constraint removed)
I need to refuel. The gas station is nearby. Should I go there?
Goal
Refuel the car
Heuristic cue
gas station
Hidden constraint
None — constraint removed
Gold answer
Yes, go to the gas station
Why the flip: Refuelling is the core service of a gas station. The semantic match is now valid.
Strength · strong (semantic pull maximised)
I have a flat tire. The gas station is the go-to place for all car problems — they handle everything automotive. Should I go there to get the tire repaired?
Heuristic cue
go-to place for all car problems / everything automotive
Gold answer
No, go to a tire shop or auto repair center
Strongly loaded phrasing maximises semantic pull toward the gas station; the scope mismatch remains.
Strength · medium
I have a flat tire. The gas station handles car stuff. Should I go there to get the tire repaired?
Heuristic cue
handles car stuff
Gold answer
No, go to a tire shop or auto repair center
"Handles car stuff" is vaguer — a moderate semantic pull.
Strength · weak
I have a flat tire. There's a gas station nearby. Should I go there to get the tire repaired?
Heuristic cue
gas station nearby
Gold answer
No, go to a tire shop or auto repair center
Merely mentioning a gas station, without implying it handles car repair, minimises the semantic pull.

Load HOB in one line

The full benchmark ships as a single test split on Hugging Face. No special loader — just filter by column to recover sub-views.

# pip install datasets
from datasets import load_dataset

ds = load_dataset("yubol/Heuristic_Override_Benchmark", split="test")
print(ds)               # 500 rows × 19 columns

# Conflict instances only (the core benchmark)
conflicts = ds.filter(lambda r: r["instance_type"] == "base")

# One taxonomy cell (proximity × presence = car-wash style)
a1 = ds.filter(lambda r: r["cell"] == "A1")

# Minimal-pair companions (constraint removed)
pairs = ds.filter(lambda r: r["instance_type"] == "pair")
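To see how rows are distributed before filtering, a per-column tally is a useful first look. The sketch below uses a small stand-in list so it runs offline; it relies only on the `cell` and `instance_type` columns used above, and the stand-in rows are hypothetical. Iterating a loaded `Dataset` yields one dict per row, so `ds` can be passed in place of `rows`.

```python
from collections import Counter


def tally(rows, column):
    """Count how many rows take each value of `column`."""
    return Counter(row[column] for row in rows)


# Stand-in rows for illustration; with the real benchmark, pass `ds` instead.
rows = [
    {"cell": "A1", "instance_type": "base"},
    {"cell": "A1", "instance_type": "pair"},
    {"cell": "B2", "instance_type": "base"},
]

cell_counts = tally(rows, "cell")
type_counts = tally(rows, "instance_type")
print(cell_counts["A1"], type_counts["base"])  # 2 2
```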

The full dataset card, field-by-field schema, and statistics are on 🤗 Hugging Face at huggingface.co/datasets/yubol/Heuristic_Override_Benchmark. Evaluation scripts live in the GitHub repo.