Dataset — HOB · Heuristic Override Benchmark

01 · At a glance

Counts, families, variants

~500 · Instances

15 · Cells populated

4 × 5 · Heuristic × Constraint

7 · Domains

30 · Controls

Heuristic families (what misleads)

H-prox · Proximity
H-eff · Efficiency
H-cost · Cost
H-sem · Semantic

Constraint families (what models miss)

C-pres · Presence
C-cap · Capability
C-val · Validity
C-scope · Scope
C-proc · Procedural

Domains

transportation · home · work · shopping · medical · digital · travel

02 · Design logic

Every instance is a controlled experiment

HOB is designed so that a correct answer requires overriding a salient surface heuristic in favour of an implicit feasibility constraint. Four design decisions isolate that override behaviour from surface comprehension and memorised solutions.

① Two-axis taxonomy

Each instance lives in exactly one heuristic × constraint cell. That lets us ask comparative questions (does a model fail more on proximity shortcuts than on cost shortcuts? more on presence constraints than on capability constraints?) rather than only reporting an overall accuracy.
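A per-axis breakdown of this kind can be sketched as follows. This is a minimal illustration, not the benchmark's evaluation script: the letter and digit mappings are read off the cell IDs shown on this page (for example, A1 is tagged H-prox and C-pres), and the `correct` field is a hypothetical boolean that your own grader would supply.

```python
from collections import defaultdict

# Row letter -> heuristic family, column digit -> constraint family.
# Assumption: inferred from the cell IDs on this page (A1 = H-prox x C-pres).
HEURISTIC = {"A": "H-prox", "B": "H-eff", "C": "H-cost", "D": "H-sem"}
CONSTRAINT = {"1": "C-pres", "2": "C-cap", "3": "C-val",
              "4": "C-scope", "5": "C-proc"}


def accuracy_by_axis(results):
    """Return accuracy per heuristic family and per constraint family.

    results: iterable of dicts with 'cell' (e.g. 'A1') and a hypothetical
    boolean 'correct' produced by your own grading step.
    """
    # Each entry holds [number correct, number seen].
    totals = {"heuristic": defaultdict(lambda: [0, 0]),
              "constraint": defaultdict(lambda: [0, 0])}
    for r in results:
        keys = {"heuristic": HEURISTIC[r["cell"][0]],
                "constraint": CONSTRAINT[r["cell"][1]]}
        for axis, key in keys.items():
            totals[axis][key][0] += int(r["correct"])
            totals[axis][key][1] += 1
    return {axis: {k: hits / n for k, (hits, n) in fam.items()}
            for axis, fam in totals.items()}


# Tiny made-up example, not real benchmark results.
demo = [
    {"cell": "A1", "correct": False},
    {"cell": "A2", "correct": True},
    {"cell": "C2", "correct": True},
]
report = accuracy_by_axis(demo)
```

Comparing `report["heuristic"]` entries against each other, and likewise `report["constraint"]` entries, gives the two comparative views the taxonomy is meant to support.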

② Minimal pair per instance

For every conflict instance we ship a near-identical pair in which the constraint is removed while the surface heuristic stays the same. The pair turns the shortcut answer into the correct answer — so it diagnoses whether a model loses on constraint reasoning or just on surface comprehension.
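The diagnostic logic of a minimal pair can be written down as a small decision rule. The sketch below is one possible reading of it; the three label names are illustrative and do not come from the dataset.

```python
from collections import Counter


def diagnose_pair(conflict_correct: bool, pair_correct: bool) -> str:
    """Classify one (conflict instance, minimal pair) outcome.

    Label names are hypothetical, chosen for illustration only.
    """
    if not pair_correct:
        # Wrong even when the shortcut answer is the right one:
        # the problem is reading the prompt, not the constraint.
        return "surface_comprehension"
    if not conflict_correct:
        # Reads the surface correctly but follows the shortcut
        # when the hidden constraint should override it.
        return "constraint_reasoning"
    return "override_success"


# Made-up outcomes for three instance pairs.
outcomes = [(True, True), (False, True), (False, False)]
summary = Counter(diagnose_pair(c, p) for c, p in outcomes)
```

A model whose errors fall mostly under `constraint_reasoning` is the case the benchmark is designed to expose.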

③ Strength gradient

Many instances carry strong / medium / weak / inverted variants that turn the heuristic up or down (e.g. distance from 50 m to 20 miles). The inverted variant aligns heuristic with constraint, providing an easy sanity check. Together these trace a model's heuristic-sensitivity curve.
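One way to trace such a curve is to compute accuracy at each strength level in a fixed order. The field names `strength` and `correct` below are assumptions made for this sketch; check the dataset card for the actual column names.

```python
# Order the levels from strongest heuristic pull to the inverted sanity check.
STRENGTH_ORDER = ["strong", "medium", "weak", "inverted"]


def sensitivity_curve(results):
    """Return [(strength, accuracy), ...] in STRENGTH_ORDER.

    results: iterable of dicts with hypothetical keys 'strength'
    (one of STRENGTH_ORDER) and 'correct' (bool from your grader).
    Levels with no graded items are skipped.
    """
    hits = {s: 0 for s in STRENGTH_ORDER}
    counts = {s: 0 for s in STRENGTH_ORDER}
    for r in results:
        hits[r["strength"]] += int(r["correct"])
        counts[r["strength"]] += 1
    return [(s, hits[s] / counts[s]) for s in STRENGTH_ORDER if counts[s]]


# Made-up grades for illustration only.
demo = [
    {"strength": "strong", "correct": False},
    {"strength": "strong", "correct": False},
    {"strength": "weak", "correct": True},
    {"strength": "weak", "correct": False},
    {"strength": "inverted", "correct": True},
]
curve = sensitivity_curve(demo)
```

A curve that rises steeply from strong to weak suggests the model's answers track the surface cue rather than the constraint.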

④ Explicitness gradient

A subset of instances carries implicit / hint / explicit variants that vary how obvious the constraint is, from fully unstated (the default) up to nearly spelled out. The gap between implicit and hint is one of HOB's sharpest diagnostics: when a small hint is enough to recover the right answer, the model already has the knowledge and the bottleneck is inference.
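The implicit-to-hint gap can be measured as a paired difference over items that have both variants. The sketch below assumes hypothetical field names (`base_id`, `explicitness`, `correct`) and is not taken from the official evaluation code.

```python
def hint_gain_pp(results):
    """Mean accuracy gain from implicit to hint, in percentage points.

    results: iterable of dicts with hypothetical keys 'base_id',
    'explicitness' ('implicit' | 'hint' | 'explicit') and 'correct' (bool).
    Only items graded on both implicit and hint variants are counted.
    Returns None when no such paired item exists.
    """
    by_item = {}
    for r in results:
        by_item.setdefault(r["base_id"], {})[r["explicitness"]] = int(r["correct"])
    paired = [v for v in by_item.values() if "implicit" in v and "hint" in v]
    if not paired:
        return None
    mean_gain = sum(v["hint"] - v["implicit"] for v in paired) / len(paired)
    return 100.0 * mean_gain


# Made-up grades: item x improves with the hint, item y was already right,
# item z has no hint variant and is ignored.
demo = [
    {"base_id": "x", "explicitness": "implicit", "correct": False},
    {"base_id": "x", "explicitness": "hint", "correct": True},
    {"base_id": "y", "explicitness": "implicit", "correct": True},
    {"base_id": "y", "explicitness": "hint", "correct": True},
    {"base_id": "z", "explicitness": "implicit", "correct": False},
]
gain = hint_gain_pp(demo)
```

Pairing by item keeps the comparison fair when only a subset of instances carries explicitness variants.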

The 15 populated cells

Each populated cell is listed as cell ID · instance count · example scenario. "Not natural" marks a pairing that is not included in HOB (low naturalness).

H-prox

A1 · 40 · car wash (walk vs. drive)
A2 · 35 · carry heavy item on foot
A3 · 35 · closest option is invalid
A4 · 20 · nearby venue, wrong service
A5 · 30 · near place already closed

H-eff

B1 · 20 · fastest route missing step
B2 · 40 · quickest way > capability
B3 · 35 · fast-but-invalid option
B4 · 30 · fastest venue out of scope
B5 · 30 · fastest path misses timing

H-cost

C1 · not natural
C2 · 30 · cheap > capability
C3 · 25 · cheap > validity
C4 · 40 · free option > scope
C5 · 20 · cheap > procedural

H-sem

D1–D3 · not natural
D4 · 40 · name sounds right ≠ scope
D5 · not natural

Five cells are intentionally omitted because no natural scenario instantiates the pairing (e.g., a pure "cheap > presence" conflict). See the paper's appendix for the full naturalness rationale.
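As a quick consistency check on the grid, the per-cell numbers can be summed. This sketch assumes the number shown in each cell is its instance count; under that reading, the cells plus the 30 controls add up to the headline figure of roughly 500.

```python
# Per-cell numbers transcribed from the grid above.
# Assumption: each number is the count of instances in that cell.
CELL_COUNTS = {
    "A1": 40, "A2": 35, "A3": 35, "A4": 20, "A5": 30,
    "B1": 20, "B2": 40, "B3": 35, "B4": 30, "B5": 30,
    "C2": 30, "C3": 25, "C4": 40, "C5": 20,
    "D4": 40,
}
CONTROLS = 30  # from the "At a glance" summary

populated = len(CELL_COUNTS)                 # 15
omitted = 4 * 5 - populated                  # 5
in_cells = sum(CELL_COUNTS.values())         # 470
grand_total = in_cells + CONTROLS            # 500

print(populated, omitted, in_cells, grand_total)
```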

03 · Featured cases

Three cases, up close

Each case below is a real HOB instance with its minimal pair and variants, verbatim from the dataset. The strength variants show how we dial the heuristic up or down; the explicitness variants make the hidden constraint progressively more explicit.

A1-001

Car wash · walk or drive?

H-prox · C-pres · transportation

core · strength · explicitness

User prompt · base instance

I need to get my car washed before a road trip tomorrow. The car wash is only 5 minutes away from my apartment. Should I walk there or drive?

Goal: Get the car washed
Heuristic cue: only 5 minutes away
Hidden constraint: The car must physically be at the car wash to be washed
Shortcut answer: Walk
Gold answer: Drive
Conflict type: goal_substitution

Why it fails: Walking gets you to the car wash but not your car. The goal is to get the car washed, which requires the car to be there.

User prompt · minimal pair (constraint removed)

I need to pick up a gift card from the car wash counter for my friend's birthday. The car wash is only 5 minutes away. Should I walk there or drive?

Goal: Pick up a car wash gift card
Heuristic cue: only 5 minutes away
Hidden constraint: None — constraint removed
Gold answer: Walk

Why the flip: The car no longer needs to be there — only you do. The proximity heuristic now aligns with the goal, so walking becomes correct. Models that fail here are misreading the surface, not the constraint.

Strength · strong (cue amplified)

I need to get my car washed before a road trip tomorrow. The car wash is literally 50 meters from my front door. Should I walk there or drive?

Heuristic cue: literally 50 meters from my front door
Gold answer: Drive

Despite extreme proximity, the car must be at the car wash. Walking does not bring the car.

Strength · medium

I need to get my car washed before a road trip tomorrow. The car wash is about a 10-minute walk. Should I walk there or drive?

Heuristic cue: about a 10-minute walk
Gold answer: Drive

Even at a moderate distance, the car must be at the car wash.

Strength · weak (cue attenuated)

I need to get my car washed before a road trip tomorrow. The car wash is about 2 miles away. Should I walk there or drive?

Heuristic cue: about 2 miles away
Gold answer: Drive

At 2 miles, the proximity heuristic is weaker — the correct answer is the same but models find it more easily.

Strength · inverted (sanity check)

I need to get my car washed before a road trip tomorrow. The car wash is 15 miles across town. Should I walk there or drive?

Heuristic cue: 15 miles across town
Gold answer: Drive

Heuristic and constraint now point the same way. Any model that fails here is failing reading comprehension, not constraint reasoning.

Explicitness · implicit (default)

I need to get my car washed. The car wash is 5 minutes away. Walk or drive?

The constraint is fully implicit — nothing in the prompt highlights that the car must be there. The model must infer it from the goal alone.

Explicitness · hint (one-word nudge)

I want to get my car washed at the car wash. It's 5 minutes away. Walk or drive?

The phrasing "my car washed at the car wash" hints that the car needs to be at the location, without stating it outright. Across the benchmark, adding a hint like this recovers 15 percentage points (pp) of accuracy on average.

Explicitness · explicit

My car is parked at home. The car wash is 5 minutes away. How do I get the car to the car wash?

The constraint is spelled out — the framing itself acknowledges the car must be transported. This is the ceiling for this item.

B2-001

500-lb safe · quickest way upstairs?

H-eff · C-cap · home

core · strength

User prompt · base instance

I need to move a 500-pound gun safe to the upstairs bedroom. What's the quickest way — should I just carry it up myself or hire movers?

Goal: Move a 500-lb safe to the second floor
Heuristic cue: quickest way
Hidden constraint: One person cannot carry 500 lbs
Shortcut answer: Carry it yourself
Gold answer: Hire movers or use equipment
Conflict type: missing_precondition

Why it fails: 500 lbs far exceeds what one person can lift. The efficiency cue tempts toward doing it yourself; the physical limit makes professional movers or equipment necessary.

User prompt · minimal pair (constraint removed)

I need to move a desk lamp to the upstairs bedroom. What's the quickest way — should I just carry it up myself or hire movers?

Goal: Move a desk lamp upstairs
Heuristic cue: quickest way
Hidden constraint: None — constraint removed
Gold answer: Carry it yourself

Why the flip: A desk lamp is trivial to carry. With no capability constraint in play, the efficiency cue now points to the correct answer.

Strength · strong (urgency added)

I need to move a 500-pound gun safe upstairs right now — the installer is already here waiting. Should I just grab it and carry it up myself? That would be way faster than rescheduling movers.

Heuristic cue: right now / way faster
Gold answer: Hire movers or use equipment

Urgency amplifies the efficiency cue; physical impossibility is unchanged.

Strength · medium

I need to move a 500-pound gun safe upstairs. Carrying it myself would save me the trouble of hiring movers. Should I carry it up?

Heuristic cue: save me the trouble
Gold answer: Hire movers or use equipment

Convenience, no urgency — but the capability constraint holds.

Strength · weak

I need to move a 500-pound gun safe upstairs at some point. I suppose I could try carrying it myself. Is that a reasonable approach?

Heuristic cue: I suppose I could try
Gold answer: Hire movers or use equipment

Even a tentative efficiency cue is enough to tempt some models into endorsing an impossible plan.

Strength · inverted (sanity check)

I need to move a 500-pound gun safe upstairs. Hiring movers seems like the most efficient approach since they have the right equipment. Should I hire movers?

Heuristic cue: most efficient approach
Gold answer: Hire movers or use equipment

Heuristic and constraint align on the same answer — an easy baseline.

This instance has no explicitness variants. In HOB, explicitness variants are provided only for a representative subset of the ~142 base instances (97 variants in total).

D4-001

Flat tire · fix it at the gas station?

H-sem · C-scope · transportation

core · strength

User prompt · base instance

I have a flat tire. The gas station nearby handles all sorts of car stuff. Should I go there to get the tire repaired?

Goal: Get a flat tire fixed
Heuristic cue: all sorts of car stuff
Hidden constraint: Most gas stations don't have the equipment or expertise to repair tires
Shortcut answer: Yes, go to the gas station
Gold answer: No, go to a tire shop or auto repair center
Conflict type: service_mismatch

Why it fails: Gas stations sell fuel and basic supplies but typically lack tire repair equipment. The semantic association between "car stuff" and gas stations misleads: tire repair calls for a tire shop or auto repair center.

User prompt · minimal pair (constraint removed)

I need to refuel. The gas station is nearby. Should I go there?

Goal: Refuel the car
Heuristic cue: gas station
Hidden constraint: None — constraint removed
Gold answer: Yes, go to the gas station

Why the flip: Refuelling is the core service of a gas station. The semantic match is now valid.

Strength · strong (semantic pull maximised)

I have a flat tire. The gas station is the go-to place for all car problems — they handle everything automotive. Should I go there to get the tire repaired?

Heuristic cue: go-to place for all car problems / everything automotive
Gold answer: No, go to a tire shop or auto repair center

Strongly loaded phrasing maximises semantic pull toward the gas station; the scope mismatch remains.

Strength · medium

I have a flat tire. The gas station handles car stuff. Should I go there to get the tire repaired?

Heuristic cue: handles car stuff
Gold answer: No, go to a tire shop or auto repair center

"Handles car stuff" is vaguer — a moderate semantic pull.

Strength · weak

I have a flat tire. There's a gas station nearby. Should I go there to get the tire repaired?

Heuristic cue: gas station nearby
Gold answer: No, go to a tire shop or auto repair center

Merely mentioning a gas station, without implying it handles car repair, minimises the semantic pull.

04 · Quick start

Load HOB in one line

The full benchmark ships as a single test split on Hugging Face. No special loader — just filter by column to recover sub-views.

# pip install datasets
from datasets import load_dataset

ds = load_dataset("yubol/Heuristic_Override_Benchmark", split="test")
print(ds)               # 500 rows × 19 columns

# Conflict instances only (the core benchmark)
conflicts = ds.filter(lambda r: r["instance_type"] == "base")

# One taxonomy cell (proximity × presence = car-wash style)
a1 = ds.filter(lambda r: r["cell"] == "A1")

# Minimal-pair companions (constraint removed)
pairs = ds.filter(lambda r: r["instance_type"] == "pair")
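To see how the rows are distributed before filtering, a simple tally over the two columns used above can help. This is a sketch that relies only on the `cell` and `instance_type` columns already referenced; the demo rows are made up so the snippet runs without downloading anything, and in practice you would pass `ds` itself, since iterating a `datasets.Dataset` yields one dict per row.

```python
from collections import Counter


def tally(rows):
    """Count rows per (cell, instance_type) combination.

    rows: any iterable of dict-like rows with 'cell' and 'instance_type'.
    """
    return Counter((r["cell"], r["instance_type"]) for r in rows)


# Made-up rows for illustration; replace with `ds` after loading.
demo_rows = [
    {"cell": "A1", "instance_type": "base"},
    {"cell": "A1", "instance_type": "base"},
    {"cell": "A1", "instance_type": "pair"},
    {"cell": "B2", "instance_type": "base"},
]
counts = tally(demo_rows)
for (cell, kind), n in sorted(counts.items()):
    print(f"{cell:>3} {kind:<6} {n}")
```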

Full dataset card, field-by-field schema, and statistics on 🤗 huggingface.co/datasets/yubol/Heuristic_Override_Benchmark. Evaluation scripts live in the GitHub repo.