Our investigation follows a diagnose–measure–bridge–treat arc: mechanistic analysis of the car-wash failure, systematic benchmarking across heuristic and constraint types, parametric probes testing generality, and a mitigation experiment.
We ask what the model relies on (span-level occlusion) and how it uses it (monotonicity curves).
We define a scalar decision score extracted via anchored teacher-forced scoring:
s(x) = log p("Walk" | x) − log p("Drive" | x)

Higher s(x) ⇒ the model prefers WALK; lower ⇒ it prefers DRIVE.
A fixed anchor (`\nFinal:`) is appended after the generation prefix to create a deterministic scoring position. For multi-token candidates, log-probabilities are aggregated via log-sum-exp across tokenisation variants. Because scoring is teacher-forced at a fixed position, with no sampling, it is exactly reproducible.
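A minimal sketch of the scoring procedure, assuming a hypothetical `lp(prefix, candidate)` helper that returns the summed log-probability of the candidate's tokens given the prefix (the tokenisation variants shown are illustrative, not the actual set):

```python
import math

def logsumexp(vals):
    # Numerically stable log-sum-exp, used to aggregate tokenisation variants.
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def decision_score(lp, prompt, anchor="\nFinal:"):
    """s(x) = log p("Walk" | x) - log p("Drive" | x), teacher-forced at a fixed anchor.

    lp(prefix, candidate) -> summed log-probability of candidate's tokens
    (assumed helper standing in for a real model API).
    """
    prefix = prompt + anchor
    walk = ["Walk", " Walk"]      # illustrative tokenisation variants
    drive = ["Drive", " Drive"]
    return (logsumexp([lp(prefix, c) for c in walk])
            - logsumexp([lp(prefix, c) for c in drive]))
```

Since the anchor fixes the scoring position and no sampling is involved, the score is deterministic for a fixed model and prompt.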
To identify which input component drives the decision, we perturb each span independently and measure the change in decision score:
A(z) = s(occ(x, z)) − s(x)
We apply three occlusion operators — mask, neutral, and contradict — at three levels (sentence, span, token) and require agreement across all three operators to control for out-of-distribution artefacts.
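A sketch of the attribution and agreement check (the `score` and `occ` callables are placeholders for the decision score and an occlusion operator):

```python
def attribution(score, occ, x, span):
    """A(z) = s(occ(x, z)) - s(x): change in decision score when span z is perturbed."""
    return score(occ(x, span)) - score(x)

def operators_agree(effects, tol=1e-6):
    """True iff all operators (mask / neutral / contradict) move the score in the
    same nonzero direction; disagreement flags out-of-distribution artefacts."""
    signs = {1 if e > tol else -1 if e < -tol else 0 for e in effects}
    return len(signs) == 1 and 0 not in signs
```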
| Metric | Formula | Interpretation |
|---|---|---|
| Heuristic Dominance Ratio (HDR) | \|A(H)\| / \|A(G)\| | HDR > 1 ⇒ heuristic more influential than goal |
| Constraint Sensitivity Index (CSI) | \|A(G)\| | how much the model's decision moves when the goal span is perturbed |
| Distance Sensitivity Index (DSI) | \|A(H)\| | how much the decision moves when the heuristic span is perturbed |
We sweep distance d over 14 log-spaced values (10 m – 100 km) in a conflict condition (car wash: Drive always correct) and a control condition (coffee shop: answer depends on distance), sampling T=5 from 7 templates per point (140 prompts per model). Correct reasoning produces a flat conflict curve and a sigmoid control; a pure heuristic produces two near-identical sigmoids.
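The 14-point distance grid can be reproduced with a geometric progression (a sketch; endpoints of 10 m and 100 km in metres follow the stated range):

```python
def log_spaced(lo, hi, n):
    """n log-spaced values from lo to hi inclusive (geometric progression)."""
    ratio = (hi / lo) ** (1.0 / (n - 1))
    return [lo * ratio**i for i in range(n)]

distances_m = log_spaced(10, 100_000, 14)  # 10 m .. 100 km
```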
HOB is organised along two dimensions — 4 heuristic families × 5 constraint families — of which 15 cells are populated, across ~500 instances and 7 domains.
| Family | Pattern | Typical cues |
|---|---|---|
| H-prox (Proximity) | Closer → better | "5 min away," "next door" |
| H-eff (Efficiency) | Faster → better | "quickest way," "saves time" |
| H-cost (Cost) | Cheaper → better | "free option," "saves money" |
| H-sem (Semantic) | Name sounds right → viable | "gas station" for tires |
| Family | Definition | Example |
|---|---|---|
| C-pres (Presence) | Object must be at the destination | Car must be at the car wash |
| C-cap (Capability) | The means cannot perform the task | Can't carry a sofa on foot |
| C-val (Validity) | A precondition is violated | Can't drive with a flat tire |
| C-scope (Scope) | The service can't fulfil the goal | Gas station won't fix tires |
| C-proc (Procedural) | A step or timing requirement is not met | Store already closed |
Every instance has a matched variant where the constraint is removed (e.g., "get my car washed" → "pick up a car-wash gift card"), isolating constraint reasoning from surface comprehension.
Each instance is rendered at three levels: implicit (no cue), hint (one extra word), and explicit — a controlled measure of the inference bottleneck.
Three graded versions (strong / medium / weak) let us measure how reliably the heuristic can be overcome across cue salience.
We extend the parametric sweep framework to three additional H × C combinations, testing all six Study 1 models with T=10 trials per grid point (840 prompts per model).
- Cost sweep: copy shop vs. courthouse for certified documents, \$0–\$500.
- Time-advantage sweep: carrying a 500-lb safe yourself vs. hiring movers, 1 min – 8 h.
- Semantic-similarity sweep: gas-station descriptions from "small convenience store" to "full-service car-care center" for flat-tire repair.
- Distance sweep with a capability constraint: carrying a sofa home, where walking is physically infeasible regardless of distance.
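A sketch of how one sweep's prompt set is enumerated (the template set and value grid are placeholders; the section specifies T=10 trials per grid point):

```python
def sweep_prompts(grid_values, templates, trials=10):
    """Cross each swept value with each template, repeating each combination
    `trials` times (T=10 trials per grid point in this study)."""
    return [(value, template, trial)
            for value in grid_values
            for template in templates
            for trial in range(trials)]
```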
If the knowledge is present but not activated, can we make the model self-generate the hint?
Prepended before the HOB question, with no other changes:

> SYSTEM: "Before answering, list the necessary conditions that must be true for the stated goal to be accomplished. Then answer the question."
We re-evaluate three models spanning the performance range on all 500 HOB instances (N=10 trials each), comparing against zero-shot baselines; results are reported on the results page.
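The intervention can be applied mechanically as a system-message wrapper (the chat-message schema shown is an assumed format, not specified in the source):

```python
PROBE = ("Before answering, list the necessary conditions that must be true "
         "for the stated goal to be accomplished. Then answer the question.")

def with_condition_probe(question):
    """Prepend the elicitation instruction; the HOB question itself is unchanged."""
    return [
        {"role": "system", "content": PROBE},
        {"role": "user", "content": question},
    ]
```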