Method

A four-stage framework

Our investigation follows a diagnose–measure–bridge–treat arc: mechanistic analysis of the car-wash failure, systematic benchmarking across heuristic and constraint types, parametric probes testing generality, and a mitigation experiment.

Mechanistic analysis of the car-wash problem

We ask what the model relies on (span-level occlusion) and how it uses it (monotonicity curves).

Decision score

We define a scalar decision score extracted via anchored teacher-forced scoring:

// Higher = model prefers WALK; lower = model prefers DRIVE
s(x) = log p("Walk" | x) - log p("Drive" | x)

A fixed anchor (\nFinal:) is appended after the generation prefix to create a deterministic scoring position. For multi-token candidates, log-probabilities are aggregated via log-sum-exp across tokenisation variants. Scoring is exactly reproducible.

Causal occlusion

To identify which input component drives the decision, we perturb each span independently and measure the change in decision score:

A(z) = s(occ(x, z)) − s(x)

We apply three operators — mask neutral contradict — at three levels (sentence, span, token) and require agreement across all three to control for out-of-distribution artefacts.

HDR

|A(H)| / |A(G)|

Heuristic Dominance Ratio. HDR > 1 ⇒ heuristic more influential than goal.

CSI

|A(G)|

Constraint Sensitivity Index — how much the model's decision moves when the goal span is perturbed.

DSI

|A(H)|

Distance Sensitivity Index — how much the decision moves when the heuristic span is perturbed.

Monotonicity curves

We sweep distance d over 14 log-spaced values (10 m – 100 km) in a conflict condition (car wash: Drive always correct) and a control condition (coffee shop: answer depends on distance), sampling T=5 from 7 templates per point (140 prompts per model). Correct reasoning produces a flat conflict curve and a sigmoid control; a pure heuristic produces two near-identical sigmoids.

The HOB benchmark

HOB is organised along two dimensions: 4 heuristic families × 5 constraint families, yielding 15 populated cells across ~500 instances and 7 domains.

Heuristic families

FamilyPatternTypical cues
H-prox ProximityCloser → better"5 min away," "next door"
H-eff EfficiencyFaster → better"quickest way," "saves time"
H-cost CostCheaper → better"free option," "saves money"
H-sem SemanticName sounds right → viable"gas station" for tires

Constraint families

FamilyDefinitionExample
C-pres PresenceObject must be at destinationCar must be at the car wash
C-cap CapabilityMeans cannot do the taskCan't carry a sofa on foot
C-val ValidityPrecondition is violatedCan't drive with a flat tire
C-scope ScopeService can't fulfil the goalGas station won't fix tires
C-proc ProceduralStep or timing not metStore already closed

Design principles

Minimal pairs

Every instance has a matched variant where the constraint is removed (e.g., "get my car washed" → "pick up a car-wash gift card"), isolating constraint reasoning from surface comprehension.

Explicitness gradient

Each instance is rendered at three levels: implicit (no cue), hint (one extra word), and explicit — a controlled measure of the inference bottleneck.

Heuristic strength

Three graded versions (strong / medium / weak) let us measure how reliably the heuristic can be overcome across cue salience.

Parametric probes — does the mechanism generalise?

We extend the parametric sweep framework to three additional H × C combinations, testing all six Study 1 models with T=10 trials per grid point (840 prompts per model).

H-cost × C-scope

Cost sweep: copy shop vs. courthouse for certified documents. \$0–\$500.

H-eff × C-cap

Time-advantage sweep: carrying a 500-lb safe yourself vs. hiring movers. 1 min – 8 h.

H-sem × C-scope

Semantic-similarity sweep: gas-station descriptions from "small convenience store" to "full-service car-care center" for flat-tire repair.

H-prox × C-cap

Distance sweep with a capability constraint: carrying a sofa home (where walking is physically infeasible regardless of distance).

Goal-decomposition prompting

If the knowledge is present but not activated, can we make the model self-generate the hint?

// Prepended before the HOB question, no other changes.
SYSTEM: "Before answering, list the necessary conditions that must be
          true for the stated goal to be accomplished. Then answer the question."

We re-evaluate three models spanning the performance range on all 500 HOB instances (N=10 trials each), comparing against zero-shot baselines. Results on the results page.