Our investigation follows a diagnose–measure–bridge–treat arc: mechanistic analysis of the car-wash failure, systematic benchmarking across heuristic and constraint types, parametric probes testing generality, and a mitigation experiment.
We ask what the model relies on (span-level occlusion) and how it uses it (monotonicity curves).
We define a scalar decision score extracted via anchored teacher-forced scoring:
s(x) = log p("Walk" | x) − log p("Drive" | x)

Higher s(x) ⇒ the model prefers WALK; lower ⇒ it prefers DRIVE.
A fixed anchor (`\nFinal:`) is appended after the generation prefix to create a deterministic scoring position. For multi-token candidates, log-probabilities are aggregated via log-sum-exp across tokenisation variants. Because scoring is teacher-forced at a fixed position, with no sampling, it is exactly reproducible.
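A minimal sketch of the scoring procedure, assuming a hypothetical `lp(prefix, candidate)` helper that returns the summed log-probability of the candidate's tokens given the prefix (the tokenisation variants shown are illustrative, not the actual set):

```python
import math

def logsumexp(vals):
    # Numerically stable log-sum-exp, used to aggregate tokenisation variants.
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def decision_score(lp, prompt, anchor="\nFinal:"):
    """s(x) = log p("Walk" | x) - log p("Drive" | x), teacher-forced at a fixed anchor.

    lp(prefix, candidate) -> summed log-probability of candidate's tokens
    (assumed helper standing in for a real model API).
    """
    prefix = prompt + anchor
    walk = ["Walk", " Walk"]      # illustrative tokenisation variants
    drive = ["Drive", " Drive"]
    return (logsumexp([lp(prefix, c) for c in walk])
            - logsumexp([lp(prefix, c) for c in drive]))
```

Since the anchor fixes the scoring position and no sampling is involved, the score is deterministic for a fixed model and prompt.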
To identify which input component drives the decision, we perturb each span independently and measure the change in decision score:
A(z) = s(occ(x, z)) − s(x)
We apply three occlusion operators — mask, neutral, and contradict — at three levels (sentence, span, token) and require agreement across all three operators to control for out-of-distribution artefacts.
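A sketch of the attribution and agreement check (the `score` and `occ` callables are placeholders for the decision score and an occlusion operator):

```python
def attribution(score, occ, x, span):
    """A(z) = s(occ(x, z)) - s(x): change in decision score when span z is perturbed."""
    return score(occ(x, span)) - score(x)

def operators_agree(effects, tol=1e-6):
    """True iff all operators (mask / neutral / contradict) move the score in the
    same nonzero direction; disagreement flags out-of-distribution artefacts."""
    signs = {1 if e > tol else -1 if e < -tol else 0 for e in effects}
    return len(signs) == 1 and 0 not in signs
```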
| Metric | Formula | Interpretation |
|---|---|---|
| Heuristic Dominance Ratio (HDR) | \|A(H)\| / \|A(G)\| | HDR > 1 ⇒ heuristic more influential than goal |
| Constraint Sensitivity Index (CSI) | \|A(G)\| | how much the model's decision moves when the goal span is perturbed |
| Distance Sensitivity Index (DSI) | \|A(H)\| | how much the decision moves when the heuristic span is perturbed |
We sweep distance d over 14 log-spaced values (10 m – 100 km) in a conflict condition (car wash: Drive always correct) and a control condition (coffee shop: answer depends on distance), sampling T=5 from 7 templates per point (140 prompts per model). Correct reasoning produces a flat conflict curve and a sigmoid control; a pure heuristic produces two near-identical sigmoids.
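The 14-point distance grid can be reproduced with a geometric progression (a sketch; endpoints of 10 m and 100 km in metres follow the stated range):

```python
def log_spaced(lo, hi, n):
    """n log-spaced values from lo to hi inclusive (geometric progression)."""
    ratio = (hi / lo) ** (1.0 / (n - 1))
    return [lo * ratio**i for i in range(n)]

distances_m = log_spaced(10, 100_000, 14)  # 10 m .. 100 km
```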
HOB is organised along two dimensions — 4 heuristic families × 5 constraint families — of which 15 cells are populated, across ~500 instances and 7 domains.
| Family | Pattern | Typical cues |
|---|---|---|
| H-prox (Proximity) | Closer → better | "5 min away," "next door" |
| H-eff (Efficiency) | Faster → better | "quickest way," "saves time" |
| H-cost (Cost) | Cheaper → better | "free option," "saves money" |
| H-sem (Semantic) | Name sounds right → viable | "gas station" for tires |
| Family | Definition | Example |
|---|---|---|
| C-pres (Presence) | Object must be at the destination | Car must be at the car wash |
| C-cap (Capability) | The means cannot perform the task | Can't carry a sofa on foot |
| C-val (Validity) | A precondition is violated | Can't drive with a flat tire |
| C-scope (Scope) | The service can't fulfil the goal | Gas station won't fix tires |
| C-proc (Procedural) | A step or timing requirement is not met | Store already closed |
Every instance has a matched variant where the constraint is removed (e.g., "get my car washed" → "pick up a car-wash gift card"), isolating constraint reasoning from surface comprehension.
Each instance is rendered at three levels: implicit (no cue), hint (one extra word), and explicit — a controlled measure of the inference bottleneck.
Three graded versions (strong / medium / weak) let us measure how reliably the heuristic can be overcome across cue salience.
We extend the parametric sweep framework to three additional H × C combinations, testing all six Study 1 models with T=10 trials per grid point (840 prompts per model).
- Cost sweep: copy shop vs. courthouse for certified documents, \$0–\$500.
- Time-advantage sweep: carrying a 500-lb safe yourself vs. hiring movers, 1 min – 8 h.
- Semantic-similarity sweep: gas-station descriptions from "small convenience store" to "full-service car-care center" for flat-tire repair.
- Distance sweep with a capability constraint: carrying a sofa home, where walking is physically infeasible regardless of distance.
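A sketch of how one sweep's prompt set is enumerated (the template set and value grid are placeholders; the section specifies T=10 trials per grid point):

```python
def sweep_prompts(grid_values, templates, trials=10):
    """Cross each swept value with each template, repeating each combination
    `trials` times (T=10 trials per grid point in this study)."""
    return [(value, template, trial)
            for value in grid_values
            for template in templates
            for trial in range(trials)]
```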
If the knowledge is present but not activated, can we make the model self-generate the hint?
Prepended before the HOB question, with no other changes:

> SYSTEM: "Before answering, list the necessary conditions that must be true for the stated goal to be accomplished. Then answer the question."
We re-evaluate three models spanning the performance range on all 500 HOB instances (N=10 trials each), comparing against zero-shot baselines; results are reported on the results page.
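The intervention can be applied mechanically as a system-message wrapper (the chat-message schema shown is an assumed format, not specified in the source):

```python
PROBE = ("Before answering, list the necessary conditions that must be true "
         "for the stated goal to be accomplished. Then answer the question.")

def with_condition_probe(question):
    """Prepend the elicitation instruction; the HOB question itself is unchanged."""
    return [
        {"role": "system", "content": PROBE},
        {"role": "user", "content": question},
    ]
```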