14 frontier models, ~500 instances, strict 10/10 consistency. Below: the full leaderboard, the H × C heatmap, the explicitness gradient, the minimal-pair asymmetry, mechanistic data from the car-wash case study, and the mitigation experiment.
Override accuracy (OA) = fraction of instances where all 10 trials are correct. Impl/Hint: accuracy under implicit vs. one-word-hint explicitness (the gap measures the inference bottleneck). Base/Pair: accuracy with the constraint active vs. removed (Δ < 0 ⇒ conservative bias).
| # | Model | OA (%) | Impl. | Hint | Base | Pair | Δ (Pair−Base) |
|---|---|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro | 74.6 | 73.9 | 86.5 | 84.5 | 60.3 | −24.2 |
| 2 | Qwen3.5-27B | 72.2 | 69.0 | 89.2 | 83.1 | 53.9 | −29.2 |
| 3 | Kimi K2.5 | 69.0 | 66.1 | 83.8 | 81.7 | 48.2 | −33.5 |
| 4 | Grok 4.2 | 68.6 | 65.2 | 81.1 | 73.9 | 66.7 | −7.3 |
| 5 | Claude Opus 4.6 | 68.0 | 66.4 | 81.1 | 81.7 | 46.8 | −34.9 |
| 6 | Claude Sonnet 4.5 | 66.8 | 64.9 | 81.1 | 78.2 | 51.8 | −26.4 |
| 7 | GPT-5.4 | 65.8 | 64.4 | 78.4 | 71.8 | 58.9 | −13.0 |
| 8 | GPT-5.2 | 64.4 | 60.3 | 86.5 | 78.2 | 40.4 | −37.7 |
| 9 | DeepSeek R1 | 64.2 | 62.4 | 73.0 | 75.4 | 49.6 | −25.7 |
| 10 | GPT-OSS-120B | 52.2 | 48.9 | 67.6 | 44.4 | 58.2 | +13.8 |
| 11 | Llama 4 Scout | 51.2 | 48.6 | 64.9 | 66.9 | 28.4 | −38.5 |
| 12 | Qwen3-14B | 51.2 | 47.4 | 54.1 | 53.5 | 48.2 | −5.3 |
| 13 | GPT-OSS-20B | 51.0 | 46.8 | 56.8 | 48.6 | 59.6 | +11.0 |
| 14 | Qwen3-32B | 49.6 | 44.8 | 59.5 | 47.9 | 46.1 | −1.8 |
| — | Mean | 62.6 | 59.2 | 74.5 | 69.2 | 50.9 | −18.0 |
All values reported under strict evaluation (an instance is correct iff all 10/10 trials are correct). Source: paper Table 1.
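The strict scoring rule and the minimal-pair Δ can be made concrete in a few lines. This is an illustrative sketch, not the paper's released code; the function names are ours.

```python
# Sketch of the strict per-instance metric and the conservative-bias delta
# described above. Names are illustrative, not from the paper's code.

def strict_override_accuracy(trials_per_instance):
    """OA = fraction of instances where ALL trials are correct.
    `trials_per_instance` is a list of per-trial bool lists (10 trials each)."""
    scored = [all(trials) for trials in trials_per_instance]
    return sum(scored) / len(scored)

def conservative_bias_delta(pair_acc, base_acc):
    """Delta = Pair - Base; a negative value indicates conservative bias
    (the model stays cautious even after the constraint is removed)."""
    return pair_acc - base_acc

# Toy example: 3 instances x 10 trials. A 9/10 instance does NOT count.
instances = [
    [True] * 10,
    [True] * 9 + [False],
    [True] * 10,
]
oa = strict_override_accuracy(instances)  # 2/3
```

Under this rule a single wrong trial out of ten zeroes the whole instance, which is why strict OA sits well below the per-trial accuracies reported elsewhere.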
Across 14 models, C-pres (presence) is universally the hardest constraint family, directly validating the car-wash mechanism at scale. Cost-based heuristics (H-cost) are the easiest to override; proximity (H-prox) and semantic (H-sem) cues are the hardest.
Cells A1 (H-prox × C-pres) and B1 (H-eff × C-pres) are the hardest; several models fall below 30% on these cells.
The explicitness gradient shows that failures are in inference, not knowledge. The minimal-pair asymmetry shows that many apparent successes come from conservative bias.
All six open models we probed produce sigmoid conflict curves that track the control — a goal-independent mapping from distance to decision.
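The "tracks the control" claim can be operationalized as a divergence check between the two curves. A minimal sketch, with a made-up sigmoid parameterization and helper names that are ours, not the paper's:

```python
import math

def sigmoid(x, midpoint, slope):
    """Logistic curve over distance; midpoint/slope are illustrative."""
    return 1.0 / (1.0 + math.exp(-slope * (x - midpoint)))

def goal_independence(conflict_scores, control_scores):
    """Max pointwise divergence between the conflict curve (goal present,
    constraint active) and the goal-free control curve. Near-zero means
    the distance-to-decision mapping ignores the goal entirely."""
    return max(abs(a - b) for a, b in zip(conflict_scores, control_scores))

# Toy curves over distances 0..9: identical sigmoids -> divergence 0.
distances = range(10)
conflict = [sigmoid(d, 5, 1.5) for d in distances]
control = [sigmoid(d, 5, 1.5) for d in distances]
```

A model that honored the goal would pull the conflict curve away from the control; the probed models do not, which is what the occlusion table below quantifies.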
Under the contradict operator, occluding the distance span shifts the decision score by as much as −30.3 logits, while occluding the goal barely moves it, sometimes even positively. HDR ranges from 8.7× to 38.0× across models.
| Model | Δs (goal occluded) | Δs (distance occluded) | HDR |
|---|---|---|---|
| Qwen3-4B | +3.5 | −30.3 | 8.7× |
| Qwen3-8B | +0.8 | −30.3 | 38.0× |
| Qwen3-14B | +0.7 | −23.8 | 32.6× |
| Qwen3-32B | −0.4 | −10.8 | 29.1× |
| Qwen3.5-27B | +0.8 | −7.7 | 9.3× |
| GPT-OSS-20B | −0.2 | −3.0 | 14.4× |
Span-level occlusion Δs under the contradict operator. HDR = |Δs_dist| / |Δs_goal|.
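The occlusion measurement itself reduces to a score difference. The sketch below is a stand-in harness: `score_decision` represents the model's decision logit (in the real setup, a model call), and the toy example is ours, not the paper's.

```python
# Illustrative span-occlusion harness for the table above.

def occlusion_delta(score_decision, prompt, span, mask="[...]"):
    """Delta_s for one span: decision score with the span occluded
    minus the baseline score on the intact prompt."""
    occluded = prompt.replace(span, mask)
    return score_decision(occluded) - score_decision(prompt)

def hdr(delta_goal, delta_dist):
    """HDR = |Delta_s_dist| / |Delta_s_goal|: how much more the distance
    cue drives the decision than the stated goal does."""
    return abs(delta_dist) / abs(delta_goal)

# Toy scorer: decides "go" only if the proximity cue is visible.
toy_score = lambda p: 10.0 if "2 miles" in p else -5.0
prompt = "The car wash is 2 miles away."
delta = occlusion_delta(toy_score, prompt, "2 miles")  # -15.0
```

Plugging in the Qwen3-4B row reproduces its ratio: `hdr(3.5, -30.3)` rounds to 8.7×.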
Four probes across six models reveal that the sigmoid failure is not universal — it depends on the H × C combination.
- 5 of 6 models: copy-shop vs. courthouse for certified documents. Conflict and control curves stay qualitatively distinct even as cost varies.
- 4 of 6 models: carrying a sofa home. The physical-capability constraint is preserved across distances.
- Qwen3-4B recommends personally carrying a 500-lb safe across all time-advantages: a context-independent mapping from efficiency to decision.
- As gas-station descriptions grow more "car-related," conflict scores transition from correct (mechanic) to incorrect (gas station): a sigmoid over semantic similarity.
Capability constraints (concrete physical properties like weight) are easier to maintain than abstract procedural or scope constraints — consistent with C-cap > C-scope in Study 2.
On H-eff × C-cap, larger Qwen models (32B, 27B) correctly shift negative; Qwen3-4B stays strongly positive. GPT-OSS-20B hovers near the decision boundary.
Prepending a single instruction — "list the necessary conditions for the stated goal, then answer" — recovers +6 to +9 pp on weaker models, with no fine-tuning and only a one-sentence prompt overhead.
Llama 4 Scout improves from 70.3% → 79.3% (+9.0 pp) and GPT-5.4 from 81.7% → 88.0% (+6.3 pp). Gemini 3.1 Pro, already the strongest baseline at 86.0%, shows no significant change (−0.6 pp) — suggesting it already activates constraint reasoning without explicit prompting.
The intervention is most effective on exactly the failure mode our analysis identifies: forcing the model to enumerate preconditions converts an implicit constraint into a self-generated hint, bypassing the inference bottleneck.
This confirms the mechanistic account: the knowledge is present, and the bottleneck is in the processing order. It also demonstrates a practical, near-zero-cost intervention for deployed systems.
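The intervention is simple enough to sketch end to end. The prefix wording is the one quoted above; everything else (function names, the pp helper) is a stand-in of ours:

```python
# Minimal sketch of the precondition-listing intervention.

PRECONDITION_PREFIX = (
    "List the necessary conditions for the stated goal, then answer.\n\n"
)

def with_precondition_prompt(task_prompt):
    """Prepend the instruction that forces the model to enumerate
    preconditions, converting an implicit constraint into a
    self-generated hint before it answers."""
    return PRECONDITION_PREFIX + task_prompt

def improvement_pp(acc_after, acc_before):
    """Gain in percentage points from applying the prefix."""
    return round(acc_after - acc_before, 1)

# E.g., the Llama 4 Scout numbers above: 70.3% -> 79.3% is +9.0 pp.
gain = improvement_pp(79.3, 70.3)
```

Because the change is purely a prompt prefix, it can be A/B-tested in production without touching model weights or decoding settings.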