Results

Leaderboard, probes, and mitigation

14 frontier models, ~500 instances, strict 10/10 consistency. Below: the full leaderboard, the H × C heatmap, the explicitness gradient, the minimal-pair asymmetry, mechanistic data from the car-wash case study, and the mitigation experiment.

Strict 10/10 leaderboard

Override accuracy (OA) = fraction of instances where all 10 trials are correct. Impl. / Hint: accuracy under implicit vs. one-word-hint explicitness (the gap measures the inference bottleneck). Base / Pair: accuracy with the constraint active vs. removed (Δ < 0 ⇒ conservative bias).

| # | Model | OA (%) | Impl. | Hint | Base | Pair | Δ (Pair−Base) |
|---|-------|--------|-------|------|------|------|---------------|
| 1 | Gemini 3.1 Pro | 74.6 | 73.9 | 86.5 | 84.5 | 60.3 | −24.2 |
| 2 | Qwen3.5-27B | 72.2 | 69.0 | 89.2 | 83.1 | 53.9 | −29.2 |
| 3 | Kimi K2.5 | 69.0 | 66.1 | 83.8 | 81.7 | 48.2 | −33.5 |
| 4 | Grok 4.2 | 68.6 | 65.2 | 81.1 | 73.9 | 66.7 | −7.3 |
| 5 | Claude Opus 4.6 | 68.0 | 66.4 | 81.1 | 81.7 | 46.8 | −34.9 |
| 6 | Claude Sonnet 4.5 | 66.8 | 64.9 | 81.1 | 78.2 | 51.8 | −26.4 |
| 7 | GPT-5.4 | 65.8 | 64.4 | 78.4 | 71.8 | 58.9 | −13.0 |
| 8 | GPT-5.2 | 64.4 | 60.3 | 86.5 | 78.2 | 40.4 | −37.7 |
| 9 | DeepSeek R1 | 64.2 | 62.4 | 73.0 | 75.4 | 49.6 | −25.7 |
| 10 | GPT-OSS-120B | 52.2 | 48.9 | 67.6 | 44.4 | 58.2 | +13.8 |
| 11 | Llama 4 Scout | 51.2 | 48.6 | 64.9 | 66.9 | 28.4 | −38.5 |
| 12 | Qwen3-14B | 51.2 | 47.4 | 54.1 | 53.5 | 48.2 | −5.3 |
| 13 | GPT-OSS-20B | 51.0 | 46.8 | 56.8 | 48.6 | 59.6 | +11.0 |
| 14 | Qwen3-32B | 49.6 | 44.8 | 59.5 | 47.9 | 46.1 | −1.8 |
| | Mean | 62.6 | 59.2 | 74.5 | 69.2 | 50.9 | −18.0 |

All values reported under strict evaluation (instance correct iff 10/10 trials correct). Source: paper Table 1.
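The strict scoring rule can be sketched in a few lines. The function names and data layout below are illustrative, not the paper's actual code:

```python
# Strict "10/10" scoring: an instance counts as correct only if every
# one of its trials is correct. Hypothetical data layout:
# trials maps instance id -> list of per-trial booleans.

def strict_accuracy(trials):
    """Percent of instances with all trials correct."""
    flags = [all(t) for t in trials.values()]
    return 100.0 * sum(flags) / len(flags)

def pair_delta(base_trials, pair_trials):
    """Delta = Pair - Base; negative values indicate conservative bias."""
    return strict_accuracy(pair_trials) - strict_accuracy(base_trials)
```

Note that an instance with 9/10 correct trials contributes nothing under strict scoring, which is why the OA column sits well below per-trial accuracy.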

Mean strict accuracy by family

Across 14 models, C-pres (presence) is universally the hardest constraint family, directly validating the car-wash mechanism at scale. Cost-based heuristics (H-cost) are the easiest to override; proximity (H-prox) and semantic (H-sem) cues are the hardest.

By constraint family

Mean ± range across 14 models. C-pres (44.4%) is hardest; C-cap (71.6%) is easiest.

By heuristic family

Cost cues are easiest to override; proximity and semantic-match cues the hardest.

H × C cell heatmap (14-model mean)

Cells A1 (H-prox × C-pres) and B1 (H-eff × C-pres) are the hardest; several models fall below 30% on these cells.

H × C strict-accuracy heatmap
Mean strict accuracy per H × C cell (14 models). C-pres is hardest; C-cap is easiest. Source: paper Fig. study2_hxc_heatmap.
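The per-cell means behind the heatmap are a straightforward group-by. The record format here is an assumption for illustration:

```python
from collections import defaultdict

def cell_means(records):
    """Mean strict accuracy (%) per (heuristic, constraint) cell.

    records: iterable of (heuristic, constraint, strict_correct) tuples,
    where strict_correct is True iff the instance passed 10/10 trials.
    The tuple layout is illustrative, not the paper's actual format.
    """
    acc = defaultdict(lambda: [0, 0])
    for h, c, ok in records:
        acc[(h, c)][0] += bool(ok)
        acc[(h, c)][1] += 1
    return {cell: 100.0 * hits / n for cell, (hits, n) in acc.items()}
```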

Two controlled comparisons

The explicitness gradient shows that failures are in inference, not knowledge. The minimal-pair asymmetry shows that many apparent successes come from conservative bias.

Explicitness gradient (Implicit → Hint)

A single one-word hint recovers +15.3 pp on average. The knowledge is present.
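The +15.3 pp figure follows directly from the leaderboard's column means:

```python
# Impl. and Hint column means from the leaderboard (Table 1).
impl_mean, hint_mean = 59.2, 74.5
gap = hint_mean - impl_mean  # explicitness gradient, in percentage points
```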

Minimal-pair asymmetry (Base − Pair)

12 / 14 models do worse when the constraint is removed. Only GPT-OSS-120B (+13.8) and GPT-OSS-20B (+11.0) improve on pairs.

The sigmoid signature

All six open models we probed produce sigmoid conflict curves that track the control — a goal-independent mapping from distance to decision.

Monotonicity overlay across six models
Conflict curves (solid) are sigmoids that track the control (dashed grey). No flat curve appears — every model behaves as if it has a shared, goal-independent heuristic.
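The failure shape can be sketched as a logistic function of distance alone; the midpoint and slope below are arbitrary placeholders, not fitted values:

```python
import math

def decision_score(distance_km, midpoint=2.0, slope=1.5):
    """Toy goal-independent decision curve: score for "Drive" as a
    function of distance only. A model with this mapping produces the
    same sigmoid whether or not the goal makes driving inappropriate.
    midpoint/slope are illustrative placeholders."""
    return 1.0 / (1.0 + math.exp(-slope * (distance_km - midpoint)))
```

The failure signature is that conflict and control inputs both land on one such curve; correct reasoning would keep the conflict curve flat (constraint maintained) while only the control curve follows the sigmoid.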

Heuristic Dominance Ratio

Under the contradict operator, occluding the distance span shifts the decision score by as much as −30.3 logits. Occluding the goal barely moves it — sometimes positively. HDR ranges 8.7× to 38.0×.

| Model | Goal Δs | Dist. Δs | HDR |
|-------|---------|----------|-----|
| Qwen3-4B | +3.5 | −30.3 | 8.7× |
| Qwen3-8B | +0.8 | −30.3 | 38.0× |
| Qwen3-14B | +0.7 | −23.8 | 32.6× |
| Qwen3-32B | −0.4 | −10.8 | 29.1× |
| Qwen3.5-27B | +0.8 | −7.7 | 9.3× |
| GPT-OSS-20B | −0.2 | −3.0 | 14.4× |

Span-level occlusion Δs under the contradict operator. HDR = |Δs_dist| / |Δs_goal|.
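HDR is a one-line computation from the two occlusion deltas; recomputing from the Qwen3-4B row reproduces the 8.7× value (other rows were presumably computed from unrounded Δs, so they differ slightly):

```python
def hdr(delta_s_goal, delta_s_dist):
    """Heuristic Dominance Ratio: |Δs_dist| / |Δs_goal|.
    Values far above 1 mean occluding the distance span moves the
    decision score far more than occluding the goal span."""
    return abs(delta_s_dist) / abs(delta_s_goal)
```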

Cross-model span heatmap
Per-span Δs heatmap across all six models. Distance columns are uniformly blue (toward Drive); goal columns near-zero or positive (toward Walk).
Cross-model HDR comparison
CSI vs. DSI per paraphrase, across models. Goal sensitivity is fragile; distance sensitivity is stable.

Three patterns, not one

Four probes across six models reveal that the sigmoid failure is not universal — it depends on the H × C combination.

Correct reasoning

H-cost × C-scope

5 of 6 models. Copy-shop vs. courthouse for certified documents. Conflict and control curves stay qualitatively distinct even as cost varies.

Correct reasoning

H-prox × C-cap

4 of 6 models. Carrying a sofa home. The physical-capability constraint is preserved across distances.

Sigmoid failure

H-eff × C-cap

Qwen3-4B recommends personally carrying a 500-lb safe across all time-advantages — a context-independent mapping from efficiency to decision.

Semantic sigmoid

H-sem × C-scope

As gas-station descriptions grow more "car-related," conflict scores transition from correct (mechanic) to incorrect (gas station) — a sigmoid over semantic similarity.

Constraint type matters

Capability constraints (concrete physical properties like weight) are easier to maintain than abstract procedural or scope constraints — consistent with C-cap > C-scope in Study 2.

Model scale helps, partially

On H-eff × C-cap, larger Qwen models (32B, 27B) correctly shift negative; Qwen3-4B stays strongly positive. GPT-OSS-20B hovers near the decision boundary.

Probe summary heatmap
Probe pattern classification across 6 models × 4 probes. Green = correct (curves distinct), yellow = partial, red = sigmoid failure. The efficiency probe shows the most failures; cost and prox-cap show the most correct reasoning.

Goal-decomposition prompting

Prepending a single instruction — "list the necessary conditions for the stated goal, then answer" — recovers +6 to +9 pp on weaker models at zero inference cost.
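As a sketch, the intervention is just a string prepend. The wrapper function is our own naming; the quoted instruction is the one described above:

```python
# Goal-decomposition prefix: the single instruction from the mitigation
# experiment. The wrapper itself is an illustrative sketch.
GOAL_DECOMP_INSTRUCTION = (
    "List the necessary conditions for the stated goal, then answer.\n\n"
)

def with_goal_decomposition(prompt: str) -> str:
    """Prepend the goal-decomposition instruction. No extra model calls,
    fine-tuning, or scaffolding: a single prompt prefix."""
    return GOAL_DECOMP_INSTRUCTION + prompt
```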

Non-strict accuracy, base vs. goal-decomposition

Evaluated on all ~500 HOB instances, N=10 per instance, three models spanning the range.

What it shows

Llama 4 Scout improves from 70.3% → 79.3% (+9.0 pp) and GPT-5.4 from 81.7% → 88.0% (+6.3 pp). Gemini 3.1 Pro, already the strongest baseline at 86.0%, shows no significant change (−0.6 pp) — suggesting it already activates constraint reasoning without explicit prompting.

The intervention is most effective on exactly the failure mode our analysis identifies: forcing the model to enumerate preconditions converts an implicit constraint into a self-generated hint, bypassing the inference bottleneck.

This confirms the mechanistic account — the knowledge is present; the bottleneck is in the processing order — and demonstrates a practical, zero-cost intervention for deployed systems.