14 frontier models, ~500 instances, strict 10/10 consistency. Below: the full leaderboard, the H × C heatmap, the explicitness gradient, the minimal-pair asymmetry, mechanistic data from the car-wash case study, and the mitigation experiment.
Override accuracy (OA) = fraction of instances where all 10 trials are correct. Impl/Hint: accuracy under implicit vs. one-word-hint explicitness (the gap measures the inference bottleneck). Base/Pair: accuracy with the constraint active vs. removed (Δ < 0 ⇒ conservative bias).
| # | Model | OA (%) | Impl. | Hint | Base | Pair | Δ (Pair−Base) |
|---|---|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro | 74.6 | 73.9 | 86.5 | 84.5 | 60.3 | −24.2 |
| 2 | Qwen3.5-27B | 72.2 | 69.0 | 89.2 | 83.1 | 53.9 | −29.2 |
| 3 | Kimi K2.5 | 69.0 | 66.1 | 83.8 | 81.7 | 48.2 | −33.5 |
| 4 | Grok 4.2 | 68.6 | 65.2 | 81.1 | 73.9 | 66.7 | −7.3 |
| 5 | Claude Opus 4.6 | 68.0 | 66.4 | 81.1 | 81.7 | 46.8 | −34.9 |
| 6 | Claude Sonnet 4.5 | 66.8 | 64.9 | 81.1 | 78.2 | 51.8 | −26.4 |
| 7 | GPT-5.4 | 65.8 | 64.4 | 78.4 | 71.8 | 58.9 | −13.0 |
| 8 | GPT-5.2 | 64.4 | 60.3 | 86.5 | 78.2 | 40.4 | −37.7 |
| 9 | DeepSeek R1 | 64.2 | 62.4 | 73.0 | 75.4 | 49.6 | −25.7 |
| 10 | GPT-OSS-120B | 52.2 | 48.9 | 67.6 | 44.4 | 58.2 | +13.8 |
| 11 | Llama 4 Scout | 51.2 | 48.6 | 64.9 | 66.9 | 28.4 | −38.5 |
| 12 | Qwen3-14B | 51.2 | 47.4 | 54.1 | 53.5 | 48.2 | −5.3 |
| 13 | GPT-OSS-20B | 51.0 | 46.8 | 56.8 | 48.6 | 59.6 | +11.0 |
| 14 | Qwen3-32B | 49.6 | 44.8 | 59.5 | 47.9 | 46.1 | −1.8 |
| — | Mean | 62.6 | 59.2 | 74.5 | 69.2 | 50.9 | −18.0 |
All values reported under strict evaluation (an instance is correct iff all 10/10 trials are correct). Source: paper Table 1.
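The strict scoring rule and the minimal-pair Δ can be made concrete in a few lines. This is an illustrative sketch, not the paper's released code; the function names are ours.

```python
# Sketch of the strict per-instance metric and the conservative-bias delta
# described above. Names are illustrative, not from the paper's code.

def strict_override_accuracy(trials_per_instance):
    """OA = fraction of instances where ALL trials are correct.
    `trials_per_instance` is a list of per-trial bool lists (10 trials each)."""
    scored = [all(trials) for trials in trials_per_instance]
    return sum(scored) / len(scored)

def conservative_bias_delta(pair_acc, base_acc):
    """Delta = Pair - Base; a negative value indicates conservative bias
    (the model stays cautious even after the constraint is removed)."""
    return pair_acc - base_acc

# Toy example: 3 instances x 10 trials. A 9/10 instance does NOT count.
instances = [
    [True] * 10,
    [True] * 9 + [False],
    [True] * 10,
]
oa = strict_override_accuracy(instances)  # 2/3
```

Under this rule a single wrong trial out of ten zeroes the whole instance, which is why strict OA sits well below the per-trial accuracies reported elsewhere.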
Across 14 models, C-pres (presence) is universally the hardest constraint family, directly validating the car-wash mechanism at scale. Cost-based heuristics (H-cost) are the easiest to override; proximity (H-prox) and semantic (H-sem) cues are the hardest.
Cells A1 (H-prox × C-pres) and B1 (H-eff × C-pres) are the hardest; several models fall below 30% on these cells.
The explicitness gradient shows that failures are in inference, not knowledge. The minimal-pair asymmetry shows that many apparent successes come from conservative bias.
All six open models we probed produce sigmoid conflict curves that track the control — a goal-independent mapping from distance to decision.
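The "tracks the control" claim can be operationalized as a divergence check between the two curves. A minimal sketch, with a made-up sigmoid parameterization and helper names that are ours, not the paper's:

```python
import math

def sigmoid(x, midpoint, slope):
    """Logistic curve over distance; midpoint/slope are illustrative."""
    return 1.0 / (1.0 + math.exp(-slope * (x - midpoint)))

def goal_independence(conflict_scores, control_scores):
    """Max pointwise divergence between the conflict curve (goal present,
    constraint active) and the goal-free control curve. Near-zero means
    the distance-to-decision mapping ignores the goal entirely."""
    return max(abs(a - b) for a, b in zip(conflict_scores, control_scores))

# Toy curves over distances 0..9: identical sigmoids -> divergence 0.
distances = range(10)
conflict = [sigmoid(d, 5, 1.5) for d in distances]
control = [sigmoid(d, 5, 1.5) for d in distances]
```

A model that honored the goal would pull the conflict curve away from the control; the probed models do not, which is what the occlusion table below quantifies.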
Under the contradict operator, occluding the distance span shifts the decision score by as much as −30.3 logits, while occluding the goal barely moves it, sometimes even positively. HDR ranges from 8.7× to 38.0× across models.
| Model | Δs (goal occluded) | Δs (distance occluded) | HDR |
|---|---|---|---|
| Qwen3-4B | +3.5 | −30.3 | 8.7× |
| Qwen3-8B | +0.8 | −30.3 | 38.0× |
| Qwen3-14B | +0.7 | −23.8 | 32.6× |
| Qwen3-32B | −0.4 | −10.8 | 29.1× |
| Qwen3.5-27B | +0.8 | −7.7 | 9.3× |
| GPT-OSS-20B | −0.2 | −3.0 | 14.4× |
Span-level occlusion Δs under the contradict operator. HDR = |Δs_dist| / |Δs_goal|.
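The occlusion measurement itself reduces to a score difference. The sketch below is a stand-in harness: `score_decision` represents the model's decision logit (in the real setup, a model call), and the toy example is ours, not the paper's.

```python
# Illustrative span-occlusion harness for the table above.

def occlusion_delta(score_decision, prompt, span, mask="[...]"):
    """Delta_s for one span: decision score with the span occluded
    minus the baseline score on the intact prompt."""
    occluded = prompt.replace(span, mask)
    return score_decision(occluded) - score_decision(prompt)

def hdr(delta_goal, delta_dist):
    """HDR = |Delta_s_dist| / |Delta_s_goal|: how much more the distance
    cue drives the decision than the stated goal does."""
    return abs(delta_dist) / abs(delta_goal)

# Toy scorer: decides "go" only if the proximity cue is visible.
toy_score = lambda p: 10.0 if "2 miles" in p else -5.0
prompt = "The car wash is 2 miles away."
delta = occlusion_delta(toy_score, prompt, "2 miles")  # -15.0
```

Plugging in the Qwen3-4B row reproduces its ratio: `hdr(3.5, -30.3)` rounds to 8.7×.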
Four probes across six models reveal that the sigmoid failure is not universal — it depends on the H × C combination.
- 5 of 6 models: copy-shop vs. courthouse for certified documents. Conflict and control curves stay qualitatively distinct even as cost varies.
- 4 of 6 models: carrying a sofa home. The physical-capability constraint is preserved across distances.
- Qwen3-4B recommends personally carrying a 500-lb safe across all time-advantages: a context-independent mapping from efficiency to decision.
- As gas-station descriptions grow more "car-related," conflict scores transition from correct (mechanic) to incorrect (gas station): a sigmoid over semantic similarity.
Capability constraints (concrete physical properties like weight) are easier to maintain than abstract procedural or scope constraints — consistent with C-cap > C-scope in Study 2.
On H-eff × C-cap, larger Qwen models (32B, 27B) correctly shift negative; Qwen3-4B stays strongly positive. GPT-OSS-20B hovers near the decision boundary.
Prepending a single instruction — "list the necessary conditions for the stated goal, then answer" — recovers +6 to +9 pp on weaker models, with no fine-tuning and only a one-sentence prompt overhead.
Llama 4 Scout improves from 70.3% → 79.3% (+9.0 pp) and GPT-5.4 from 81.7% → 88.0% (+6.3 pp). Gemini 3.1 Pro, already the strongest baseline at 86.0%, shows no significant change (−0.6 pp) — suggesting it already activates constraint reasoning without explicit prompting.
The intervention is most effective on exactly the failure mode our analysis identifies: forcing the model to enumerate preconditions converts an implicit constraint into a self-generated hint, bypassing the inference bottleneck.
This confirms the mechanistic account: the knowledge is present, and the bottleneck is in the processing order. It also demonstrates a practical, near-zero-cost intervention for deployed systems.
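The intervention is simple enough to sketch end to end. The prefix wording is the one quoted above; everything else (function names, the pp helper) is a stand-in of ours:

```python
# Minimal sketch of the precondition-listing intervention.

PRECONDITION_PREFIX = (
    "List the necessary conditions for the stated goal, then answer.\n\n"
)

def with_precondition_prompt(task_prompt):
    """Prepend the instruction that forces the model to enumerate
    preconditions, converting an implicit constraint into a
    self-generated hint before it answers."""
    return PRECONDITION_PREFIX + task_prompt

def improvement_pp(acc_after, acc_before):
    """Gain in percentage points from applying the prefix."""
    return round(acc_after - acc_before, 1)

# E.g., the Llama 4 Scout numbers above: 70.3% -> 79.3% is +9.0 pp.
gain = improvement_pp(79.3, 70.3)
```

Because the change is purely a prompt prefix, it can be A/B-tested in production without touching model weights or decoding settings.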