Why do only two of fourteen models improve when problem constraints are removed?

This explores a striking benchmark result — when you strip the constraints out of an optimization problem, most models get *worse*, not better — and what that reveals about whether models actually reason about constraints or just lean on a default trick.

This reads the question as being about what the "two of fourteen" anomaly tells us. The answer is that the other twelve models were never reasoning about constraints in the first place. When constraints are removed, a problem should get *easier*: there are fewer rules to satisfy. Instead, twelve of fourteen models drop, some by as much as 38.5 percentage points (see "Are models actually reasoning about constraints or just defaulting conservatively?"). The cleanest explanation is that most models had been faking competence by defaulting to the harder, more conservative option. When constraints are present, that conservative bias accidentally produces correct-looking answers; remove the constraints and the crutch disappears, exposing that no real evaluation of the problem was ever happening. The two that improve are likely the only two doing genuine constraint reasoning.
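The mechanism described above can be made concrete with a toy sketch. Everything here is an assumption for illustration: the `Item` type, the two policies, and the made-up benchmark items are hypothetical and are not data or code from the underlying studies. The sketch only shows how an always-conservative policy can look competent while constraints are present and collapse once they are removed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Item:
    # True when a stated constraint rules out the easy option,
    # which makes the harder option the correct answer.
    constraint_blocks_easy: bool


def correct_answer(item: Item) -> str:
    return "hard" if item.constraint_blocks_easy else "easy"


def conservative_default(item: Item) -> str:
    """Hypothetical policy: ignores the item and always picks the harder option."""
    return "hard"


def constraint_reasoner(item: Item) -> str:
    """Hypothetical policy: reads the constraint and answers accordingly."""
    return "hard" if item.constraint_blocks_easy else "easy"


def accuracy_pct(policy: Callable[[Item], str], items: List[Item]) -> float:
    """Percentage of items on which the policy picks the correct option."""
    hits = sum(policy(item) == correct_answer(item) for item in items)
    return 100.0 * hits / len(items)


# Toy benchmark (made-up items, not results from the source).
with_constraints = [Item(True)] * 8 + [Item(False)] * 2
without_constraints = [Item(False)] * 10

for name, policy in [("conservative_default", conservative_default),
                     ("constraint_reasoner", constraint_reasoner)]:
    before = accuracy_pct(policy, with_constraints)
    after = accuracy_pct(policy, without_constraints)
    # Difference is reported in percentage points (pp), not relative percent.
    print(f"{name}: {before:.1f}% -> {after:.1f}% ({after - before:+.1f} pp)")
```

In this toy setup the always-conservative policy scores 80% with constraints and 0% without, while the policy that actually reads the constraint stays at 100% in both conditions. The sketch does not reproduce the improvement reported for the two outlier models; it only illustrates why a drop points to a default rather than to evaluation.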

This lines up with a broader ceiling the corpus keeps running into. Across constrained-optimization tasks, models plateau around 55–60% constraint satisfaction regardless of size, architecture, or training (see "Do larger language models solve constrained optimization better?"), and reasoning-tuned variants with long chains of thought show no consistent edge over standard ones: extended thinking produces more text, not more actual computation (see "Do reasoning models actually beat standard models on optimization?"). A flat ceiling that scale doesn't move is the signature of a missing capability, not a tuning gap.
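For readers who want a concrete handle on "constraint satisfaction" as a score, here is a minimal sketch. It assumes the metric is the fraction of stated constraints that a candidate solution satisfies; the source notes do not spell out the exact definition, so treat this as one plausible reading. The dispatch-style variable names (`p1`, `p2`), the limits, and the candidate values are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Solution = Dict[str, float]
Constraint = Tuple[str, Callable[[Solution], bool]]


def satisfaction_rate(solution: Solution,
                      constraints: List[Constraint]) -> Tuple[float, List[str]]:
    """Return (fraction of constraints satisfied, names of violated constraints)."""
    violated = [name for name, check in constraints if not check(solution)]
    rate = (len(constraints) - len(violated)) / len(constraints)
    return rate, violated


# Hypothetical two-generator dispatch problem (assumed numbers).
constraints: List[Constraint] = [
    ("demand_balance", lambda s: abs(s["p1"] + s["p2"] - 100.0) <= 1e-6),
    ("p1_limit", lambda s: 0.0 <= s["p1"] <= 50.0),
    ("p2_limit", lambda s: 0.0 <= s["p2"] <= 60.0),
]

# A candidate answer that balances demand but overloads the second unit.
candidate = {"p1": 30.0, "p2": 70.0}
rate, violated = satisfaction_rate(candidate, constraints)
print(f"satisfied {rate:.0%} of constraints; violated: {violated}")
```

A checker like this is cheap to run, which is part of why a persistent plateau is informative: the constraints are easy to verify after the fact, yet hard for the model to satisfy while generating.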

The deepest version of "missing capability" here is architectural. Constraint solving fundamentally depends on *retraction*: trying a partial assignment, discovering it violates a rule, and discarding it. Autoregressive transformers can't un-emit a token, so they have no retraction primitive; symbolic solvers work precisely because they supply what the architecture lacks (see "Why does autoregressive generation fail at constraint satisfaction?"). Seen this way, conservative bias isn't laziness; it's the best available substitute for a search procedure the model can't actually run.
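To make "retraction" concrete, here is a minimal sketch of textbook backtracking search, the kind of procedure a symbolic solver runs. It is not taken from any of the cited work; the function name `solve`, the three-variable coloring problem, and the `retractions` log are illustrative assumptions. The line to watch is the `del`, which undoes a choice in a way an append-only token stream cannot.

```python
from typing import Callable, Dict, List, Optional, Tuple

Assignment = Dict[str, str]


def solve(variables: List[str],
          domains: Dict[str, List[str]],
          consistent: Callable[[Assignment], bool],
          assignment: Assignment,
          retractions: List[Tuple[str, str]]) -> Optional[Assignment]:
    """Depth-first backtracking over partial assignments."""
    if len(assignment) == len(variables):
        return dict(assignment)
    var = next(v for v in variables if v not in assignment)
    for value in domains[var]:
        assignment[var] = value                # try a partial assignment
        if consistent(assignment):
            result = solve(variables, domains, consistent, assignment, retractions)
            if result is not None:
                return result
        del assignment[var]                    # retraction: discard the choice
        retractions.append((var, value))       # log it for inspection
    return None


# Hypothetical problem: A must differ from B and from C.
edges = [("A", "B"), ("A", "C")]


def no_conflict(assignment: Assignment) -> bool:
    """True if no constrained pair of assigned variables shares a value."""
    return all(assignment[u] != assignment[v]
               for u, v in edges if u in assignment and v in assignment)


variables = ["A", "B", "C"]
domains = {"A": ["red", "green"], "B": ["red", "green"], "C": ["red"]}
retractions: List[Tuple[str, str]] = []
solution = solve(variables, domains, no_conflict, {}, retractions)
print(solution)
print(retractions)
```

On this toy problem the search commits to `A = red`, finds that `C` then has no legal value, and withdraws four choices in total before settling on `A = green`. A left-to-right generator that had already emitted "A = red" would have to either live with the conflict or restate the whole answer.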

What makes this more than a one-off curiosity is that the same "removing something should help but hurts" pattern shows up elsewhere. In heuristic-override tasks, stripping out spurious cues *degrades* performance, the opposite of what shortcut-learning theory predicts, because the real challenge is integrating conflicting signals, not filtering distractors (see "Why does removing spurious cues sometimes hurt model performance?"). Models can also post perfect accuracy while their internal representations are fractured and disorganized, a brittleness that standard metrics never reveal until the input shifts (see "Can models be smart without organized internal structure?"). The constraint-removal experiment is a clever instance of a general diagnostic move: perturb the problem in a way that *should* be harmless, and watch which models break. Most do, which tells you the headline scores were measuring the wrong thing.

Sources 6 notes

Are models actually reasoning about constraints or just defaulting conservatively?

Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.

Do reasoning models actually beat standard models on optimization?

Reasoning variants with extended CoT show no consistent advantage over standard models on constraint-bound numerical tasks like optimal power flow. Extended thinking produces more text, not more iterative computation, suggesting the bottleneck is numeric procedure rather than reasoning steps.

Why does autoregressive generation fail at constraint satisfaction?

The performance ceiling on constraint satisfaction problems is not a model-quality issue but an architectural limitation: autoregressive transformers cannot retract emitted tokens, while CSP solvers fundamentally depend on discarding invalid partial assignments. Symbolic solver integration works because it supplies what the architecture lacks.

Why does removing spurious cues sometimes hurt model performance?

Removing spurious cues degrades performance in heuristic override tasks, opposite to shortcut learning predictions. The failure mode is integrating conflicting signals rather than ignoring distractors—a frame problem, not feature selection.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about LLM constraint reasoning. The question: Why do only 2 of 14 models improve when problem constraints are removed—and what does that tell us about whether models reason about constraints at all?

What a curated library found — and when (dated claims, not current truth): Findings span May 2024–May 2026.

• Twelve of fourteen models drop in performance when constraints are removed, some by up to 38.5 percentage points; the two that improve are likely the only ones doing genuine constraint reasoning, while most rely on conservative bias as a proxy (~2026).
• Across constrained-optimization tasks, models plateau at 55–60% constraint satisfaction regardless of scale, architecture, or training; reasoning-tuned variants show no consistent edge, indicating a missing capability rather than a tuning gap (~2025–2026).
• Autoregressive transformers lack a *retraction* primitive—they cannot un-emit tokens—whereas constraint satisfaction fundamentally requires trying partial assignments, discovering violations, and discarding them; symbolic solvers succeed because they supply what the architecture cannot (~2026).
• In heuristic-override tasks, removing spurious cues *degrades* performance (opposite of shortcut-learning theory), because the real challenge is integrating conflicting signals, not filtering distractors (~2026).
• Models can post perfect accuracy while their internal representations remain fractured and disorganized—a brittleness masked by standard metrics until inputs shift (~2026).

Anchor papers (verify; mind their dates):
• arXiv:2505.20296 (May 2025) – Reasoning LLMs are Wandering Solution Explorers
• arXiv:2603.23004 (Mar 2026) – Can Large Language Models Reason and Optimize Under Constraints?
• arXiv:2603.29025 (Mar 2026) – The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning
• arXiv:2605.19376 (May 2026) – Generative Recursive Reasoning

Your task:
(1) RE-TEST EACH CONSTRAINT. For the 12-of-14 drop, the plateauing at 55–60%, and the retraction-primitive gap: has architectural innovation (e.g., diffusion-based generation, recursive reasoning, token-level rollback via scaffolding), training method (e.g., critique fine-tuning, RLVR), or evaluation tooling since relaxed or overturned these limits? Separate the durable question ("do LLMs intrinsically reason about constraints?") from the perishable limitation ("current autoregressive transformers cannot retract"). Cite what—if anything—has changed.

(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Has any paper shown that the 2 of 14 models that improve actually *do* reason robustly, or demonstrated that constraint satisfaction is solvable within the autoregressive regime without external search?

(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., "If recursive reasoning or diffusion-style iterative refinement enables retraction-like behavior, does the 55–60% plateau now shift?" or "Do models fine-tuned on critique-based feedback show genuine constraint integration, or is performance gain still surface-level?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
