Why do only two of fourteen models improve when problem constraints are removed?
This explores a striking benchmark result — when you strip the constraints out of an optimization problem, most models get *worse*, not better — and what that reveals about whether models actually reason about constraints or just lean on a default trick.
This reads the question as being about what the "two of fourteen" anomaly tells us. The answer is that the other twelve models were never reasoning about constraints in the first place. When constraints are removed, a problem should get *easier*, because there are fewer rules to satisfy. Instead, twelve of fourteen models drop, some by as much as 38.5 percentage points (see "Are models actually reasoning about constraints or just defaulting conservatively?"). The cleanest explanation is that most models had been faking competence by defaulting to the harder, more conservative option. When constraints are present, that conservative bias accidentally produces correct-looking answers; remove the constraints and the crutch disappears, exposing that no real evaluation of the problem was ever happening. The two that improve are likely the only two doing genuine constraint reasoning.
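To make the "crutch" mechanism concrete, here is a minimal toy sketch. Everything in it is an assumption made for illustration: the `Problem` fields, the two option names, and the four data points are invented and are not taken from the benchmark. It only shows how a policy that never inspects the constraint can look competent while the constraint happens to favor its default, then collapse once the constraint is removed.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Problem:
    conservative_value: float  # payoff of the safe option
    aggressive_value: float    # payoff of the riskier option
    aggressive_load: float     # resource the riskier option consumes
    limit: Optional[float]     # cap on load; None means the constraint was removed


def correct_choice(p: Problem) -> str:
    """Ground truth: take the better-paying option that is still feasible."""
    feasible = p.limit is None or p.aggressive_load <= p.limit
    if feasible and p.aggressive_value > p.conservative_value:
        return "aggressive"
    return "conservative"


def always_conservative(p: Problem) -> str:
    """A policy that never looks at the constraint at all."""
    return "conservative"


def accuracy(policy, problems) -> float:
    """Fraction of problems where the policy matches the ground truth."""
    return sum(policy(p) == correct_choice(p) for p in problems) / len(problems)


# Invented toy data: the riskier option always pays more,
# and the cap rules it out in 3 of the 4 cases.
constrained = [Problem(1.0, 2.0, load, 10.0) for load in (12.0, 15.0, 11.0, 8.0)]
unconstrained = [replace(p, limit=None) for p in constrained]

print(accuracy(always_conservative, constrained))    # 0.75
print(accuracy(always_conservative, unconstrained))  # 0.0
```

In this toy setup the default-only policy scores 0.75 with the cap in place and 0.0 without it, even though the second version of each problem is strictly easier. A policy that actually evaluated feasibility would score 1.0 in both cases.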
This lines up with a broader ceiling the corpus keeps running into. Across constrained-optimization tasks, models plateau around 55–60% constraint satisfaction regardless of size, architecture, or training (see "Do larger language models solve constrained optimization better?"), and reasoning-tuned variants with long chains of thought show no consistent edge over standard ones: extended thinking produces more text, not more actual computation (see "Do reasoning models actually beat standard models on optimization?"). A flat ceiling that scale doesn't move is the signature of a missing capability, not a tuning gap.
The deepest version of "missing capability" here is architectural. Constraint solving fundamentally depends on *retraction*: trying a partial assignment, discovering it violates a rule, and discarding it. Autoregressive transformers can't un-emit a token, so they have no retraction primitive; symbolic solvers work precisely because they supply what the architecture lacks (see "Why does autoregressive generation fail at constraint satisfaction?"). Seen this way, conservative bias isn't laziness; it's the best available substitute for a search procedure the model can't actually run.
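Retraction is easiest to see in a small backtracking search. The sketch below is a generic textbook-style solver, not the method of any particular system discussed here; the three-variable coloring problem, the `EDGES` set, and the `retracted` log are hypothetical and exist only to make the discarded assignments visible.

```python
def solve(variables, domains, conflicts, assignment, retracted):
    """Depth-first backtracking: commit a value, recurse, retract on failure."""
    if len(assignment) == len(variables):
        return dict(assignment)
    var = next(v for v in variables if v not in assignment)
    for value in domains[var]:
        consistent = all(
            not conflicts(var, value, other, other_value)
            for other, other_value in assignment.items()
        )
        if not consistent:
            continue
        assignment[var] = value  # tentative commitment
        result = solve(variables, domains, conflicts, assignment, retracted)
        if result is not None:
            return result
        del assignment[var]  # retraction: undo the commitment
        retracted.append((var, value))
    return None


# Hypothetical toy problem: adjacent variables must not share a color.
EDGES = {frozenset(("A", "B")), frozenset(("A", "C"))}


def conflicts(v1, c1, v2, c2):
    return frozenset((v1, v2)) in EDGES and c1 == c2


variables = ["A", "B", "C"]
domains = {"A": ["red", "green"], "B": ["red", "green"], "C": ["red"]}
retracted = []
solution = solve(variables, domains, conflicts, {}, retracted)

print(solution)   # {'A': 'green', 'B': 'red', 'C': 'red'}
print(retracted)  # [('B', 'green'), ('A', 'red')]
```

The solver first commits to `A = red`, discovers two levels down that `C` has no legal value, and deletes both `B = green` and `A = red` before succeeding with `A = green`. The `del assignment[var]` line is the step that has no counterpart in left-to-right token generation.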
What makes this more than a one-off curiosity is that the same "removing something should help but hurts" pattern shows up elsewhere. In heuristic-override tasks, stripping out spurious cues *degrades* performance, the opposite of what shortcut-learning theory predicts, because the real challenge is integrating conflicting signals, not filtering distractors (see "Why does removing spurious cues sometimes hurt model performance?"). And models can post perfect accuracy while their internal representations are fractured and disorganized, a brittleness that standard metrics never reveal until the input shifts (see "Can models be smart without organized internal structure?"). The constraint-removal experiment is a clever instance of a general diagnostic move: perturb the problem in a way that *should* be harmless, and watch which models break. Most do, which tells you the headline scores were measuring the wrong thing.
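The diagnostic move described above reduces to comparing each model's score before and after a perturbation that should not hurt. A minimal sketch follows; the function names, the `tolerance` parameter, the model names, and every score are placeholders invented for illustration, not benchmark results.

```python
def perturbation_deltas(before, after):
    """Score change per model, in percentage points (after minus before)."""
    return {name: round(after[name] - before[name], 1) for name in before}


def broken_by_perturbation(before, after, tolerance=0.0):
    """Models whose score fell by more than `tolerance` points."""
    deltas = perturbation_deltas(before, after)
    return sorted(name for name, delta in deltas.items() if delta < -tolerance)


# Placeholder scores (percent correct); not real measurements.
with_constraints = {"model_a": 60.0, "model_b": 58.0, "model_c": 55.0}
without_constraints = {"model_a": 41.0, "model_b": 63.0, "model_c": 50.0}

print(perturbation_deltas(with_constraints, without_constraints))
# {'model_a': -19.0, 'model_b': 5.0, 'model_c': -5.0}
print(broken_by_perturbation(with_constraints, without_constraints, tolerance=2.0))
# ['model_a', 'model_c']
```

The `tolerance` argument is there so that small run-to-run noise is not counted as a failure; the right threshold depends on the evaluation's variance and is left as an assumption.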
Sources (6 notes)
Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.
Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.
Reasoning variants with extended CoT show no consistent advantage over standard models on constraint-bound numerical tasks like optimal power flow. Extended thinking produces more text, not more iterative computation, suggesting the bottleneck is numeric procedure rather than reasoning steps.
The performance ceiling on constraint satisfaction problems is not a model-quality issue but an architectural limitation: autoregressive transformers cannot retract emitted tokens, while CSP solvers fundamentally depend on discarding invalid partial assignments. Symbolic solver integration works because it supplies what the architecture lacks.
Removing spurious cues degrades performance in heuristic override tasks, opposite to shortcut learning predictions. The failure mode is integrating conflicting signals rather than ignoring distractors—a frame problem, not feature selection.
Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.