Why do reasoning models fail to improve constrained optimization performance?
This explores why models with extended chain-of-thought reasoning don't do better on problems with hard numerical constraints (such as optimal power flow), and the corpus points to a single answer: the bottleneck isn't reasoning, it's execution.
Reasoning models, the ones trained to 'think longer' before answering, fail to improve on constrained optimization: the class of problems where a solution has to satisfy hard numerical limits (like balancing a power grid). The corpus is close to unanimous here, and the punchline is counterintuitive: the failure has almost nothing to do with reasoning quality. Across architecture, scale, and training regime, LLMs plateau at roughly 55–60% constraint satisfaction [Do larger language models solve constrained optimization better?], and reasoning variants don't systematically beat standard models on these numerical tasks [Do reasoning models actually beat standard models on optimization?]. Extended thinking produces more text, not more iterative computation, which is the first clue.
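To make "constraint satisfaction" concrete, here is a minimal Python sketch of how such a rate could be scored on a toy power-balance problem. The generator limits, demand, candidate answers, and function names are all illustrative assumptions; this is not the benchmark or scoring code behind the notes.

```python
from dataclasses import dataclass


@dataclass
class Generator:
    p_min: float  # minimum output in MW (assumed value)
    p_max: float  # maximum output in MW (assumed value)


def is_feasible(dispatch, generators, demand, tol=1e-3):
    """True if the dispatch meets demand and respects every generator limit."""
    if len(dispatch) != len(generators):
        return False
    within_limits = all(
        g.p_min - tol <= p <= g.p_max + tol
        for p, g in zip(dispatch, generators)
    )
    balanced = abs(sum(dispatch) - demand) <= tol
    return within_limits and balanced


def constraint_satisfaction_rate(candidates, generators, demand):
    """Fraction of candidate answers that satisfy all hard constraints."""
    if not candidates:
        return 0.0
    feasible = sum(is_feasible(c, generators, demand) for c in candidates)
    return feasible / len(candidates)


# Hypothetical two-generator system and four made-up model answers.
GENERATORS = [Generator(10.0, 100.0), Generator(20.0, 80.0)]
CANDIDATES = [
    [70.0, 50.0],   # balanced and within limits
    [110.0, 10.0],  # balanced, but both units violate their limits
    [60.0, 50.0],   # within limits, but 10 MW short of demand
    [40.0, 80.0],   # balanced and within limits
]
rate = constraint_satisfaction_rate(CANDIDATES, GENERATORS, demand=120.0)
```

The point of the sketch is that the score is binary per answer: an output that is close, or that looks plausible, counts as a failure if any single limit is violated.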
The deeper diagnosis is that these models can't actually *run* iterative numerical procedures. They recognize an optimization problem as template-similar to ones they've seen and emit a plausible-looking answer, rather than executing the step-by-step method that would converge on a feasible solution [Do large language models actually perform iterative optimization?]. One note reframes the famous 'reasoning cliff' entirely: model collapses on hard problems are *execution* failures, not reasoning failures. The model often knows the algorithm but can't carry it out at scale in text-only generation. Give it tools, and problems beyond the supposed cliff become solvable [Are reasoning model collapses really failures of reasoning?]. That's a strong hint that the missing ingredient is procedural bandwidth, not smarter thinking.
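For a sense of what "running the procedure" means, the sketch below uses the textbook lambda-iteration (bisection on marginal price) method for a toy economic dispatch problem. It is swapped in here as a small stand-in for an iterative numerical method, not as the optimal power flow task the notes evaluate, and the cost coefficients, limits, and demand are assumed values.

```python
def dispatch_at_price(lam, units):
    """Each unit's cost-minimizing output at marginal price lam, clipped to limits.

    Each unit is (a, b, p_min, p_max) with cost a*p**2 + b*p and a > 0.
    """
    return [
        min(max((lam - b) / (2.0 * a), p_min), p_max)
        for a, b, p_min, p_max in units
    ]


def lambda_iteration(units, demand, tol=1e-6, max_iter=200):
    """Bisect on the marginal price until total output matches demand."""
    total_min = sum(u[2] for u in units)
    total_max = sum(u[3] for u in units)
    if not total_min <= demand <= total_max:
        raise ValueError("demand is outside the fleet's capacity range")

    # Price bracket: lowest marginal cost at p_min, highest at p_max.
    lo = min(2.0 * a * p_min + b for a, b, p_min, _ in units)
    hi = max(2.0 * a * p_max + b for a, b, _, p_max in units)

    for iteration in range(1, max_iter + 1):
        lam = (lo + hi) / 2.0
        dispatch = dispatch_at_price(lam, units)
        mismatch = sum(dispatch) - demand
        if abs(mismatch) <= tol:
            return dispatch, iteration
        if mismatch > 0:
            hi = lam  # too much generation: lower the price
        else:
            lo = lam  # too little generation: raise the price
    raise RuntimeError("bisection did not converge")


# Hypothetical units: (a, b, p_min, p_max).
UNITS = [(0.01, 2.0, 10.0, 100.0), (0.02, 1.0, 20.0, 80.0)]
dispatch, iterations = lambda_iteration(UNITS, demand=120.0)
```

Even on this two-unit toy, the answer only emerges after a couple of dozen dependent update steps, each of which needs exact arithmetic on the previous one. That is the kind of work a solver or tool call does cheaply and that text-only generation has to imitate token by token.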
This is why training-based fixes keep missing. Supervised fine-tuning makes outputs *look* correct (clean JSON, valid identifiers, the right sections) without making them physically feasible; the model learns the surface of a solution, not how to build one [Does supervised fine-tuning actually improve reasoning on optimization problems?]. RL fine-tuning is no better: even GRPO-trained models drop sharply on slightly shifted out-of-distribution variants (N-1 test sets), which points to sharpened template-matching rather than a learned procedure [Do fine-tuned language models actually learn optimization procedures?]. Even frontier reasoning models like o1-preview and DeepSeek-R1 reach only about 20–23.6% exact match on constraint-satisfaction problems that demand genuine backtracking; fluency at reflection doesn't translate into competence on unfamiliar instance structures [Can reasoning models actually sustain long-chain reflection?].
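The gap between "looks correct" and "is feasible" can be shown as two separate checks. The sketch below is a hedged illustration: the JSON schema (`dispatch_mw`), the generator IDs, the limits, and the demand are hypothetical and are not taken from the notes.

```python
import json

# Hypothetical limits (MW) and demand, invented for illustration only.
LIMITS_MW = {"G1": (10.0, 100.0), "G2": (20.0, 80.0)}
DEMAND_MW = 120.0


def is_well_formed(raw):
    """Surface check: valid JSON, expected key, known IDs, numeric values."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    dispatch = data.get("dispatch_mw") if isinstance(data, dict) else None
    if not isinstance(dispatch, dict) or set(dispatch) != set(LIMITS_MW):
        return False
    return all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in dispatch.values()
    )


def is_feasible(raw, tol=1e-3):
    """Physical check: every limit respected and supply equal to demand."""
    if not is_well_formed(raw):
        return False
    dispatch = json.loads(raw)["dispatch_mw"]
    within_limits = all(
        LIMITS_MW[g][0] - tol <= p <= LIMITS_MW[g][1] + tol
        for g, p in dispatch.items()
    )
    balanced = abs(sum(dispatch.values()) - DEMAND_MW) <= tol
    return within_limits and balanced


# A polished-looking answer that passes the format check but not the physics.
polished_but_wrong = '{"dispatch_mw": {"G1": 95.0, "G2": 15.0}}'
valid_answer = '{"dispatch_mw": {"G1": 70.0, "G2": 50.0}}'

format_ok = is_well_formed(polished_but_wrong)
physics_ok = is_feasible(polished_but_wrong)
```

A training signal that rewards only the first check can be driven to 100% while the second stays flat, which is one way to read the fine-tuning results above.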
There's a second, more behavioral failure mode layered on top. When reasoning models *do* explore, they explore badly: they wander down invalid paths and abandon promising ones prematurely ('underthinking'), so success probability falls exponentially as problems get deeper [Why do reasoning models abandon promising solution paths?] [Why do reasoning LLMs fail at deeper problem solving?]. Intriguingly, simple decoding-level nudges (like penalizing rapid thought-switching) recover accuracy without any retraining, which suggests viable solutions were inside the model all along, just discarded. And there's a tension worth noticing: scaling reasoning can actively *hurt*, because longer chains create contextual distance that dilutes attention to the original instructions and constraints [Why do better reasoning models ignore instructions?].
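A decoding-level thought-switching penalty can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the intervention from the cited notes: the token IDs treated as "switch" markers, the penalty size, and the minimum thought length are all hypothetical.

```python
def penalize_thought_switch(logits, switch_token_ids, tokens_since_switch,
                            penalty=3.0, min_thought_len=50):
    """Return a copy of logits with switch tokens down-weighted early in a thought.

    logits: list of floats, one per vocabulary entry.
    switch_token_ids: set of token IDs assumed to open a new line of thought
        (for example, a token such as "Alternatively").
    tokens_since_switch: tokens generated since the current thought began.
    """
    if tokens_since_switch >= min_thought_len:
        return list(logits)  # thought is mature: leave the distribution alone
    return [
        score - penalty if token_id in switch_token_ids else score
        for token_id, score in enumerate(logits)
    ]


def argmax(values):
    """Index of the largest value (greedy decoding on a toy vocabulary)."""
    return max(range(len(values)), key=values.__getitem__)


# Toy three-token vocabulary where token 1 is the hypothetical switch marker.
raw_logits = [1.0, 2.0, 0.5]
early = penalize_thought_switch(raw_logits, {1}, tokens_since_switch=10)
late = penalize_thought_switch(raw_logits, {1}, tokens_since_switch=80)
```

In the early case the greedy choice moves from the switch token to a continuation token, and in the late case the logits are untouched. Nothing about the model's weights changes, which is why this family of fix needs no retraining.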
The surprise here is that 'reason harder' and 'reason better' are aimed at the wrong target for constrained optimization. The corpus suggests the real levers are external execution (tools, solvers) and better search discipline at decode time, not more or longer thinking. One contrarian thread even points elsewhere entirely: energy-based transformers, which treat inference as gradient-descent minimization toward a low-energy answer, post larger inference-compute gains and better out-of-distribution generalization. That hints the fix might be architectural: building the iterative *optimization step* into how the model computes rather than asking a text generator to fake it [Can energy minimization unlock reasoning without domain-specific training?]. Worth flagging: not every note agrees reasoning is useless. One argues reasoning models genuinely outperform on other task families because training installs a productive protocol [Can non-reasoning models catch up with more compute?], which sharpens the real claim: reasoning helps where the bottleneck is *deliberation*, and stalls where the bottleneck is *numerical execution*.
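The idea of inference as energy minimization can be illustrated with a toy sketch. In an energy-based transformer the energy would be a learned function of input-prediction pairs; here a hand-written quadratic stands in for it, so the energy function, step size, and starting point are all assumptions made for illustration.

```python
def energy(y, demand):
    """Toy energy: squared violation of a power-balance constraint."""
    return (sum(y) - demand) ** 2


def energy_grad(y, demand):
    """Gradient of the toy energy with respect to each component of y."""
    shared = 2.0 * (sum(y) - demand)
    return [shared for _ in y]


def infer_by_descent(y0, demand, steps, lr=0.1):
    """Refine a candidate answer by plain gradient descent on the energy."""
    y = list(y0)
    for _ in range(steps):
        grad = energy_grad(y, demand)
        y = [yi - lr * gi for yi, gi in zip(y, grad)]
    return y


# More descent steps means more inference compute and a lower-energy answer.
start = [0.0, 0.0]
after_2 = energy(infer_by_descent(start, 120.0, steps=2), 120.0)
after_20 = energy(infer_by_descent(start, 120.0, steps=20), 120.0)
after_50 = energy(infer_by_descent(start, 120.0, steps=50), 120.0)
```

The contrast with text generation is that each extra unit of inference compute here is an actual optimization step on the answer, rather than more tokens describing one.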
Sources (12 notes)
Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.
Reasoning variants with extended CoT show no consistent advantage over standard models on constraint-bound numerical tasks like optimal power flow. Extended thinking produces more text, not more iterative computation, suggesting the bottleneck is numeric procedure rather than reasoning steps.
Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.
Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.
Supervised fine-tuning makes model outputs look correct—proper JSON structure, valid identifiers, expected sections—without making them physically feasible. The model learns surface features of solutions, not the reasoning to construct valid ones.
Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.
DeepSeek-R1 and o1-preview achieve only 20-23.6% exact match on 850 constraint satisfaction problems requiring genuine backtracking. This ceiling reveals that reflective reasoning fluency does not translate to actual problem-solving competence on unfamiliar instance structures.
Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.
Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.
The MathIF benchmark shows that SFT and RL training improve reasoning but reduce instruction adherence, particularly as chain-of-thought length increases. Longer reasoning chains create contextual distance that dilutes the model's attention to original instructions.
Energy-Based Transformers assign energy values to input-prediction pairs and use gradient descent minimization for inference, yielding 35% higher training scaling rates and 29% more inference-compute gains than Transformer++, while generalizing better on out-of-distribution data without domain-specific scaffolding.
Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.