Do reasoning models actually beat standard models on optimization?
Explores whether extended chain-of-thought in reasoning models delivers performance gains on constraint-satisfaction problems like power-grid optimization. Matters because reasoning models are treated as automatic upgrades, but the evidence may not support that claim.
Reasoning models have been treated as a generalized capability upgrade: more thinking tokens at test time, broadly better performance. On constraint-bound numerical optimization the upgrade does not materialize. Reasoning variants do not systematically outperform their non-reasoning counterparts on power-grid, financial-operations, or cyber-security feasibility problems. A longer trace does not translate into more iterations of the underlying algorithm.
The reason this matters: extended chain-of-thought looks like it should help. The problem involves multi-step arithmetic, interacting constraints, and convergence-style reasoning — exactly the regime where "think more" is supposed to pay. The data say it does not. Whatever extended CoT is doing on these tasks, it is not running a Newton-Raphson iteration or a primal-dual update in latent space; it is producing more text without producing more computation.
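For contrast, here is a minimal sketch of what "running a Newton-Raphson iteration" means as explicit computation. The function name `newton_raphson` and the toy cubic are illustrative assumptions, not taken from the paper; a real power-flow solve would use a multivariate Jacobian, but the structure is the same: each pass updates numeric state, and progress is counted in passes, not in tokens.

```python
def newton_raphson(f, df, x0, tol=1e-10, max_iter=50):
    """Find a root of f starting from x0; return (root, iterations used).

    Hypothetical helper for illustration only.
    """
    x = x0
    for k in range(1, max_iter + 1):
        step = f(x) / df(x)   # Newton step: f(x) / f'(x)
        x -= step             # update the numeric state
        if abs(step) < tol:   # stop once the update is negligible
            return x, k
    raise RuntimeError("Newton-Raphson did not converge")


# Toy scalar stand-in for a mismatch equation (assumed example):
# solve x**3 - 2*x - 5 = 0 starting from x0 = 2.0.
root, iters = newton_raphson(
    lambda x: x**3 - 2 * x - 5,
    lambda x: 3 * x**2 - 2,
    x0=2.0,
)
print(f"root={root:.10f} after {iters} iterations")
```

Every iteration here reduces the residual by an actual arithmetic update. A chain-of-thought that restates the problem at greater length has no equivalent of the `x -= step` line.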
This is consistent with a growing view that reasoning models excel where the bottleneck is exploration over reasoning paths (math contests, code, multi-hop QA) but stall where the bottleneck is numeric procedure. Constraint satisfaction over real physical systems is the latter. Adding chain length adds search over verbal restatements of the problem, not iterations of the algorithm that would solve it.
The product implication: choosing a "reasoning model" for an optimization-heavy workflow is not automatically the right call. The relevant question is whether the bottleneck is verbal reasoning or numeric computation. If it is numeric, the cost-effective path is a hand-off to a solver, not more thinking tokens.
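As a rough illustration of the hand-off, the sketch below solves a toy economic-dispatch problem with a deterministic routine instead of asking a model to reason its way to the numbers. Everything in it is an assumption for illustration: the `dispatch` function, the quadratic cost form `a*p**2 + b*p`, and the two-unit data are hypothetical, and the method is the classical lambda-iteration (bisection on the marginal price), not anything described in the source paper.

```python
def dispatch(units, demand, tol=1e-9):
    """Split `demand` across units at minimum total cost.

    units: list of (a, b, p_min, p_max); unit cost is a*p**2 + b*p with a > 0.
    Hypothetical helper using bisection on the marginal price.
    """
    if not sum(u[2] for u in units) <= demand <= sum(u[3] for u in units):
        raise ValueError("demand is outside the feasible capacity range")

    def output_at(price):
        # Each unit runs where its marginal cost 2*a*p + b equals the price,
        # clipped to its own operating limits.
        return [min(max((price - b) / (2 * a), p_min), p_max)
                for a, b, p_min, p_max in units]

    # Price bracket: at `lo` every unit sits at p_min, at `hi` at p_max.
    lo = min(b + 2 * a * p_min for a, b, p_min, _ in units)
    hi = max(b + 2 * a * p_max for a, b, _, p_max in units)
    for _ in range(200):
        mid = (lo + hi) / 2
        if sum(output_at(mid)) < demand:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return output_at(hi)


# Assumed toy data: two generators, 120 units of demand.
units = [(0.01, 2.0, 10.0, 100.0), (0.02, 1.5, 10.0, 80.0)]
plan = dispatch(units, demand=120.0)
print([round(p, 3) for p in plan], round(sum(plan), 6))
```

In a workflow built this way, the language model's job shrinks to extracting `units` and `demand` from the request and explaining the result. The constraint satisfaction itself is done by a routine whose feasibility can be checked directly.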
Inquiring lines that use this note as a source (56)
This note is a source for these synthesized inquiries.
- Why do only two of fourteen models improve when problem constraints are removed?
- When does the right constraint beat additional model capacity?
- What production constraints should determine paradigm selection?
- How do unstated feasibility constraints affect model decision-making?
- What design changes could make constraint inference more reliable without explicit cuing?
- Can explicit constraint statements override the dominance of surface heuristics?
- How does step-level compute allocation compare to response-level thinking?
- How does nesting optimization levels improve on traditional network depth?
- What explains the 87 percent to 12 percent cliff in plan executability?
- What advantages emerge from running 13 times more parallel reasoning chains with the same budget?
- Do tool-enabled reasoning models close the gap on constraint satisfaction?
- Can chain-of-thought explanations be both sufficient and necessary for model decisions?
- Can external verifiers replace reasoning trace quality in solution guarantees?
- Why does most refinement in iterative models maintain answers rather than improve them?
- Why do non-reasoning models work better under extreme decomposition than reasoning models?
- Does architectural design matter more than model scale for reasoning tasks?
- How does bottleneck automation differ from accessory work displacement?
- Can parallel reasoning chains outperform longer sequential chains with the same compute?
- How does Goodhart's Law apply when safety measures become optimization targets?
- When does sequential reasoning provide exponential advantages over parallel voting?
- When does sequential chain-of-thought dramatically beat parallel voting approaches?
- Can prompt engineering improve reasoning or only move requests into denser regions?
- Why do production systems optimize for three model classes instead of foundation models?
- Can explicit optimal algorithms prevent reasoning model collapse at high complexity?
- What makes multi-paradigm chaining a distinct reasoning topology?
- Is the reasoning cliff actually a tool-use problem?
- What makes constraint satisfaction problems epistemically cleaner than other reasoning tasks?
- Can optimization algorithms exploit the shift between procedural and planning bottlenecks?
- Do higher asymptote recipes unlock genuinely novel reasoning strategies?
- Which constraint types do reasoning models handle best?
- How do random walk reasoning chains from knowledge graphs compare to traditional fine-tuning?
- Why does augmenting symbolic reasoning outperform replacing it entirely?
- How does symbolic solver feedback differ from language-based self-critique?
- Can the LLM-Modulo framework extend solver integration to domain planning?
- Why might diverse smaller models with routing beat one giant model?
- Can static reasoning patterns work better than dynamic branch selection?
- When is detailed step-by-step reasoning actually counterproductive for solving a problem?
- Can weaker models match stronger ones with sufficient search and reasoning budget?
- What distinguishes intrinsic search from extrinsic search method approaches?
- How should benchmarks evaluate workflow architecture versus raw model performance?
- Why does extended chain-of-thought reasoning fail to improve numerical optimization performance?
- Why do reasoning models fail to improve constrained optimization performance?
- What planning strategies reduce execution steps without sacrificing solution quality?
- How does making implicit reasoning requirements explicit change model performance?
- How does planning-before-execution compare to iterative reasoning and action loops?
- What limits external scaling when a model lacks reasoning foundation?
- How do KV cache pruning and subproblem contraction both free reasoning capacity?
- Why do macro and micro forecasting scales require different reasoning approaches?
- What real-world forecasting domains benefit most from contextual reasoning integration?
- Can symbolic solvers reliably replace LLM reasoning for logical tasks?
- What makes deterministic recursive reasoning models underperform on multi-solution tasks?
- Do reasoning benchmarks predict real performance in long delegated workflows?
- Why does prompt optimization alone fail to inject genuinely new knowledge?
- How do search and reasoning workflows improve forecasting performance over base models?
- What benefits do open foundation models create that closed systems cannot?
- Can architectural changes reduce representational inequality in unified generators?
Related concepts in this collection (4)
- Do larger language models solve constrained optimization better?
  Explores whether scaling LLMs (through more parameters, better training, or reasoning extensions) improves their ability to satisfy constraints in real optimization problems like power grids and portfolios.
  same paper, the parent finding
- Do large language models actually perform iterative optimization?
  Explores whether LLMs execute genuine numerical procedures like Newton-Raphson or instead pattern-match to memorized solution templates when solving constrained optimization problems.
  same paper, the mechanism
- Does more thinking time always improve reasoning accuracy?
  Explores whether extending a model's thinking tokens linearly improves performance, or if there's a point beyond which additional reasoning becomes counterproductive.
  adjacent: CoT ceilings in general
- Why does chain of thought accuracy eventually decline with length?
  Explores why longer reasoning chains don't always improve answers, and how the optimal length shifts based on task difficulty and model capability.
  adjacent: more thinking is not monotonically better
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Can Large Language Models Reason and Optimize Under Constraints?
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
- Chain of Thoughtlessness? An Analysis of CoT in Planning
- Comment on The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
- On the Reasoning Capacity of AI Models and How to Quantify It
- A Comment On "The Illusion of Thinking": Reframing the Reasoning Cliff as an Agentic Gap
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
- Divide-or-Conquer? Which Part Should You Distill Your LLM?
Original note title
reasoning models do not systematically outperform non-reasoning models on real numerical optimization — extended chain-of-thought is not a substitute for iterative computation