Do larger language models solve constrained optimization better?
Explores whether scaling LLMs—through more parameters, better training, or reasoning extensions—improves their ability to satisfy constraints in real optimization problems like power grids and portfolios.
When evaluated on real constrained-optimization problems — optimal power flow, financial portfolio constraints, cyber-security feasibility — LLMs cluster around 55-60% constraint satisfaction across virtually all conditions tested. The plateau is robust to changes in architecture, parameter count, and training regime. Reasoning models, despite extended chain-of-thought, do not systematically beat their non-reasoning counterparts on these tasks.
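To make the headline metric concrete, here is a minimal sketch of how a constraint-satisfaction rate could be computed for one candidate answer. The note does not say how the benchmark scores feasibility, so the per-constraint fraction, the `g(x) <= 0` convention, the tolerance, and the toy portfolio constraints are all assumptions made for illustration.

```python
from typing import Callable, Sequence

# Assumed convention: a constraint g is satisfied at x when g(x) <= 0.
Constraint = Callable[[Sequence[float]], float]


def satisfaction_rate(x: Sequence[float],
                      constraints: Sequence[Constraint],
                      tol: float = 1e-6) -> float:
    """Return the fraction of constraints that x satisfies within tol."""
    if not constraints:
        raise ValueError("need at least one constraint")
    satisfied = sum(1 for g in constraints if g(x) <= tol)
    return satisfied / len(constraints)


# Hypothetical portfolio constraints on a weight vector w.
constraints = [
    lambda w: abs(sum(w) - 1.0),  # weights sum to 1 (fully invested)
    lambda w: -min(w),            # no short positions
    lambda w: max(w) - 0.6,       # no single asset above 60%
    lambda w: 0.2 - w[0],         # at least 20% in the first asset
]

# A candidate answer, standing in for weights proposed by a model.
weights = [0.7, 0.4, -0.1]
rate = satisfaction_rate(weights, constraints)
print(f"constraint satisfaction: {rate:.0%}")
```

The candidate sums to one and meets the minimum holding, but it shorts the third asset and overweights the first, so it satisfies two of the four constraints. A stricter scoring rule would count an answer only when every constraint holds; which rule the benchmark uses is not stated here.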
The flatness of the plateau is the finding. Most LLM capability work assumes that the relevant axis is performance versus scale, and that closing a gap is a matter of training on more or better data. Constrained optimization does not behave that way. The benchmark isolates problems that require all of the following at once: interpreting structured input, doing multi-step arithmetic, satisfying interacting physical constraints, and converging to a feasible solution. On this joint task, the model class itself appears to be near a ceiling.
This is distinct from general reasoning benchmarks (MMLU, GPQA) and from logical reasoning benchmarks (ARC-AGI, SATBench, ZebraLogic). Those measure either broad knowledge or synthetic constraint puzzles. Real engineering optimization requires the model to execute iterative numerical procedures over physical constraints, and that procedural execution is where the plateau lives.
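As an illustration of what "iterative numerical procedure" means here, the sketch below runs Newton-Raphson on a scalar equation. The choice of method and the test function are assumptions; the note does not say which procedures the benchmark problems require. The point is only that every step consumes the exact numeric result of the previous step, so the procedure has to be executed rather than recalled.

```python
from typing import Callable, Tuple


def newton_raphson(f: Callable[[float], float],
                   df: Callable[[float], float],
                   x0: float,
                   tol: float = 1e-10,
                   max_iter: int = 50) -> Tuple[float, int]:
    """Find a root of f starting from x0; return (root, steps taken)."""
    x = x0
    for step in range(max_iter):
        fx = f(x)
        if abs(fx) <= tol:
            return x, step
        slope = df(x)
        if slope == 0.0:
            raise ZeroDivisionError("derivative vanished; Newton step undefined")
        # Each update depends on the exact value produced by the previous one.
        x = x - fx / slope
    if abs(f(x)) <= tol:
        return x, max_iter
    raise RuntimeError("Newton-Raphson did not converge")


# Toy equation x^2 - 2 = 0, whose positive root is sqrt(2).
root, steps = newton_raphson(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
print(root, steps)
```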
The deployment implication is sharp: telling executives that "LLMs will optimize the grid" or "LLMs will solve constrained portfolio problems" is currently an overclaim. The same finding suggests the productive direction is not "wait for the next model" but "change the paradigm" — restrict the LLM to abstraction tasks and hand numeric work to solvers.
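The division of labor can be sketched on a toy economic-dispatch problem, a much simpler relative of optimal power flow. Everything below is an assumption made for illustration: the JSON schema, the generator data, and the function names are hypothetical, and the spec is hard-coded where a real pipeline would have the LLM produce it from a natural-language description. A merit-order fill stands in for the solver; it is exact only for this linear-cost, single-bus case.

```python
import json

# Hypothetical output of the LLM's abstraction step: a structured problem
# spec. It is hard-coded here; no model is called.
SPEC_JSON = """
{
  "demand_mw": 150,
  "generators": [
    {"name": "g1", "cost_per_mwh": 20.0, "max_mw": 100},
    {"name": "g2", "cost_per_mwh": 35.0, "max_mw": 80},
    {"name": "g3", "cost_per_mwh": 50.0, "max_mw": 60}
  ]
}
"""


def solve_dispatch(spec: dict) -> dict:
    """Deterministic numeric step: fill demand from the cheapest unit up."""
    demand = spec["demand_mw"]
    gens = spec["generators"]
    if demand < 0 or demand > sum(g["max_mw"] for g in gens):
        raise ValueError("demand cannot be met within generator limits")
    remaining = demand
    dispatch = {}
    for g in sorted(gens, key=lambda g: g["cost_per_mwh"]):
        output = min(g["max_mw"], remaining)
        dispatch[g["name"]] = output
        remaining -= output
    return dispatch


def is_feasible(spec: dict, dispatch: dict, tol: float = 1e-9) -> bool:
    """Check power balance and per-generator limits independently of the solver."""
    balance_ok = abs(sum(dispatch.values()) - spec["demand_mw"]) <= tol
    limits_ok = all(
        -tol <= dispatch[g["name"]] <= g["max_mw"] + tol
        for g in spec["generators"]
    )
    return balance_ok and limits_ok


spec = json.loads(SPEC_JSON)
dispatch = solve_dispatch(spec)
total_cost = sum(g["cost_per_mwh"] * dispatch[g["name"]] for g in spec["generators"])
print(dispatch, total_cost, is_feasible(spec, dispatch))
```

In this arrangement the language model never produces a number that has to satisfy a constraint. Its output is a spec that can be validated, and feasibility is guaranteed by the solver and confirmed by a separate check.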
Inquiring lines that use this note as a source (110)
This note is a source for these synthesized inquiries.
- Can communication problems and optimization problems be addressed with the same alignment approaches?
- Why do only two of fourteen models improve when problem constraints are removed?
- When does the right constraint beat additional model capacity?
- Can closed-form solutions compete with gradient descent optimization?
- What structural constraints matter more than model depth for CF?
- Can surface heuristics override implicit constraints in domain-specific reasoning?
- How do unstated constraints become invisible to training data distributions?
- What production constraints should determine paradigm selection?
- How do unstated feasibility constraints affect model decision-making?
- Can routing enable heterogeneous SLM-first architectures at scale?
- How do cost-efficient LLM models compare to high-performance ones in recommendation?
- How do constrained versus unconstrained domains flip LLM novelty patterns?
- Can explicit constraint statements override the dominance of surface heuristics?
- Can universal function approximators be expensive to learn in practice?
- Why do intermediate LLM layers become more precise in frontier models?
- How does nesting optimization levels improve on traditional network depth?
- Can adaptive compute allocation at sub-token granularity improve cross-lingual robustness?
- Why do standard accuracy metrics ignore set-level consumption constraints?
- Does scaling model size solve compositional generalization problems?
- What explains the 87 percent to 12 percent cliff in plan executability?
- How does inference compute substitution affect the training parameter scaling trade-off?
- Do tool-enabled reasoning models close the gap on constraint satisfaction?
- Would hybrid systems combining LLMs with symbolic solvers overcome the retraction limitation?
- What scaling behavior do partial systems show without iterative query refinement?
- Why do scaling laws fail to predict optimal architectures at small parameter counts?
- Why do power-law distributions make standard ML infrastructure assumptions fail?
- Why do energy-based models generalize better on out-of-distribution data than standard transformers?
- Does the optimal model size depend on what capabilities you actually need?
- Can smaller models actually perform well on specific downstream tasks?
- How does fitness-proportional selection guide LLM recombination in unstructured solution spaces?
- What decomposition level minimizes both error rate and computational cost in practice?
- Can prompt optimization inject new knowledge into language models?
- Why do task-specific heuristics fail at generalizing to sparse data regions?
- How do LLMs compress specific expert knowledge into median abstraction?
- Does scaling data automatically produce compositional reasoning or just better feature encoding?
- How should inference budget adapt based on problem difficulty?
- Can smaller specialist models outperform large generalist models on domain tasks?
- Why does adjusted compression performance degrade as models scale larger?
- Does trading model size for inference steps improve overall efficiency scaling?
- How do general language model benchmarks predict specialized domain performance?
- Do standard language benchmarks underestimate what LLMs can actually do?
- Why do standard NLP benchmarks hide the most critical language limitations?
- Do LLMs fail exploration because of context integration or computational limitations?
- How does structural complexity affect LLM performance differently than inferential complexity?
- Can LLMs improve at simple deduction through different training approaches?
- What makes certain bond distributions more learnable than others?
- Why does genetic programming outperform direct LLM generation by 86 percent?
- How does the Ladder of Scales approach reduce search costs across model sizes?
- Do latent communication approaches truly escape token economics constraints?
- Why do different LLMs converge on nearly identical outputs?
- Can LLMs recover true joint distributions from marginal census data?
- Why do production systems optimize for three model classes instead of foundation models?
- What formal language complexity level matches transformer computational limits best?
- How do language agents become optimizable computational graphs automatically?
- How does constraint complexity relate to optimal reasoning token budgets?
- Can optimization algorithms exploit the shift between procedural and planning bottlenecks?
- Do small models show different parameter efficiency patterns than large models?
- How should tiny language models be architected differently than large ones?
- What planning tasks benefit most from combining LLM generation with external verification?
- Can the LLM-Modulo framework extend solver integration to domain planning?
- Can compute allocation and model routing be combined for better results?
- What makes routing a better investment than training larger models?
- Can scaling alone create compositional generalization without explicit binding mechanisms?
- Can weaker models match stronger ones with sufficient search and reasoning budget?
- Can a model be strong at MMLU but weak at long-horizon tasks?
- Does structured decomposition improve LLM reasoning in other compound tasks?
- Why do smaller LLMs fail at zero-shot argument scheme classification?
- Why does extended chain-of-thought reasoning fail to improve numerical optimization performance?
- What mechanism causes LLMs to plateau on numerical optimization tasks?
- Why do reasoning models fail to improve constrained optimization performance?
- Can LLMs successfully translate natural language into formal solver specifications?
- How should organizations redesign workflows if LLMs cannot solve optimization directly?
- What concrete problems do LLMs solve at the computational level?
- Why do language models plateau at 55 to 60 percent constraint satisfaction?
- Why do LLMs fail at directly solving stochastic control problems?
- Why do language models fail at iterative numerical optimization despite scale?
- What makes natural-language APIs particularly suited to LLM-based simulation?
- What limits the effectiveness of formal language pretraining on transformer architectures?
- What planning strategies reduce execution steps without sacrificing solution quality?
- Can citizen assemblies and value pluralism replace single utility optimization?
- Can LLM-synthesized behavioral heuristics compete with learned policy improvements?
- Can tool use or self-conditioning fix degradation in extended LLM workflows?
- What limits external scaling when a model lacks reasoning foundation?
- Why does scaling data and model size improve compositional generalization?
- Can models adapt and combine search strategies beyond their training algorithm?
- Why do macro and micro forecasting scales require different reasoning approaches?
- What real-world forecasting domains benefit most from contextual reasoning integration?
- Can symbolic solvers reliably replace LLM reasoning for logical tasks?
- Can compute budget scaling replace annotation budget in process supervision training?
- Why do language models plateau at constraint satisfaction regardless of scale?
- Can language models execute iterative numerical methods in latent space?
- Can width-scaling replace depth-scaling on inherently sequential problems?
- How does saturation-aware aggregation encourage balanced improvements across multiple rubric dimensions?
- Does fine-tuning a small model match fine-tuning a large one?
- Can smaller LLMs perform tool use tasks through modular decomposition?
- Can categorical correctness signals stop dense optimizers from finding loopholes?
- Can we systematically enumerate LLM failure modes from first principles?
- Why do LLMs fail at iterative numerical computation in latent space?
- What constraint satisfaction rate do LLMs achieve at scale?
- Can single-problem fine-tuning match full RL pipeline reasoning gains?
- Can LLMs simultaneously reason and optimize their own modules?
- How do LLM activations sparsify differently under out-of-distribution inputs?
- Do newer language model generations improve forecasting ability without additional training?
- How do search and reasoning workflows improve forecasting performance over base models?
- Can language models match competitive crowd forecasters on real future events?
- Can a two-layer network outgeneralize billion-parameter models through recursion alone?
- Can scaling data alone solve performance gaps on long-tail concepts?
- What power-law scaling patterns emerge when consistency models are trained at scale?
- What capability boundary exists in LLM prediction of effect sizes?
- Can architectural changes reduce representational inequality in unified generators?
Related concepts in this collection (5)
- Do reasoning models actually beat standard models on optimization?
  Explores whether extended chain-of-thought in reasoning models delivers performance gains on constraint-satisfaction problems like power-grid optimization. Matters because reasoning models are treated as automatic upgrades, but the evidence may not support that claim.
  same paper, the reasoning-model-specific finding
- Do large language models actually perform iterative optimization?
  Explores whether LLMs execute genuine numerical procedures like Newton-Raphson or instead pattern-match to memorized solution templates when solving constrained optimization problems.
  same paper, the mechanism for the plateau
- Should LLMs handle abstraction only in optimization?
  What if LLMs worked exclusively on translating problems to formal constraints, while deterministic solvers handled the numeric work? Explores whether this division of labor could overcome LLM failures in iterative computation.
  same paper, the proposed solution
- Does more thinking time always improve reasoning accuracy?
  Explores whether extending a model's thinking tokens linearly improves performance, or if there's a point beyond which additional reasoning becomes counterproductive.
  adjacent: chain-of-thought has its own ceiling
- Can large language models translate natural language to logic faithfully?
  This explores whether LLMs can convert natural language statements into formal logical representations without losing meaning. It matters because faithful translation is essential for any AI system that reasons formally or verifies specifications.
  adjacent: NL → formal translation limits
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Can Large Language Models Reason and Optimize Under Constraints?
- ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning
- Branch-Solve-Merge Improves Large Language Model Evaluation and Generation
- Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey
- Distilling LLMs' Decomposition Abilities into Compact Language Models
- Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs
- LM2: A Simple Society of Language Models Solves Complex Reasoning
- The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs
Original note title
LLMs plateau at 55 to 60 percent constraint satisfaction on genuine optimization regardless of scale, architecture, or training