What mechanism causes LLMs to plateau on numerical optimization tasks?

This explores why LLMs hit a hard ceiling on math optimization problems — not a temporary limit they'd scale past, but a structural one, and what's actually happening under the hood when they fail.

The corpus points to a single root cause with several faces. The short version: LLMs don't actually *iterate*. When you hand a model an optimization problem, it doesn't run the guess-check-adjust loop that a real solver runs. Instead, it recognizes the problem as similar to templates it has seen and emits plausible-looking numbers (see "Do large language models actually perform iterative optimization?"). The values look right and are often wrong, because no genuine numerical procedure happened in the latent space at all.
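For contrast, here is a minimal sketch of the guess-check-adjust loop a real solver runs. It uses plain gradient descent on a one-variable quadratic. The objective, step size, and tolerance are illustrative assumptions and do not come from the notes.

```python
def minimize_1d(grad, x0, step=0.1, tol=1e-8, max_iter=10_000):
    """Gradient descent in one variable: guess, check, adjust, repeat."""
    x = x0                      # guess
    for i in range(max_iter):
        g = grad(x)             # check: how far are we from a stationary point?
        if abs(g) < tol:
            return x, i         # converged after i adjustments
        x -= step * g           # adjust, using feedback from the check
    return x, max_iter


# Illustrative objective (an assumption): f(x) = (x - 3)^2 + 1,
# whose gradient is 2 * (x - 3) and whose minimizer is x = 3.
x_star, iters = minimize_1d(lambda x: 2 * (x - 3), x0=0.0)
print(f"x* ~ {x_star:.6f} after {iters} iterations")
```

Every update depends on the result of the previous check. That dependency chain is what a single template-matched answer skips.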

The most striking evidence that this is a wall and not a slope: across constrained-optimization tasks, models converge to roughly 55–60% constraint satisfaction *regardless of size, architecture, or training*, and reasoning-tuned models don't systematically beat standard ones (see "Do larger language models solve constrained optimization better?"). That flat line across scale is the tell. If this were a data or parameter problem, bigger models would climb. They don't. So the plateau is a property of how these systems work, not how big they are.
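To make "constraint satisfaction" concrete, one way to score it is the fraction of proposed solutions that satisfy every constraint of their problem. The notes do not say whether the 55–60% figure is computed per solution or per constraint, so the per-solution definition below is an assumption. The constraints and candidate values are invented for illustration.

```python
from typing import Callable, Sequence

Constraint = Callable[[Sequence[float]], bool]


def satisfaction_rate(candidates: Sequence[Sequence[float]],
                      constraints: Sequence[Constraint]) -> float:
    """Fraction of candidate solutions that satisfy all constraints."""
    if not candidates:
        raise ValueError("need at least one candidate solution")
    feasible = sum(all(c(x) for c in constraints) for x in candidates)
    return feasible / len(candidates)


# Hypothetical problem: x + y <= 10, x >= 0, y >= 0.
constraints = [
    lambda v: v[0] + v[1] <= 10,
    lambda v: v[0] >= 0,
    lambda v: v[1] >= 0,
]

# Hypothetical model outputs (made-up numbers, not measured data).
candidates = [(2, 3), (6, 5), (10, 0), (-1, 4), (4, 6)]

rate = satisfaction_rate(candidates, constraints)
print(f"constraint satisfaction: {rate:.0%}")
```

A checker like this is cheap and deterministic, which is why a plateau is easy to detect even when the model's outputs look plausible.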

Why doesn't fine-tuning fix it? Because fine-tuning sharpens the wrong thing. RL-tuned models, even those trained with GRPO, collapse on out-of-distribution variants (the same problem with one element changed), which means the training installed better template-matching, not an actual reasoning procedure (see "Do fine-tuned language models actually learn optimization procedures?"). This connects to a broader failure mode that the corpus describes as a kind of split-brain: models can state the correct principle at 87% accuracy but fail to *execute* it, scoring 64% in action (see "Can language models understand without actually executing correctly?"). Knowing the method and running the method are dissociated pathways. Optimization is pure execution, so it lands squarely in the gap.

The interesting lateral thread is that this isn't only an arithmetic problem; it's an exploration and search problem. LLMs are described elsewhere as "wandering explorers, not systematic searchers," lacking the validity, effectiveness, and necessity that make search converge, which is exactly why success drops exponentially as problem depth grows (see "Why do reasoning LLMs fail at deeper problem solving?"). The same brittleness shows up in simple decision tasks: models can't reliably track and aggregate their own interaction history to guide the next move without external scaffolding (see "Why do LLMs struggle with exploration in simple decision tasks?"). Iteration *is* sustained, self-correcting search over your own prior steps, and that is the thing these models structurally don't do.
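The exponential drop with depth can be illustrated with a toy model. Assume each search step is valid independently with a fixed probability `p_step`, so a chain of `depth` steps succeeds with probability `p_step ** depth`. Both the independence assumption and the value 0.9 are illustrative and are not taken from the notes.

```python
def success_probability(p_step: float, depth: int) -> float:
    """Chance that all `depth` steps are valid, assuming independent steps."""
    if not 0.0 <= p_step <= 1.0:
        raise ValueError("p_step must be a probability")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return p_step ** depth


# Assumed per-step validity of 0.9 (illustrative only).
for depth in (1, 5, 10, 20, 40):
    print(f"depth {depth:>2}: {success_probability(0.9, depth):.4f}")
```

Under this toy model, a per-step reliability that feels high still leaves deep problems nearly unsolvable. That matches the qualitative pattern the notes describe, where medium problems are tractable and deep ones are not.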

The productive response in the corpus isn't to push harder against the wall; it's to route around it. Restrict the LLM to what it's genuinely good at, which is reading messy input and translating it into formal structure, then hand the actual numeric grinding to a deterministic solver (see "Should LLMs handle abstraction only in optimization?"). A related trick has LLMs solve a simplified, deterministic version of a hard problem and use that as scaffolding for the real one (see "Can LLMs design reward functions for reinforcement learning?"). The lesson worth taking away: the plateau isn't a bug to be trained out, it's a boundary to design around. Use the model as a translator into math, not as the calculator.
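A minimal sketch of that division of labor follows. The JSON schema, the `SPEC_FROM_LLM` string, and the brute-force `solve` function are all hypothetical. They stand in for whatever structured output the model emits and whatever real solver consumes it. The point is only that the model supplies structure and deterministic code supplies the numbers.

```python
import itertools
import json

# Hypothetical model output: a formal problem statement, not an answer.
SPEC_FROM_LLM = """
{
  "variables": ["x", "y"],
  "bounds": {"x": [0, 10], "y": [0, 10]},
  "maximize": {"x": 3, "y": 2},
  "constraints": [
    {"coeffs": {"x": 1, "y": 1}, "max": 4},
    {"coeffs": {"x": 1, "y": 3}, "max": 6}
  ]
}
"""


def solve(spec):
    """Exhaustively search integer points; a stand-in for a real solver."""
    names = spec["variables"]
    ranges = [range(spec["bounds"][n][0], spec["bounds"][n][1] + 1)
              for n in names]
    best, best_value = None, None
    for values in itertools.product(*ranges):
        point = dict(zip(names, values))
        # Keep only points that satisfy every linear constraint.
        feasible = all(
            sum(k * point[n] for n, k in c["coeffs"].items()) <= c["max"]
            for c in spec["constraints"]
        )
        if not feasible:
            continue
        value = sum(k * point[n] for n, k in spec["maximize"].items())
        if best_value is None or value > best_value:
            best, best_value = point, value
    return best, best_value


best, best_value = solve(json.loads(SPEC_FROM_LLM))
print(best, best_value)
```

Exhaustive search is used here only to keep the sketch dependency-free. In practice the same specification would be passed to a proper LP or MIP solver. Either way, feasibility is enforced by code rather than estimated by the model.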

Sources 8 notes

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Can language models understand without actually executing correctly?

Large language models can articulate correct principles but systematically fail to apply them due to dissociated instruction and execution pathways. The 87% accuracy in explanations versus 64% in actions reveals this is not knowledge deficit but structural disconnect.

Why do reasoning LLMs fail at deeper problem solving?

Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.

Why do LLMs struggle with exploration in simple decision tasks?

Across multi-armed bandit environments, only GPT-4 with explicit exploratory hints, external history summarization, and chain-of-thought reasoning achieves satisfactory exploration. Without external summarization, models cannot reliably track and aggregate unstructured interaction history to guide exploratory decisions.

Should LLMs handle abstraction only in optimization?

LLMs plateau at roughly 55–60% constraint satisfaction regardless of scale, but excel at natural-language-to-formal-structure translation. The productive architecture restricts LLMs to reading input and emitting solver code, leaving numeric iteration to deterministic solvers.

Can LLMs design reward functions for reinforcement learning?

MEDIC shows that LLMs can generate effective reward shaping functions by first solving a deterministic, simplified version of the RL problem, then converting the resulting plan into shaping rewards for the original stochastic task. A model-based critic validates LLM outputs before deployment.
