Why do LLMs fail at directly solving stochastic control problems?
This note explores why LLMs stumble when asked to *directly* compute solutions to control problems involving randomness and uncertainty, and what the corpus suggests works instead.
This question reads as: why can't an LLM just take a stochastic control problem — one where outcomes are uncertain and you have to optimize a policy over many possible futures — and produce the right answer end-to-end? The corpus points to a single underlying reason with two faces: LLMs are autoregressive pattern-matchers, not iterative solvers, and stochastic control is precisely a problem that demands iteration over uncertainty.
The sharpest evidence is that LLMs don't actually *run* numerical procedures: they recognize a problem as template-similar to ones they've seen and emit plausible-looking values without executing the underlying computation (see "Do large language models actually perform iterative optimization?"). Stochastic control is built on exactly the kind of iterative machinery (value iteration, expectation over distributions, policy refinement) that this failure mode breaks. You can see the symptom downstream: on genuine constrained-optimization tasks, models plateau around 55–60% constraint satisfaction regardless of size or whether they're 'reasoning' models, which points to a ceiling, not a scaling gap (see "Do larger language models solve constrained optimization better?"). More scale doesn't buy you the missing capability because the capability isn't a matter of degree.
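To make "iterative machinery" concrete, here is a minimal sketch of value iteration on a toy problem. The two-state MDP, its transition probabilities, rewards, and the discount factor are all invented for illustration and do not come from the corpus. The point is only that the answer emerges from many repeated sweeps, each taking an expectation over random outcomes, rather than from a single pass.

```python
# Hypothetical two-state, two-action MDP; every number here is illustrative.
# P[state][action] is a list of (probability, next_state, reward) outcomes.
P = {
    0: {"safe": [(1.0, 0, 1.0)],
        "risky": [(0.5, 1, 0.0), (0.5, 0, 4.0)]},
    1: {"safe": [(1.0, 0, 0.0)],
        "risky": [(1.0, 1, -1.0)]},
}
GAMMA = 0.9  # assumed discount factor


def q_value(V, s, a):
    """Expected one-step return of action a in state s under values V."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])


def value_iteration(tol=1e-8, max_sweeps=10_000):
    """Apply the Bellman optimality backup until the values stop changing."""
    V = {s: 0.0 for s in P}
    sweep = 0
    for sweep in range(1, max_sweeps + 1):
        new_V = {s: max(q_value(V, s, a) for a in P[s]) for s in P}
        delta = max(abs(new_V[s] - V[s]) for s in P)
        V = new_V
        if delta < tol:
            break
    # Greedy policy with respect to the converged values.
    policy = {s: max(P[s], key=lambda a: q_value(V, s, a)) for s in P}
    return V, policy, sweep


V, policy, sweeps = value_iteration()
print(policy, sweeps)
```

Even on this tiny example the loop needs many sweeps before the values settle, which is the kind of computation a single forward pass over a prompt has no obvious way to carry out.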
There's a deeper, almost information-theoretic version of why. If you frame an LLM as a machine that maximizes the probability of the next token, then tasks whose *correct* answer is a low-probability string become systematically hard even when they're logically trivial, and this is predictable in advance (see "Can we predict where language models will fail?"). The optimal action in a stochastic problem is often not the 'fluent' or typical-looking one; it's whatever the math says, which may sit far from the model's learned distribution over plausible continuations. The architecture's strength (fluency) is the same thing working against it.
What's interesting is what the corpus says *does* work, and it's a consistent move: don't ask the LLM to solve the stochastic problem, ask it to do the part it's good at and offload the solving. MEDIC has the LLM solve a *deterministic, simplified* version of the problem first, then converts that plan into reward-shaping signals for the real stochastic task, with a model-based critic checking the output before it's trusted (see "Can LLMs design reward functions for reinforcement learning?"). The same philosophy shows up in LLM Programs, which embed the model inside an explicit algorithm that manages state and control flow, handing the LLM only the narrow, step-specific judgment it's reliable for (see "Can algorithms control LLM reasoning better than LLMs alone?").
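The plan-then-shape idea can be sketched as follows. This is not MEDIC's actual implementation: it swaps in standard potential-based reward shaping, F(s, s') = gamma * Phi(s') - Phi(s), and every name here (`det_step`, `critic_accepts`, `potential_from_plan`, the corridor environment, the hard-coded plan) is a hypothetical stand-in. The plan that an LLM would propose is represented by a fixed list of actions.

```python
GAMMA = 0.95  # assumed discount factor
GOAL = 4      # hypothetical 1-D corridor with states 0..4


def det_step(state, action):
    """Simplified, noise-free dynamics: action is -1 (left) or +1 (right)."""
    return min(max(state + action, 0), GOAL)


def critic_accepts(plan, start):
    """Model-based check: replay the plan in the deterministic model."""
    state = start
    for action in plan:
        if action not in (-1, 1):
            return False
        state = det_step(state, action)
    return state == GOAL


def potential_from_plan(plan, start):
    """Phi(s) = how far along the planned path s lies; off-path states get 0."""
    phi, state = {start: 0.0}, start
    for i, action in enumerate(plan, start=1):
        state = det_step(state, action)
        phi[state] = float(i)
    return phi


def shaping_reward(phi, s, s_next):
    """Potential-based shaping term added to the real task's reward."""
    return GAMMA * phi.get(s_next, 0.0) - phi.get(s, 0.0)


proposed_plan = [1, 1, 1, 1]  # stands in for an LLM-proposed plan
# Only trust the plan if the critic can verify it; otherwise shape nothing.
phi = potential_from_plan(proposed_plan, 0) if critic_accepts(proposed_plan, 0) else {}
print(shaping_reward(phi, 1, 2), shaping_reward(phi, 2, 1))
```

The shaping term rewards transitions that follow the verified plan and penalizes ones that move backwards along it, while the stochastic task itself is still solved by whatever RL algorithm consumes the shaped reward.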
The thing you might not have expected: the answer to 'LLMs can't do control' is not 'so don't use them for control.' Reinforcement learning *does* successfully scale LLMs to long-horizon, stateful, delayed-reward tasks, roughly doubling SWE-bench Verified performance (from 20% to 39%) in multi-step environments (see "Can reinforcement learning scale beyond single-turn language tasks?"). The lesson across the collection is that the LLM works as a *component* whose behavior is shaped by an external optimization loop (or a deterministic scaffold, or a critic), rather than as the solver that internalizes the stochastic dynamics itself. The failure is about asking one tool to be the whole pipeline.
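The "component, not pipeline" pattern can be sketched as a small scaffold in which ordinary code owns the state and control flow, and the model is consulted for one narrow step at a time. This is an illustrative assumption, not the LLM Programs implementation: `run_scaffold`, `fake_model`, and `checker` are hypothetical names, and `fake_model` is a stub standing in for a real model call.

```python
from typing import Callable, Dict, List, Optional


def run_scaffold(steps: List[str],
                 propose: Callable[[str], str],
                 accept: Callable[[str, str], bool],
                 max_retries: int = 2) -> Dict[str, Optional[str]]:
    """Explicit loop manages state; the model only ever sees one step."""
    results: Dict[str, Optional[str]] = {}
    for step in steps:
        results[step] = None                 # stays None if nothing is accepted
        for _ in range(max_retries + 1):
            candidate = propose(step)        # narrow, step-specific call
            if accept(step, candidate):      # external check before trusting it
                results[step] = candidate
                break
    return results


def fake_model(step: str) -> str:
    """Stub standing in for an LLM call; it just upper-cases its input."""
    return step.upper()


def checker(step: str, candidate: str) -> bool:
    """Hypothetical validator: accept only purely alphabetic answers."""
    return candidate == step.upper() and candidate.isalpha()


results = run_scaffold(["plan", "act2"], fake_model, checker)
print(results)
```

The design choice is that a rejected step leaves a visible `None` in the scaffold's state instead of an unverified value, so failures surface in the surrounding algorithm and not inside the model's output.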
Sources (6 notes)
Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.
Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.
By framing LLMs as autoregressive probability machines, researchers predicted that tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed these predictions on tasks such as reciting the alphabet backwards and counting letters.
MEDIC shows that LLMs can generate effective reward shaping functions by first solving a deterministic, simplified version of the RL problem, then converting the resulting plan into shaping rewards for the original stochastic task. A model-based critic validates LLM outputs before deployment.
LLM Programs embed LLMs within explicit algorithms that manage control flow and state, presenting only step-specific context to each LLM call. This information hiding addresses capability and context window limits while treating complex reasoning as modular, debuggable sub-tasks.
Modified DAPO training roughly doubled SWE-bench Verified performance from 20% to 39% on Qwen2.5-72B, matching larger models. This demonstrates RL works in stateful multi-step environments with delayed rewards and complex feedback, beyond theoretical single-turn MDPs.