Can a model predict the right action but execute the wrong one?

This explores the gap between a model knowing the correct action and actually taking it — whether reasoning the right move and executing it are separate abilities that can come apart.

This reads the question as being about the "knowing-doing gap": can a model work out the right move and then fail to make it? The corpus says yes, and surprisingly often. The clearest evidence comes from work showing that LLMs generate correct rationales about 87% of the time but act on that reasoning only 64% of the time (Why do language models fail to act on their own reasoning?). The model isn't confused: it states the right plan and then defaults to a greedy, frequency-biased choice instead. That 23-point gap between knowing and doing persists across model sizes, which means scaling alone doesn't close it.
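To make the arithmetic behind the gap concrete, here is a minimal sketch of how the two rates could be tallied from logged episodes. The `Episode` record and `knowing_doing_gap` helper are hypothetical names introduced for illustration, the data is synthetic and built to mirror the quoted 87% and 64% figures, and the sketch assumes both rates are measured over all episodes, which the source note does not spell out.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    # Hypothetical record, not taken from the cited work.
    rationale_correct: bool  # did the stated reasoning identify the right action?
    action_correct: bool     # did the executed action match the right action?


def knowing_doing_gap(episodes):
    """Return (knowing_rate, doing_rate, gap) over a list of episodes."""
    if not episodes:
        raise ValueError("need at least one episode")
    n = len(episodes)
    knowing = sum(e.rationale_correct for e in episodes) / n
    doing = sum(e.action_correct for e in episodes) / n
    return knowing, doing, knowing - doing


# Synthetic data constructed to mirror the quoted rates; not real measurements.
episodes = (
    [Episode(True, True)] * 64      # right plan, right action
    + [Episode(True, False)] * 23   # right plan, wrong action
    + [Episode(False, False)] * 13  # wrong plan, wrong action
)

knowing, doing, gap = knowing_doing_gap(episodes)
print(f"knowing={knowing:.2f} doing={doing:.2f} gap={gap:.2f}")
```

The middle group is the interesting one: episodes where the rationale was right and the action was not. Counting them separately from outright reasoning failures is what makes the gap a distinct measurement rather than a restatement of overall accuracy.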

Why would prediction and action diverge? One answer is that being accurate "on average" is not the same as being right where it counts. A model can fit data well overall yet systematically mispredict in exactly the decision-critical states that determine the outcome (Why do accurate predictions lead to poor decisions?). So even a model with the right general picture can execute the wrong action precisely at the moments that matter most: accuracy and good decisions are formally distinct properties.
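A toy illustration of that distinction, under assumptions of my own: the numbers, the act-if-positive decision rule, and the two models below are invented for the example and do not come from the cited work. Model A has the lower average error but flips the sign in the only two states where the decision actually depends on the prediction; model B is worse on average yet never picks the wrong action.

```python
# True value of acting in each state; not acting is worth 0.
# In the first eight states acting is clearly right, so small errors are harmless.
# The last two are decision-critical: the true value sits near the threshold.
true_values = [5.0] * 8 + [-0.4, 0.4]

# Hypothetical model A: exact on easy states, wrong sign on the critical ones.
model_a = [5.0] * 8 + [0.4, -0.4]
# Hypothetical model B: off by 1.0 on easy states, exact on the critical ones.
model_b = [4.0] * 8 + [-0.4, 0.4]


def mean_abs_error(pred, truth):
    """Average absolute prediction error across states."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)


def wrong_decisions(pred, truth):
    """Count states where 'act if predicted value > 0' picks the wrong action."""
    return sum((p > 0) != (t > 0) for p, t in zip(pred, truth))


for name, pred in [("A", model_a), ("B", model_b)]:
    mae = mean_abs_error(pred, true_values)
    wrong = wrong_decisions(pred, true_values)
    print(f"model {name}: mean abs error={mae:.2f}, wrong decisions={wrong}")
```

Ranking the two models by average error and ranking them by decision quality give opposite answers here, which is the sense in which the two properties come apart.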

The failure also compounds once the model is acting in a loop. When a model's own earlier mistakes fill its context, performance degrades non-linearly: it starts conditioning on its own errors and digging deeper (Do models fail worse when their own errors fill the context?). This matters because post-training pushes models from passive prediction toward treating their outputs as actions that shape future inputs (Do models recognize their own outputs as actions shaping future inputs?), so a single wrong execution doesn't just cost one step; it contaminates everything downstream. There's also a directional bias baked into how models update: they're optimistic about actions they chose and pessimistic about the roads not taken (Do language models learn differently from good versus bad outcomes?), which can lock in a wrong action even when the better one was knowable.

The useful twist is that some of these gaps are trainable rather than fundamental. Reinforcement learning can narrow the knowing-doing gap directly (Why do language models fail to act on their own reasoning?), and there's a related lesson about *how* you reward: binary correct/incorrect rewards push models toward confident execution of wrong answers because they never penalize confident mistakes, and adding a Brier-score calibration term fixes this without sacrificing accuracy (Does binary reward training hurt model calibration?). So the right-prediction-wrong-action problem is partly an artifact of training signals that reward acting decisively over acting correctly.
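A minimal sketch of why the calibration term changes the incentive. The source note says only that a Brier score is added as a second reward term; the exact form used here (correctness minus the squared gap between stated confidence and correctness) and all function names are assumptions made for illustration.

```python
def binary_reward(correct: bool) -> float:
    """Plain correctness reward: stated confidence plays no role."""
    return 1.0 if correct else 0.0


def calibrated_reward(correct: bool, confidence: float) -> float:
    """Correctness minus a Brier penalty (assumed form, see lead-in)."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


def expected_calibrated_reward(p_correct: float, confidence: float) -> float:
    """Expected reward when the answer is right with probability p_correct."""
    return (p_correct * calibrated_reward(True, confidence)
            + (1.0 - p_correct) * calibrated_reward(False, confidence))


# A confident mistake costs more than a hesitant one under the calibrated reward,
# while the binary reward scores every mistake the same.
print(binary_reward(False))           # 0.0
print(calibrated_reward(False, 1.0))  # -1.0
print(calibrated_reward(False, 0.5))  # -0.25

# For an answer that is right 60% of the time, scan stated confidence levels
# and find the one that maximizes expected reward.
grid = [i / 10 for i in range(11)]
best = max(grid, key=lambda q: expected_calibrated_reward(0.6, q))
print(best)  # 0.6
```

Under the binary reward, claiming certainty costs nothing when the answer turns out wrong. With the squared penalty, expected reward peaks when stated confidence equals the true chance of being right, so overclaiming stops paying.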

The thing worth walking away with: "knowing" and "doing" are genuinely separate capabilities in these systems, and the gap between them is a measurable, distinct failure mode, not just noise that a larger model would wash out. A model that reasons perfectly can still be a bad agent, and fixing the reasoning won't automatically fix the acting.

Sources 6 notes

Why do language models fail to act on their own reasoning?

LLMs generate correct reasoning 87% of the time but follow it only 64% of the time. Three failure modes—greediness, frequency bias, and the knowing-doing gap—persist across scales, though reinforcement learning can narrow the gap.

Why do accurate predictions lead to poor decisions?

Research formalizes necessary and sufficient conditions for predictive models to support optimal decisions. A model can predict accurately on average yet systematically mispredict in decision-critical states.

Do models fail worse when their own errors fill the context?

Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.

Do models recognize their own outputs as actions shaping future inputs?

Post-trained language models exhibit a measurable shift where they recognize their outputs become their own future inputs, closing an action-perception loop absent in pretraining. Evidence includes 3-4x lower output entropy on-policy and behavioral signatures of trajectory recognition.

Do language models learn differently from good versus bad outcomes?

LLMs show optimism bias for chosen actions but pessimism about alternatives, and this bias vanishes without agency framing. Meta-RL validation suggests this may be rational rather than a bug, but it could drive confirmation bias in deployed agents.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
