Why do language models fail to act on their own reasoning?
LLMs produce correct explanations far more often than they produce correct actions. What causes this knowing-doing gap, and can training methods close it?
Three systematic failure modes explain why LLMs perform sub-optimally in sequential decision-making: greediness (premature commitment to exploitative strategies, leaving up to 55% of the action space unexplored), frequency bias (small models copying the most frequent actions regardless of reward), and the knowing-doing gap (producing correct rationales but failing to act on them).
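To make the greediness failure mode concrete, here is a minimal sketch of how action coverage could be measured on a toy multi-armed bandit. Everything in it is an assumption made for illustration (the Bernoulli bandit, the names `run_agent` and `action_coverage`, the `epsilon` parameter, and the zero-initialised value estimates); it is not the evaluation setup behind the figures quoted above.

```python
import random


def run_agent(num_arms, steps, epsilon=0.0, seed=0):
    """Play a toy Bernoulli bandit and return how often each arm was pulled."""
    rng = random.Random(seed)
    true_means = [rng.random() for _ in range(num_arms)]  # hidden reward rates
    pulls = [0] * num_arms
    reward_sums = [0.0] * num_arms
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(num_arms)
        else:
            # Exploit: pick the arm with the highest empirical mean.
            # Untried arms are valued at 0.0; ties go to the lowest index.
            estimates = [
                reward_sums[a] / pulls[a] if pulls[a] else 0.0
                for a in range(num_arms)
            ]
            arm = max(range(num_arms), key=estimates.__getitem__)
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        pulls[arm] += 1
        reward_sums[arm] += reward
    return pulls


def action_coverage(pulls):
    """Fraction of the action space that was tried at least once."""
    return sum(1 for count in pulls if count > 0) / len(pulls)


greedy_pulls = run_agent(num_arms=10, steps=50, epsilon=0.0)
exploring_pulls = run_agent(num_arms=10, steps=50, epsilon=0.3)
print("greedy coverage:", action_coverage(greedy_pulls))
print("exploring coverage:", action_coverage(exploring_pulls))
```

In this toy setup the purely greedy agent never leaves the first arm, so it covers one arm out of ten. The unexplored fraction is simply `1 - action_coverage(pulls)`, which is the kind of quantity the "up to 55% unexplored" figure refers to.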
The knowing-doing gap is the most conceptually significant finding. When LLMs generate chain-of-thought rationales about how to solve a decision-making task, 87% of the rationales are correct — yet only 64% of the subsequent actions follow the rationale's recommendation. The model knows what to do but defaults to greedy behavior instead of following its own reasoning.
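The two percentages can be read as two rates computed over the same set of logged decision steps. The sketch below shows one way such a tally might look. The `Step` record and its field names are hypothetical, the log is synthetic and built only to reproduce the quoted rates, and using all steps as the denominator for both rates is an assumption, since the note does not say whether the action rate is conditioned on correct rationales.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One logged decision step (hypothetical schema)."""
    rationale_correct: bool
    action_follows_rationale: bool


def knowing_doing_gap(steps):
    """Return (knowing rate, doing rate, gap) over a list of steps."""
    if not steps:
        raise ValueError("need at least one logged step")
    total = len(steps)
    knowing = sum(s.rationale_correct for s in steps) / total
    doing = sum(s.action_follows_rationale for s in steps) / total
    return knowing, doing, knowing - doing


# Synthetic log of 100 steps, constructed to match the rates in the text.
log = (
    [Step(True, True)] * 64      # correct rationale, action follows it
    + [Step(True, False)] * 23   # correct rationale, action ignores it
    + [Step(False, False)] * 13  # incorrect rationale
)
knowing, doing, gap = knowing_doing_gap(log)
print(f"knowing={knowing:.2f} doing={doing:.2f} gap={gap:.2f}")
```

The middle group of steps, a correct rationale followed by an action that ignores it, is where the knowing-doing gap lives.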
Scale helps only partially: larger models (27B) show less frequency bias but remain greedy. RL fine-tuning on self-generated CoT rationales mitigates all three failure modes by increasing exploration and aligning actions with rationales. This suggests the gap is trainable, not architectural.
This connects directly to the concept of Potemkin understanding. As explored in "Can LLMs understand concepts they cannot apply?", a model can explain an idea correctly while failing to use it, and the knowing-doing gap is a measurable instance of exactly this pattern: the model demonstrates understanding in its rationale but fails in its action selection. The quantified gap (87% vs 64%) gives the Potemkin understanding concept empirical grounding.
The deeper implication is that CoT reasoning and action selection may involve different computational pathways. "Do language models actually use their encoded knowledge?" asks whether facts a model encodes causally influence what it generates. In the same way, the knowing-doing gap may reflect a disconnect where the reasoning trace is generated through one pathway while action selection draws on different (shallower, more habitual) computations.
Alice in Wonderland: the overconfidence amplifier. The "Alice in Wonderland" paper demonstrates a dramatic instance of the knowing-doing gap on a trivially simple reasoning problem: "Alice has N brothers and M sisters. How many sisters does Alice's brother have?" Most SOTA models collapse entirely on this problem, producing incorrect answers with strong overconfidence while providing "reasoning-like explanations akin to confabulations" to justify clearly failed responses. Standard interventions (enhanced prompting, multi-step re-evaluation) fail to recover correct answers. The confabulation-like quality of the justifications directly parallels the knowing-doing gap: the model generates plausible reasoning traces that do not correspond to correct computation. Notable exceptions are Claude 3 Opus and GPT-4, which occasionally succeed but still fail frequently, suggesting the problem is architectural rather than model-specific.
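For reference, the problem has a one-line answer: each of Alice's brothers has all M of Alice's sisters plus Alice herself, so M + 1. The sketch below states that formula and checks it by enumerating the family. The function names are made up for this note, and the usual reading of the puzzle is assumed (Alice is female and all siblings share the same parents).

```python
def sisters_of_alices_brother(num_brothers, num_sisters):
    """Closed-form answer: Alice's M sisters plus Alice herself."""
    if num_brothers < 1:
        raise ValueError("Alice needs at least one brother")
    return num_sisters + 1


def count_by_enumeration(num_brothers, num_sisters):
    """Build the sibling group explicitly and count one brother's sisters."""
    if num_brothers < 1:
        raise ValueError("Alice needs at least one brother")
    siblings = (
        [("Alice", "F")]
        + [(f"sister{i}", "F") for i in range(num_sisters)]
        + [(f"brother{i}", "M") for i in range(num_brothers)]
    )
    brother = next(person for person in siblings if person[1] == "M")
    # A brother's sisters are all female siblings other than himself.
    return sum(
        1 for person in siblings
        if person[1] == "F" and person is not brother
    )


# The formula and the enumeration agree on a small grid of family sizes.
for n in range(1, 5):
    for m in range(0, 5):
        assert sisters_of_alices_brother(n, m) == count_by_enumeration(n, m)

print(sisters_of_alices_brother(num_brothers=3, num_sisters=6))  # prints 7
```

The number of brothers N does not affect the answer at all, as long as there is at least one.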
Inquiring lines that use this note as a source
This note is a source for these synthesized inquiries.
- Why do LLM explanations feel authoritative even when alignment with the model fails?
- Why do language models fail at planning despite understanding strategies?
- Why does LLM knowledge fail to influence their actual outputs?
- Can LLMs explain concepts correctly while failing to use them?
- What causes LLMs to ignore unstated constraints they know about?
- What makes action-producing models fail in ways text models typically do not?
- Why can LLMs interpret formal logic better than they generate it?
- How can a model explain something correctly yet fail to apply it?
- Why do LLMs understand efficient language but fail to produce it?
- Can a model predict the right action but execute the wrong one?
- Why do LLMs understand therapy techniques but fail to execute them?
- Why do LLMs fail at counterfactual reasoning despite factual knowledge?
- Why do LLMs explain correct reasoning but then choose greedy actions?
- Why do LLMs choose incorrect edits despite understanding the task?
- How do knowing and doing diverge in LLM decision-making?
- Why do LLMs fail at faithful autoformalisation of reasoning problems?
- How do LLM explanations diverge from actual internal reasoning?
- Why do strong models struggle more with instruction following than mid-tier ones?
- What makes a model fail to activate relevant skills from its own harness?
Related concepts in this collection
- Can LLMs understand concepts they cannot apply?
  Explores whether large language models can correctly explain ideas while simultaneously failing to use them, and whether that combination reveals something fundamentally different from ordinary mistakes.
  instantiates: the 87%/64% gap is a quantified example of Potemkin understanding
- Do language models actually use their encoded knowledge?
  Probes can detect that LMs encode facts internally, but do those encoded facts causally influence what the model generates? This explores the gap between knowing and doing.
  explains: action selection may bypass the reasoning trace entirely
- Does chain of thought reasoning actually explain model decisions?
  When language models show their reasoning steps in agentic pipelines, does the quality of those steps predict or explain the quality of final outputs? This matters for trusting and debugging AI systems.
  parallels: both show reasoning traces decoupled from downstream behavior
- Does RL post-training create reasoning or just deploy it?
  Investigates whether reasoning capability emerges during RL fine-tuning or already exists in base models. Matters because it reshapes how we build and optimize reasoning systems.
  complicates: RL fine-tuning can narrow the knowing-doing gap, suggesting RL does teach something beyond timing
- What limits how much models can improve themselves?
  Explores whether self-improvement has fundamental boundaries set by how well models can verify versus generate solutions, and what this means across different task types.
  the knowing-doing gap (87% rationales vs 64% actions) is an empirical instance of the generation-verification gap in decision-making: RL fine-tuning narrows this gap, consistent with the formal prediction that self-improvement operates precisely where verification exceeds generation
Related papers in this collection
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey
- LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities
- Reasoning Strategies in Large Language Models: Can They Follow, Prefer, and Optimize?
- Large Language Model Reasoning Failures
- LLMs can implicitly learn from mistakes in-context
- Probing Structured Semantics Understanding and Generation of Language Models via Question Answering
- Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models
- Are Emergent Abilities in Large Language Models just In-Context Learning?
Original note title
llms are greedy agents with a knowing-doing gap — correct rationales 87 percent but greedy actions 64 percent