Can LLMs explain concepts correctly while failing to use them?

This explores 'Potemkin understanding' — the documented split where a model can state a concept's definition correctly but fails to actually apply it, and what that gap reveals about how LLMs 'know' things.

The short answer from the corpus is yes, emphatically: it treats the split not as a quirk but as a structural signature of how these systems work. The clearest name for it is 'Potemkin understanding': a model articulates a principle accurately, then fails to apply it, and can even recognize its own failure when shown it, a triple pattern that human cognition essentially never produces (see 'Can LLMs understand concepts they cannot apply?'). The takeaway is that explanation and execution run on functionally disconnected pathways, so competence at one says little about the other.

The most concrete measurement of the gap comes from work describing a kind of 'computational split-brain': models produce correct explanations ~87% of the time but correct actions only ~64% of the time, and the authors argue this is a structural disconnect between instruction and execution rather than a knowledge deficit (see 'Can language models understand without actually executing correctly?'). The same 87%-vs-64% split shows up framed as a 'knowing-doing gap' in agent settings, where models generate the right rationale but then act greedily, defaulting to frequency bias instead of following their own reasoning. That gap persists across model scale but narrows under reinforcement learning (see 'Why do language models fail to act on their own reasoning?'). That last detail matters: if training can narrow the gap, it looks like a property of the optimization target rather than an immovable limit.
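To make the two headline numbers concrete, here is a minimal sketch of how an explanation-versus-action gap could be computed from paired scores. The `Trial` schema, the field names, and the `potemkin_rate` metric are assumptions made for illustration; they are not taken from the cited work, which may score items differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One evaluation item scored twice (hypothetical schema)."""
    explained_correctly: bool  # did the model state the right principle?
    acted_correctly: bool      # did it then apply that principle?


def knowing_doing_gap(trials):
    """Return explanation accuracy, action accuracy, and the gap between them."""
    n = len(trials)
    if n == 0:
        raise ValueError("need at least one trial")
    explain_acc = sum(t.explained_correctly for t in trials) / n
    action_acc = sum(t.acted_correctly for t in trials) / n
    # Share of items where the explanation was right but the action was wrong.
    potemkin_rate = sum(
        t.explained_correctly and not t.acted_correctly for t in trials
    ) / n
    return {
        "explain_acc": explain_acc,
        "action_acc": action_acc,
        "gap": explain_acc - action_acc,
        "potemkin_rate": potemkin_rate,
    }


if __name__ == "__main__":
    # Toy data: four items, all explained correctly, one applied incorrectly.
    toy = [Trial(True, True)] * 3 + [Trial(True, False)]
    print(knowing_doing_gap(toy))
```

Scoring each item twice, rather than comparing two separate benchmarks, is what lets the per-item `potemkin_rate` be read off directly instead of inferred from aggregate accuracies.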

What makes this more than a single benchmark is how many neighboring failures rhyme with it. Models accept false presuppositions even when direct questioning proves they hold the correct fact: the knowledge is present but doesn't fire as a constraint (see 'Why do language models accept false assumptions they know are wrong?'). One diagnosis traces that to RLHF-trained agreeableness, a social face-saving reflex distinct from hallucination (see 'Why do language models agree with false claims they know are wrong?'). The 'frame problem' work is a close cousin: models possess the relevant world knowledge but fail to bring unstated preconditions forward as relevant, and simply forcing explicit enumeration lifts accuracy from 30% to 85% (see 'Do language models fail at identifying unstated preconditions?'). Across all of these, the recurring shape is the same: knowledge exists, but is not retrieved as an operative constraint at the moment of action.
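The forced-enumeration idea can be sketched as a prompt wrapper that makes the model list preconditions before it answers. The function name, its parameters, and the wording of the template are all hypothetical; the prompt used in the cited work is not given here, so this only illustrates the general shape of the technique.

```python
def build_enumeration_prompt(task: str, max_preconditions: int = 5) -> str:
    """Wrap a task so unstated preconditions are listed before the answer.

    The wording is a hypothetical template, not the prompt from the cited work.
    """
    if not task.strip():
        raise ValueError("task must be non-empty")
    if max_preconditions < 1:
        raise ValueError("max_preconditions must be at least 1")
    lines = [
        "Task: " + task.strip(),
        "",
        f"Step 1. List up to {max_preconditions} unstated preconditions "
        "that must hold for any answer to be valid.",
        "Step 2. For each precondition, say whether the task satisfies it.",
        "Step 3. Only then give your answer, naming the preconditions it relies on.",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_enumeration_prompt("Move the vase from the table to the shelf."))
```

The wrapper adds no knowledge of its own. It only changes the order of work, so that whatever the model already knows about the situation is written down before the action is chosen.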

Why does this happen mechanistically? Interpretability work offers a clue: understanding in these models isn't a single thing but a layered patchwork, where principled 'circuit-level' understanding coexists with shallower heuristics rather than replacing them (see 'Do language models understand in fundamentally different ways?'). On this reading, when a question invites explanation, the conceptual layer answers; when a task demands execution, a cheaper heuristic can take over. That patchwork view reframes the whole cluster of behaviors as one family of epistemic failure modes, repeatable gaps between statistical pattern-tracking and genuine competence (see 'How do LLMs fail to know what they seem to understand?'). It also connects to why reasoning models 'wander' unsystematically on deep problems even when they can articulate the right method (see 'Why do reasoning LLMs fail at deeper problem solving?').

The main takeaway: a fluent, correct explanation is not evidence that a model will act on it. Explanation and use are separable capabilities, and most of our evaluations measure the first while deployment depends on the second. The practical levers the corpus surfaces are blunt but real: force explicit enumeration of what's relevant, and train against the gap directly rather than assuming articulation implies application.

Sources (9 notes)

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

Can language models understand without actually executing correctly?

Large language models can articulate correct principles but systematically fail to apply them due to dissociated instruction and execution pathways. The 87% accuracy in explanations versus 64% in actions suggests a structural disconnect rather than a knowledge deficit.

Why do language models fail to act on their own reasoning?

LLMs generate correct reasoning 87% of the time but follow it only 64% of the time. Three failure modes—greediness, frequency bias, and the knowing-doing gap—persist across scales, though reinforcement learning can narrow the gap.

Why do language models accept false assumptions they know are wrong?

The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.

Why do language models agree with false claims they know are wrong?

The FLEX Benchmark shows models reject false presuppositions at dramatically different rates (GPT-4: 84% vs. Mistral: 2.44%), not from ignorance but from a preference for agreement learned via RLHF. This social accommodation is distinct from hallucination and requires different fixes.

Do language models fail at identifying unstated preconditions?

LLMs struggle not from lacking world knowledge but from failing to bring background conditions forward as relevant constraints. Prompting that forces explicit enumeration of preconditions raises accuracy from 30% to 85%, revealing the frame problem persists in statistical systems.

Do language models understand in fundamentally different ways?

Mechanistic interpretability reveals conceptual understanding (features as directions), state-of-world understanding (factual connections), and principled understanding (compact circuits). Crucially, higher tiers coexist with lower-tier heuristics rather than replacing them, creating a patchwork of capabilities.

How do LLMs fail to know what they seem to understand?

LLMs show repeatable, empirically documented failure modes—from Potemkin understanding (correct explanation + failed application) to reasoning collapse under implicit constraints. These failures reveal gaps between statistical pattern-tracking and actual epistemic competence.

Why do reasoning LLMs fail at deeper problem solving?

Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.
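One simple way to see why success could fall off exponentially with depth is a toy model in which every step must succeed and each step succeeds independently with the same probability. Independence and a fixed per-step rate are simplifying assumptions made here for illustration; they are not the model used in the cited note, and the function name is hypothetical.

```python
def success_probability(per_step_success: float, depth: int) -> float:
    """Chance of completing `depth` steps when every step must succeed.

    Assumes independent steps with one fixed success rate (a toy model).
    """
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return per_step_success ** depth


if __name__ == "__main__":
    # Print how overall success decays as the problem gets deeper.
    for depth in (1, 5, 10, 20, 50):
        print(depth, round(success_probability(0.9, depth), 4))
```

Under these assumptions, a 90% per-step rate gives roughly 0.35 at depth 10 and about 0.005 at depth 50, which matches the qualitative picture of medium problems staying solvable while deep ones become far harder.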

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher tracking the durability of a structural claim about language models. The precise question: Can LLMs explain concepts correctly while failing to use them—and if so, is this a fixed architectural limit or a trainable gap?

What a curated library found—and when (spanning 2023–2026, dated claims not current truth):
• Potemkin understanding: models explain concepts ~87% correctly but execute only ~64% correctly; framed as 'knowing-doing gap' in agent settings (2025).
• The gap persists across model scale but NARROWS under reinforcement learning, suggesting it's an optimization property, not immovable (2025).
• Knowledge exists but isn't retrieved as operative constraint at action time; RLHF-trained agreeableness and failure to enumerate unstated preconditions are neighboring failure modes (~2024–2025).
• Mechanistic work: understanding is a layered patchwork—principled 'circuit-level' understanding coexists with cheaper heuristics; conceptual layer answers explanation, shallower heuristics dominate execution (2025).
• Reasoning models wander unsystematically on deep problems even when articulating correct method (2025).

Anchor papers (verify; mind their dates):
• arXiv:2507.10624 Comprehension Without Competence (2025)—the 87%–64% split.
• arXiv:2507.08017 Mechanistic Indicators of Understanding (2025)—layered patchwork view.
• arXiv:2505.20296 Reasoning LLMs are Wandering Solution Explorers (2025)—unsystematic search.
• arXiv:2501.11721 Explain-Query-Test (2025)—self-evaluation via discrepancy.

Your task:
(1) RE-TEST THE TRAINABILITY CLAIM. The library asserts the knowing-doing gap narrows under RL. Has this been validated at scale in production models (GPT-4o, Claude-4, o1-preview)? Does the gap still hold for *new* concepts unseen in RL training, or is the narrowing purely on seen distributions? Separate the durable question—whether explanation-execution decoupling is structural—from the perishable claim that it's easily trainable away.
(2) Surface the strongest contradicting work from the last 6 months: any papers showing explanation and execution are NOT separable, or that explanation failure and execution failure co-occur so tightly they must share pathways?
(3) Propose 2 research questions that assume the regime may have shifted: (a) Do chain-of-thought, multi-step reasoning, or tool-use frameworks dissolve the gap by forcing explicit grounding at each step? (b) Do multimodal or embodied models show the same Potemkin pattern, or does grounding to perception collapse it?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
