Can LLMs understand concepts they cannot apply?
Explores whether large language models can correctly explain ideas while simultaneously failing to use them—and whether that combination reveals something fundamentally different from ordinary mistakes.
The Potemkin understanding paper identifies a failure pattern that is categorically different from ordinary LLM error. When a model correctly explains an ABAB rhyme scheme, then fails to generate one, then recognizes that its own generation doesn't rhyme, that triple combination is not just wrong but incoherent. No human who could give that explanation would behave that way, so the pattern cannot be reconciled with any human model of misunderstanding.
This is worth separating from other LLM failure types because the mechanism matters for diagnosis and repair:
- Ordinary errors (fabrication, factual mistakes): the model lacks information or generates plausible-but-false continuations. Fix: better retrieval, grounding, training data.
- Surface generalizations: the model learned correlations that worked in training but don't generalize structurally. Fix: better training curriculum, structural probing.
- Potemkin understanding: the model produces a correct explanation, fails to apply it, and then recognizes the failure. This combination implies that explanation-generation and concept-application are functionally disconnected. No single epistemic fix addresses both.
The "Potemkin" framing (after Potemkin villages, facades with nothing behind them) is precise: the model passes tests designed to detect understanding, but those tests were calibrated on how humans misunderstand concepts. They work as diagnostics only if LLMs misunderstand concepts the same way humans do. Potemkin understanding means the model can perform at the surface without the underlying integration that the tests were designed to probe.
Benchmarks used to evaluate LLMs are also used to evaluate people. They are valid tests only if LLMs fail in human-compatible ways. Potemkin understanding shows that this assumption fails — LLMs can fail in ways that no human cognitive model predicts.
The three-domain evidence (literary techniques, game theory, psychological biases) shows this is not domain-specific. Across domains: near-perfect explanation accuracy, significant application failure, model recognition of failure. The incoherence is stable.
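The three-part pattern can be made concrete as a small tally. The following is a minimal sketch, assuming each trial has already been graded into three booleans; the `Trial` record, the function names, and the toy data are all hypothetical, and the conditional rate computed here is a simplification rather than the paper's exact metric.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    domain: str            # e.g. "literary techniques"
    explained: bool        # model gave a correct explanation of the concept
    applied: bool          # model used the concept correctly
    flagged_failure: bool  # model judged its own failed output as incorrect


def is_potemkin(t: Trial) -> bool:
    """Correct explanation, failed application, and recognition of the failure."""
    return t.explained and not t.applied and t.flagged_failure


def potemkin_rate_by_domain(trials):
    """Among trials with a correct explanation, the share showing the full pattern."""
    explained = defaultdict(int)
    potemkin = defaultdict(int)
    for t in trials:
        if t.explained:
            explained[t.domain] += 1
            potemkin[t.domain] += is_potemkin(t)
    # Domains with no correct explanations are omitted (rate is undefined there).
    return {d: potemkin[d] / explained[d] for d in explained}


# Hypothetical toy data, not results from the paper.
trials = [
    Trial("literary techniques", True, False, True),
    Trial("literary techniques", True, True, False),
    Trial("game theory", True, False, True),
    Trial("game theory", True, False, False),
    Trial("psychological biases", False, False, False),
]
rates = potemkin_rate_by_domain(trials)
```

Conditioning on a correct explanation is the design choice that matters: a trial where the model cannot explain the concept at all is an ordinary error, not a Potemkin.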
The "computational split-brain syndrome" diagnosis. "Comprehension Without Competence" provides the architectural analysis underlying Potemkin understanding. Through controlled experiments, the authors demonstrate that instruction and action pathways are geometrically and functionally dissociated — a phenomenon they term computational split-brain syndrome. The failure is not in knowledge access but in computational execution. LLMs function as powerful pattern completion engines but lack the architectural scaffolding for principled, compositional reasoning. This diagnosis also clarifies why mechanistic interpretability findings may reflect training-specific pattern coordination rather than universal computational principles. The geometric separation between instruction and execution pathways represents a structural limitation, not a knowledge limitation.
The Explain-Query-Test (EQT) framework provides direct empirical measurement of the explanation-comprehension gap. In EQT, a model (1) generates an explanation of a topic, (2) generates question-answer pairs from that explanation, and (3) answers those same questions without access to its own explanation. The finding: models consistently fail questions derived from their own explanations. The EQT gap correlates strongly with MMLU-Pro benchmark performance, which makes EQT a benchmark-free evaluation method that uses only the model's own outputs as ground truth. Critically, the size of the gap varies by domain: biology and psychology (domains where models initially perform well) show the largest EQT drops, while law and engineering (lower baseline) show smaller drops. This suggests Potemkin understanding is worst precisely where surface performance is highest, a counterintuitive result that demands explanation. High benchmark performance may mask a disconnect between explanation and comprehension rather than reveal genuine understanding.
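The three EQT steps can be sketched as a short loop. This is an illustration under stated assumptions, not the paper's implementation: the three callables stand in for calls to the same model, their names are hypothetical, and exact-match scoring is an assumption made here for simplicity.

```python
from typing import Callable, List, Tuple

QAPair = Tuple[str, str]  # (question, reference answer)


def eqt_accuracy(
    topic: str,
    explain: Callable[[str], str],
    make_questions: Callable[[str], List[QAPair]],
    answer: Callable[[str], str],
) -> float:
    """Accuracy of a model on questions derived from its own explanation."""
    explanation = explain(topic)            # step 1: explain the topic
    qa_pairs = make_questions(explanation)  # step 2: derive Q/A pairs from it
    if not qa_pairs:
        raise ValueError("no questions were generated")
    # Step 3: answer each question; the explanation is deliberately not passed in.
    correct = sum(
        answer(question).strip().lower() == reference.strip().lower()
        for question, reference in qa_pairs
    )
    return correct / len(qa_pairs)


# Hypothetical stubs standing in for model calls, for demonstration only.
def stub_explain(topic: str) -> str:
    return f"{topic}: lines 1 and 3 rhyme; lines 2 and 4 rhyme."


def stub_questions(explanation: str) -> List[QAPair]:
    return [
        ("Which line rhymes with line 1?", "3"),
        ("Which line rhymes with line 2?", "4"),
    ]


def stub_answer(question: str) -> str:
    return "3"  # a stub that always gives the same answer


score = eqt_accuracy("ABAB rhyme scheme", stub_explain, stub_questions, stub_answer)
```

With these stubs the model answers one of its two self-derived questions correctly, so `score` is 0.5. In a real run, a score well below the model's baseline accuracy on the same topic would indicate the gap described above.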
Inquiring lines that use this note as a source (181)
This note is a source for these synthesized inquiries.
- Why do some LLM clusters cite broader psychology than others?
- Can LLMs infer situational context the way humans do pragmatically?
- How does content-only knowledge in LLMs enable pretraining popularity to leak through?
- What should we call errors in LLM outputs when hallucination does not apply?
- How do fixed pragmatic templates prevent models from understanding context?
- What happens when LLMs analyze literary irony that relies on understatement?
- Why do LLMs fail inter-annotator agreement tests on argument evaluation?
- What alignment artifacts suppress critical knowledge in LLM-generated explanations?
- How does LLM hallucination risk manifest in knowledge graph construction?
- Where do LLMs succeed at generation but struggle with evaluation?
- Why do LLM personas struggle with specificity in specialized domains like law?
- Why do LLMs generate ideas that sound novel but fail during execution?
- What specific execution barriers do LLM ideas encounter most frequently?
- Can output-layer corrections fix fundamental cultural representation deficits in LLMs?
- How much of LLM reasoning failure stems from missing knowledge versus signal weighting?
- What makes a problem instance unfamiliar to a language model?
- Can LLMs use implicit background knowledge the way humans do in ordinary conversation?
- How widespread is task contamination in LLM evaluation benchmarks today?
- Why do LLM explanations feel authoritative even when alignment with the model fails?
- Can models identify what information they are missing in underspecified problems?
- Why do LLM outputs match researcher priors without solving tasks correctly?
- Why do NLP benchmarks systematically exclude ambiguous test cases from evaluation?
- Why do NLP benchmarks exclude ambiguous instances from evaluation?
- Is interpretive multiplicity a bug in language or a feature?
- Can large language models actually deliver cognitive behavioral therapy techniques?
- How does the outer loop escape its own LLM's knowledge boundaries when discovering mechanisms?
- Why do text-to-image models fail at composing multiple concepts together?
- Why do language models fail at planning despite understanding strategies?
- Why do reasoning models fail on structurally unfamiliar instances?
- Why do language models fail when semantic content is stripped away?
- Why does LLM research ideation collapse into low diversity despite high novelty?
- How do rare linguistic registers differ from conceptually complex examples?
- Can smaller open-source LLMs reliably detect agreement across unfamiliar topics?
- Why do monological explanations fail to transfer understanding compared to dialogical ones?
- Why do LLMs produce semantically acceptable but pragmatically disengaged responses?
- Can explicit connectives compensate for missing intentional tracking in LLMs?
- Why do large language models fail at temporal reasoning in complex legal cases?
- What happens when formal languages satisfy hierarchy but fail learnability constraints?
- Why does homework adherence remain low despite advances in language model capability?
- Why does LLM knowledge fail to influence their actual outputs?
- Can LLMs explain concepts correctly while failing to use them?
- What causes LLMs to ignore unstated constraints they know about?
- What cognitive capacities do LLMs actually lack that commentary assumes they have?
- Why does semantic decoupling specifically break LLM reasoning abilities?
- Can LLMs improve at metaphor if they handle decoupled semantics better?
- How does implicit meaning processing limit LLM pragmatic reasoning?
- How do LLMs compress specific expert knowledge into median abstraction?
- Why does entity recognition act as a self-knowledge mechanism in LLMs?
- Why does AI struggle with wordplay when it has access to word embeddings?
- Why do LLMs excel at generation but struggle with evaluation?
- Why do large language models still have systematic blind spots with complex structures?
- Which linguistic abilities are learnable from human-sized data exposure?
- Is relevant knowledge encoded in LMs but not causally active in generation?
- What reveals the epistemic limits of language models?
- How do we distinguish knowledge encoding from knowledge usage in models?
- Can instance seeds work for tasks beyond language understanding benchmarks?
- Does more thinking always help large language models or sometimes hurt?
- Can knowledge density explain why LLM writing feels coherent but fatiguing?
- Can pruning half of LLM layers affect knowledge retrieval performance?
- What role does failure and vulnerability play in real linguistic practice?
- Do LLMs struggle more with semantic accuracy than syntactic correctness across domains?
- Can LLMs learn to signal evaluative commitment through metadiscursive language?
- Can LLMs infer implicit meaning without surface linguistic markers?
- Why do LLMs fail at implicit elements in literary and poetic text?
- Can complexity-stratified testing reveal whether LLMs understand grammatical structure?
- Why do rare complex structures in training data harm LLM generalization?
- What happens when LLMs grade other LLMs in closed evaluation loops?
- Why do LLMs fail at semantic generalization despite grammatical accuracy?
- Can LLMs distinguish ethical cases that differ only in critical nouns?
- What structural limits prevent LLMs from abstracting moral principles?
- Does exposure to more domain-specific examples reduce LLM overconfidence?
- What distinguishes entity errors from relation errors in LLM output?
- Can language models accurately evaluate the quality of their own ideas?
- What makes a novel research idea practically infeasible for implementation?
- Can understanding language happen entirely within a language system alone?
- Where do LLMs fail as knowledge systems compared to humans?
- Why do LLMs generate novel ideas but lack evaluative commitment?
- What internal mechanisms explain LLM reasoning and representation limits?
- Why do NLP benchmarks hide LLM failures in ambiguity handling?
- Can training LLMs to form ad-hoc conventions improve their pragmatic reasoning?
- Why can LLMs identify argument structure but not check warrants?
- Why do LLMs fail when asked to use counter-commonsense rules explicitly?
- Can LLMs translate between natural language and formal logic faithfully?
- Do metaphors work by decoupling meaning from linguistic associations?
- Why do LLMs struggle with negation and exception handling?
- Why can't LLMs reason from first principles or initial commitments?
- Can LLMs identify implicit metaphoric mappings that require pragmatic inference?
- Do standard language benchmarks underestimate what LLMs can actually do?
- Why do LLMs explain evidence accurately while missing its implications?
- Do LLMs generate more novel ideas than they can evaluate?
- How might human-LLM teams reinforce each other's causal reasoning mistakes?
- Do LLMs rely on surface statistical patterns instead of causal structure?
- Why do standard NLP benchmarks hide the most critical language limitations?
- How does the inability to manage ambiguity undermine literary analysis tasks?
- Why can LLMs interpret formal logic better than they generate it?
- Can LLM semantic representations exist without causally influencing their generation output?
- Do LLMs fail exploration because of context integration or computational limitations?
- What data presentation structures enable LLMs to learn decision-making from examples?
- Can training procedures fix LLM accommodation of false presuppositions?
- How much does question framing affect LLM accuracy on knowledge tasks?
- Can LLMs learn to ask clarifying questions instead of guessing?
- How does structural complexity affect LLM performance differently than inferential complexity?
- Can LLMs improve at simple deduction through different training approaches?
- How do LLMs handle false presuppositions embedded in user questions?
- Why does LLM compression eliminate causal grounding in conceptual representations?
- Can encoder models match human conceptual structure better than larger language models?
- How much semantic meaning survives when LLMs paraphrase poetry and literary text?
- What distinguishes surface generalizations from true linguistic generalizations?
- Why do surface generalizations fail on unusual syntactic structures?
- How can a model explain something correctly yet fail to apply it?
- What distinguishes conceptual understanding from statistical pattern matching in models?
- Can LLMs reliably generate novel working architectures without structured representations?
- Why do NLP models fail at recognizing multiple valid interpretations?
- Do LLMs learn linguistic generalizations or just surface-level frequency patterns?
- Why do LLMs understand efficient language but fail to produce it?
- Why do benchmark tests fail to detect LLM comprehension gaps?
- Can behavioral self-awareness in LLMs extend to recognizing their own contradictions?
- Do LLMs learn surface patterns instead of genuine linguistic structure?
- How does structural complexity in sentences degrade LLM reasoning systematically?
- Can LLMs compute how presuppositions project through embedded clauses?
- Can language models distinguish between novel insight and unjustified conceptual blending?
- How does an instruction-following LLM activate latent retrieval knowledge?
- Can linear probing detect all the concepts a language model actually uses?
- Can LLMs recognize rhetorical devices they cannot actually produce themselves?
- Can LLMs distinguish stylistic patterns that carry meaning from mere convention?
- Why do models overthink underspecified problems instead of rejecting them?
- Why do LLMs generate novel ideas but struggle to evaluate them?
- What structural barriers prevent LLMs from making evaluative judgments about writing?
- Why do LLMs struggle to translate natural language into logical formalizations?
- Why does monological training prevent models from overriding statistical priors?
- How does the LLM Fallacy prevent users from noticing cognitive debt accumulating?
- Why do experts experiencing the LLM Fallacy fail to develop custodian skills?
- Why do reasoning models fail at learning hidden rules from sparse exceptions?
- Why do LLMs understand therapy techniques but fail to execute them?
- What training data barriers prevent LLMs from learning real Socratic dialogue?
- Can LLMs generate more novel research ideas than human experts?
- Can grammar alone repair misunderstanding without ritual correction work?
- How do structured benchmarks hide theory of mind failures in LLMs?
- Why do LLMs recognize graph entities without modeling their relationships?
- What happens when students encounter errors they cannot resolve through prompting alone?
- Why do backward-looking benchmarks underestimate LLM scientific value?
- Is the distinction between pretense and realization meaningful for LLMs?
- Why do LLMs fail at counterfactual reasoning despite factual knowledge?
- Can LLMs reason through semantics without understanding causal mechanisms?
- Does compressing Walton's schemes into nine categories make LLM classification easier?
- What concrete problems do LLMs solve at the computational level?
- What implicit knowledge about catalogs do LLMs learn from ranking signals alone?
- What latent mechanisms do LLMs use when they cannot execute iterative methods?
- What role do model-based critics play in validating LLM plans?
- Why do LLMs explain correct reasoning but then choose greedy actions?
- Why do LLM descriptions of argument schemes work better than formal definitions for classification?
- Why do LLMs choose incorrect edits despite understanding the task?
- How do knowing and doing diverge in LLM decision-making?
- Can surface-level correctness hide failures in structural learning by LLMs?
- Why do language models fail at understanding ambiguous or complex requirements?
- Why does the Chinese Room argument miss the deeper abstraction problem?
- Why do LLMs strip applicability conditions during memory abstraction?
- At what complexity does LLM discourse failure become practically harmful?
- Why do LLM stories over-explain themes and favor single-track plots?
- How does the knowing-doing gap relate to Potemkin understanding?
- Can we use LLM language without adopting LLM assumptions?
- How do LLMs lose information when translating natural language to formal logic?
- Why do LLMs fail at faithful autoformalisation of reasoning problems?
- How can we probe LLM representations in channels that training did not target?
- How does the pretraining distribution shape what LLMs find hard?
- How can humans evaluate explanations from systems they did not train?
- Why do language models ignore condensed memory even when it is the only memory?
- Why do LLMs rely on content knowledge instead of collaborative signals?
- Why do LLMs struggle more when only numerical values change?
- Can irrelevant information reliably expose the limits of LLM reasoning?
- How faithful are natural language explanations from LLMs really?
- Can LLMs reliably audit other language models for errors?
- What structural framework prevents LLM explanations from becoming just plausible fiction?
- How do LLM explanations diverge from actual internal reasoning?
- How should we rethink the symbolism versus connectionism debate in light of LLMs?
- Why do multimodal models fail on rare and underrepresented concepts?
- Why do LLMs reason fluently about causality but lack causal rigor?
- What prevents LLM representations from causally influencing generation outputs?
- Why does LLM fluency create false perceptions of professional standing and expertise?
- What capability boundary exists in LLM prediction of effect sizes?
- Do rare cultural concepts fail predictably as model scale increases?
Related concepts in this collection (6)
- Do language models actually use their encoded knowledge?
Probes can detect that LMs encode facts internally, but do those encoded facts causally influence what the model generates? This explores the gap between knowing and doing.
related mechanism: knowledge can be present without causally influencing behavior; Potemkin extends this to a more observable test (explanation vs. application)
- Can models pass tests while missing the actual grammar?
Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.
Potemkin understanding adds the recognition-of-failure component that surface generalization accounts don't predict
- Do language models actually use their reasoning steps?
Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations.
faithful reasoning would prevent Potemkin: the explanation would causally constrain the application
- Can identical outputs hide broken internal representations?
Can neural networks produce correct outputs while having fundamentally fractured internal structure that prevents generalization and creativity? This challenges our assumptions about what performance benchmarks actually measure.
The fractured entangled representation (FER) account provides the mechanistic cause of Potemkin understanding: the internal representation is fractured across arbitrary subdomains and entangled across unrelated computations, which is why explanation-generation and concept-application are functionally disconnected despite identical surface performance
- What limits how much models can improve themselves?
Explores whether self-improvement has fundamental boundaries set by how well models can verify versus generate solutions, and what this means across different task types.
the generation-verification gap formalizes why Potemkin understanding paradoxically enables self-improvement: when explanation exceeds application, that gap is a usable training signal — the model's verification ability can supervise its generation ability
- Why do language models fail to act on their own reasoning?
LLMs produce correct explanations far more often than they produce correct actions. What causes this knowing-doing gap, and can training methods close it?
quantified instance: the 87%/64% gap between correct rationales and correct actions in sequential decision-making is the most precisely measured example of Potemkin understanding; RL fine-tuning narrows the gap, suggesting the facade is partially trainable
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Large Language Model Reasoning Failures
- Comprehension Without Competence: Architectural Limits of LLMs in Symbolic Computation and Reasoning
- Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey
- Word Meanings in Transformer Language Models
- Explain-Query-Test: Self-Evaluating LLMs Via Explanation and Comprehension Discrepancy
- Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
- The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning
- Probing Structured Semantics Understanding and Generation of Language Models via Question Answering
Original note title
potemkin understanding is a distinct failure mode where correct explanation combined with failed application is incoherent not merely wrong