Can LLMs understand concepts they cannot apply?

Explores whether large language models can correctly explain ideas while simultaneously failing to use them—and whether that combination reveals something fundamentally different from ordinary mistakes.

Synthesis note · 2026-02-21 · sourced from Philosophy Subjectivity

The Potemkin understanding paper identifies a failure pattern that is categorically different from ordinary LLM error. When a model correctly explains an ABAB rhyme scheme, fails to generate one, and then recognizes that its own generation does not rhyme, the triple combination is not just wrong but incoherent. A human who could give that explanation and spot that failure would not have produced the failure in the first place, so the combination is irreconcilable with any human cognitive pattern.

This is worth separating from other LLM failure types because the mechanism matters for diagnosis and repair:

Ordinary errors (fabrication, factual mistakes): the model lacks information or generates plausible-but-false continuations. Fix: better retrieval, grounding, training data.
Surface generalizations: the model learned correlations that worked in training but don't generalize structurally. Fix: better training curriculum, structural probing.
Potemkin understanding: the model produces a correct explanation, fails to apply it, and then recognizes the failure. This combination implies that explanation-generation and concept-application are functionally disconnected. No single epistemic fix addresses both sides of that disconnect.

The "Potemkin" framing (after Potemkin villages, facades with nothing behind them) is precise: the model passes benchmarks designed to detect understanding because those benchmarks were built around the ways humans misunderstand concepts. Such tests work as diagnostics only if LLMs misunderstand concepts the same way humans do. Potemkin understanding means the model can perform at the surface without the underlying integration the tests were designed to probe.

Benchmarks used to evaluate LLMs are also used to evaluate people. They are valid tests only if LLMs fail in human-compatible ways. Potemkin understanding shows that this assumption fails — LLMs can fail in ways that no human cognitive model predicts.

The three-domain evidence (literary techniques, game theory, psychological biases) shows this is not domain-specific. Across domains: near-perfect explanation accuracy, significant application failure, model recognition of failure. The incoherence is stable.
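The triple pattern described above can be made concrete with a small tally. The following is a minimal sketch, not the paper's metric: the `Trial` record, its field names, and the sample rows are all hypothetical and exist only to show how the explain/apply/recognize combination could be flagged per domain.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Trial:
    """One graded trial (hypothetical schema, not taken from the paper)."""
    domain: str
    explained: bool           # the model's explanation was graded correct
    applied: bool             # the model's application was graded correct
    recognized_failure: bool  # the model flagged its own output as wrong


def is_potemkin(t: Trial) -> bool:
    # The incoherent triple: correct explanation, failed application,
    # and the model itself recognizes that the application failed.
    return t.explained and not t.applied and t.recognized_failure


def potemkin_share(trials: Iterable[Trial]) -> Dict[str, float]:
    """Fraction of trials per domain that show the incoherent triple."""
    totals: Counter = Counter()
    hits: Counter = Counter()
    for t in trials:
        totals[t.domain] += 1
        if is_potemkin(t):
            hits[t.domain] += 1
    return {domain: hits[domain] / totals[domain] for domain in totals}


# Illustrative rows only; these are NOT results from any study.
sample_trials = [
    Trial("literary techniques", True, False, True),
    Trial("literary techniques", True, True, False),
    Trial("game theory", True, False, True),
    Trial("game theory", True, False, False),
    Trial("psychological biases", True, False, True),
    Trial("psychological biases", True, True, False),
    Trial("psychological biases", False, False, False),
    Trial("psychological biases", True, True, False),
]

if __name__ == "__main__":
    for domain, share in potemkin_share(sample_trials).items():
        print(f"{domain}: {share:.2f}")
```

Keeping the three judgments as separate booleans means an ordinary error (wrong explanation, wrong application) is counted differently from the incoherent case, which is the distinction the note draws.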

The "computational split-brain syndrome" diagnosis. "Comprehension Without Competence" provides the architectural analysis underlying Potemkin understanding. Through controlled experiments, the authors demonstrate that instruction and action pathways are geometrically and functionally dissociated — a phenomenon they term computational split-brain syndrome. The failure is not in knowledge access but in computational execution. LLMs function as powerful pattern completion engines but lack the architectural scaffolding for principled, compositional reasoning. This diagnosis also clarifies why mechanistic interpretability findings may reflect training-specific pattern coordination rather than universal computational principles. The geometric separation between instruction and execution pathways represents a structural limitation, not a knowledge limitation.

The Explain-Query-Test (EQT) framework provides direct empirical measurement of the explanation-comprehension gap. In EQT, a model (1) generates an explanation of a topic, (2) generates question-answer pairs from that explanation, and (3) answers those same questions without access to its own explanation. The finding: models consistently fail questions derived from their own explanations. The EQT gap correlates strongly with MMLU-Pro benchmark performance, which makes EQT a benchmark-free evaluation method that uses only the model's own outputs as ground truth. Critically, the size of the gap varies by domain: biology and psychology (domains where models initially perform well) show the largest EQT drops, while law and engineering (lower baseline) show smaller drops. This suggests Potemkin understanding is worst precisely where surface performance is highest, a counterintuitive result that demands explanation. High benchmark performance may mask explanation-comprehension disconnection rather than reveal genuine understanding.
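The three EQT steps can be sketched as a short loop. This is an illustration under stated assumptions, not the authors' implementation: the `Model` callable, the prompt wording, the `Q: ... | A: ...` line format, and exact-match scoring are all hypothetical choices, and `toy_model` is a canned stand-in so the example runs without any real LLM.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: a model is any function from prompt text to reply text.
Model = Callable[[str], str]


def parse_qa(text: str) -> List[Tuple[str, str]]:
    """Parse lines of the assumed form 'Q: question | A: answer'."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Q:") and "| A:" in line:
            question, answer = line[2:].split("| A:", 1)
            if question.strip() and answer.strip():
                pairs.append((question.strip(), answer.strip()))
    return pairs


def normalize(s: str) -> str:
    # Exact match after light normalization is an assumption made here;
    # a real evaluation would likely need a more forgiving grader.
    return s.strip().lower().rstrip(".")


def eqt_accuracy(model: Model, topic: str) -> float:
    # Step 1: the model explains the topic.
    explanation = model(f"Explain: {topic}")
    # Step 2: the model writes question-answer pairs from its own explanation.
    qa_text = model(
        "Write question-answer pairs, one per line as 'Q: ... | A: ...', "
        f"from this text:\n{explanation}"
    )
    pairs = parse_qa(qa_text)
    if not pairs:
        raise ValueError("no question-answer pairs could be parsed")
    # Step 3: the model answers each question WITHOUT seeing the explanation.
    correct = sum(
        normalize(model(f"Answer briefly: {q}")) == normalize(ref)
        for q, ref in pairs
    )
    return correct / len(pairs)


def toy_model(prompt: str) -> str:
    """Canned stand-in for an LLM; it gets one of its own two questions wrong."""
    if prompt.startswith("Explain:"):
        return "An ABAB scheme rhymes lines 1 and 3, and lines 2 and 4."
    if prompt.startswith("Write question-answer pairs"):
        return (
            "Q: Which line rhymes with line 1? | A: line 3\n"
            "Q: Which line rhymes with line 2? | A: line 4"
        )
    if "line 1" in prompt:
        return "Line 3."
    return "line 1"  # deliberately wrong answer to the second question


if __name__ == "__main__":
    print(eqt_accuracy(toy_model, "ABAB rhyme scheme"))
```

The score here is only the accuracy on self-derived questions. Turning it into a gap would require comparing it with a baseline accuracy on the same topic, which this sketch leaves out.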

Related concepts in this collection 6

Do language models actually use their encoded knowledge? Probes can detect that LMs encode facts internally, but do those encoded facts causally influence what the model generates? This explores the gap between knowing and doing.
Related mechanism: knowledge can be present without causally influencing behavior; Potemkin understanding extends this to a more observable test (explanation vs. application).

Can models pass tests while missing the actual grammar? Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.
Potemkin understanding adds the recognition-of-failure component that surface generalization accounts don't predict.

Do language models actually use their reasoning steps? Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations.
Faithful reasoning would prevent Potemkin understanding: the explanation would causally constrain the application.

Can identical outputs hide broken internal representations? Can neural networks produce correct outputs while having fundamentally fractured internal structure that prevents generalization and creativity? This challenges our assumptions about what performance benchmarks actually measure.
The fractured entangled representation (FER) account provides the mechanistic cause of Potemkin understanding: the internal representation is fractured across arbitrary subdomains and entangled across unrelated computations, which is why explanation-generation and concept-application are functionally disconnected despite identical surface performance.

What limits how much models can improve themselves? Explores whether self-improvement has fundamental boundaries set by how well models can verify versus generate solutions, and what this means across different task types.
The generation-verification gap formalizes why Potemkin understanding paradoxically enables self-improvement: when explanation exceeds application, that gap is a usable training signal, because the model's verification ability can supervise its generation ability.

Why do language models fail to act on their own reasoning? LLMs produce correct explanations far more often than they produce correct actions. What causes this knowing-doing gap, and can training methods close it?
Quantified instance: the 87%/64% gap between correct rationales and correct actions in sequential decision-making is the most precisely measured example of Potemkin understanding; RL fine-tuning narrows the gap, suggesting the facade is partially trainable.

Original note title

potemkin understanding is a distinct failure mode where correct explanation combined with failed application is incoherent not merely wrong
