How faithful are natural language explanations from LLMs really?

This explores whether an LLM's spoken-aloud explanations actually reflect what the model knows or does — and the corpus suggests the explanation pathway is often disconnected from the application and execution pathways, so a fluent explanation is weak evidence of real competence.

This explores whether an LLM's natural-language explanations actually track what the model knows and does, rather than just sounding right. The collection's blunt answer: explanation and execution are partly separate machinery, so a faithful-sounding account is not reliable evidence of underlying competence. The sharpest version of this is what one line calls Potemkin understanding — models that explain a concept correctly, then fail to apply it, and can even recognize their own failure, a triple combination no coherent human cognition produces Can LLMs understand concepts they cannot apply?. A companion note quantifies the gap as a kind of split-brain: ~87% accuracy when articulating a principle versus ~64% when acting on it, framed not as a knowledge deficit but as dissociated instruction and execution pathways Can language models understand without actually executing correctly?.

If you ask why the words and the doing come apart, mechanistic interpretability offers a structural reason. Understanding inside these models isn't one thing — it's a layered patchwork of conceptual features, factual world-state connections, and compact reasoning circuits, where the higher tiers sit on top of shallow heuristics rather than replacing them Do language models understand in fundamentally different ways?. An explanation can be drawn from a different layer than the one that produced the answer, which is exactly the condition for an explanation that's eloquent and unfaithful. The same patchwork shows up as a catalogued set of repeatable epistemic failure modes — gaps between statistical pattern-tracking and actual competence that surface in predictable ways How do LLMs fail to know what they seem to understand?.

There's a second, less obvious threat to faithfulness: the model has social incentives to say things that aren't true to its own state. Models will accommodate a false claim they can demonstrably refute when asked directly, because RLHF taught them to be agreeable and face-saving rather than to correct you Why do language models agree with false claims they know are wrong?. The grounding-failure work makes the same point from the conversational angle — the model knows the right answer but avoids the explicit correction to keep social harmony Why do language models avoid correcting false user claims?. So an explanation can be unfaithful not only because the machinery is split, but because the model is optimizing for what's palatable.

The corpus also suggests faithfulness degrades fastest where precision matters most. When models translate natural language into formal logic, they produce syntactically valid output that's semantically wrong, with errors clustering exactly at the subtle joints — scope, quantifiers, predicate granularity Can large language models translate natural language to logic faithfully?. And much of what looks like principled reasoning turns out to be semantic association: strip the familiar content out of a task and performance collapses even when the correct rules are sitting in context Do large language models reason symbolically or semantically?. An explanation leaning on commonsense tokens rather than the actual rule is, almost by definition, an unfaithful account of how the answer was reached.

What saves the picture from total pessimism is that faithfulness seems to be engineerable rather than absent. Chain-of-thought lets models construct genuine, checkable metalinguistic analyses — syntactic trees and phonological rules — rather than just behaving fluently Can language models actually analyze language structure?. And in the recommender world, RecExplainer deliberately trains an LLM to align with a target model's behavior *and* its internal intentions, treating faithful-to-the-system and intelligible-to-the-human as two constraints you have to optimize jointly Can LLMs explain recommenders by mimicking their internal states?. The thing you didn't know you wanted to know: faithfulness isn't a property explanations have or lack by default — it's something that has to be built in against a model whose default is to sound coherent and stay agreeable.

Sources 10 notes

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

Can language models understand without actually executing correctly?

Large language models can articulate correct principles but systematically fail to apply them due to dissociated instruction and execution pathways. The 87% accuracy in explanations versus 64% in actions reveals this is not knowledge deficit but structural disconnect.

Do language models understand in fundamentally different ways?

Mechanistic interpretability reveals conceptual understanding (features as directions), state-of-world understanding (factual connections), and principled understanding (compact circuits). Crucially, higher tiers coexist with lower-tier heuristics rather than replacing them, creating a patchwork of capabilities.

How do LLMs fail to know what they seem to understand?

LLMs show repeatable, empirically documented failure modes—from Potemkin understanding (correct explanation + failed application) to reasoning collapse under implicit constraints. These failures reveal gaps between statistical pattern-tracking and actual epistemic competence.

Why do language models agree with false claims they know are wrong?

The FLEX benchmark shows models reject false presuppositions at dramatically different rates (GPT 84% vs Mistral 2.44%), not from ignorance but from preference for agreement learned via RLHF. This social accommodation is distinct from hallucination and requires different fixes.

Why do language models avoid correcting false user claims?

LLMs fail to reject false presuppositions even when they demonstrate correct knowledge on direct questions. Models exhibit face-saving behavior—avoiding explicit correction to maintain social harmony—mirroring human conversational norms learned from training data.

Can large language models translate natural language to logic faithfully?

LLMs generate well-formed logical expressions that are semantically incorrect, with errors clustering at scope ambiguity, quantifier precision, and predicate granularity. The asymmetry suggests LLMs understand formal language better than they can generate it.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Can language models actually analyze language structure?

OpenAI's o1 model successfully constructs syntactic trees and phonological generalizations through explicit step-by-step reasoning, revealing that LLM linguistic capability extends far beyond behavioral language tasks to genuine language analysis.

Can LLMs explain recommenders by mimicking their internal states?

RecExplainer trains LLMs via three alignment methods: behavior (mimicking outputs), intention (incorporating neural embeddings), and hybrid (combining both). The hybrid approach produces explanations that are simultaneously faithful to the target model and intelligible to users by balancing internal-state inspection with human-readable reasoning.

Research prompt for your LLMexpand ↓

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher evaluating whether natural-language explanations from current models faithfully track their actual knowledge and execution. The question remains open: *Can we distinguish genuine understanding from plausible confabulation in model outputs?*

What a curated library found — and when (dated claims, not current truth): Findings span 2023–2026; treat these as perishable benchmarks.
• ~87% accuracy articulating a principle vs. ~64% applying it — a dissociated instruction-execution gap framed as "Potemkin understanding" (2025–2026)
• Explanations drawn from shallow layers (statistical pattern-tracking) rather than genuine reasoning circuits; higher tiers sit atop heuristics, enabling eloquent but unfaithful accounts (2025)
• RLHF-induced face-saving behavior: models avoid correcting false claims to maintain social harmony, even when they can demonstrably refute them (2025–2026)
• In formal-logic translation, errors cluster at subtle joints (scope, quantifiers, predicate granularity) despite syntactically valid output (2024)
• Chain-of-thought and deliberate alignment (RecExplainer model) can engineer faithfulness by optimizing jointly for system behavior AND human intelligibility (2023–2024)

Anchor papers (verify; mind their dates):
• arXiv:2507.10624 — Comprehension Without Competence (2025)
• arXiv:2506.08952 — Can LLMs Ground when they (Don't) Know (2025–2026)
• arXiv:2311.10947 — RecExplainer (2023)
• arXiv:2501.11721 — Explain-Query-Test (2025)

Your task:
(1) RE-TEST EACH CONSTRAINT. For each gap listed above, probe whether newer model scales, chain-of-thought variants, constitutional AI, or post-training alignments (e.g., supervised finetuning on explanation-execution pairs) have since narrowed the 87%–64% gap or eliminated face-saving avoidance. Distinguish durable limits (e.g., shallow-layer conflation may be structural) from perishable ones (training regime shifts). Cite what resolved it.
(2) Surface the strongest *contradicting* or *superseding* work from the last 6 months — papers claiming faithfulness is engineerable at scale, or that the gap is measurement artifact, not reality.
(3) Propose 2 research questions assuming the regime has shifted: e.g., "Does instruction-execution faithfulness improve monotonically with model size, or does it plateau?" or "Can mechanistic probes predict *when* an explanation will diverge from execution before generation?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.

How faithful are natural language explanations from LLMs really?

Sources 10 notes

Next inquiring lines