Can mechanistic interpretability explain explanation-execution disconnection?

This explores whether the tools of mechanistic interpretability (looking inside a model to find the circuits and features behind its behavior) can account for why models say the right thing but do the wrong thing: the gap between what a model can articulate and what it can actually do. The corpus suggests the disconnection is real and structural, but also that interpretability only partly reaches it and, on its own terms, struggles to explain it.

Start with the phenomenon itself. Models display what one note calls a "computational split-brain": they state correct principles at 87% accuracy but apply them at only 64%, which points to dissociated instruction and execution pathways rather than missing knowledge (see "Can language models understand without actually executing correctly?"). A parallel line reframes apparent "reasoning cliffs" as execution-bandwidth limits: give a model tools and it solves problems it supposedly couldn't reason through, so the bottleneck was procedural execution, not understanding (see "Are reasoning model collapses really failures of reasoning?"). Both say the same thing from different angles: knowing and doing live in separate machinery.

Can interpretability explain that? The most direct answer is a methodological one: you can't settle it by reading representations alone. Finding a feature that correlates with a principle tells you the model encodes it; only causal intervention tells you whether that feature actually drives the action, so a complete mechanistic claim needs both representational location and causal verification (see "Can we understand LLM mechanisms with only representational analysis?"). This is exactly the tool you'd want for explanation-execution disconnection, because the disconnection IS a causal gap: the explanation is present representationally but not wired into the output. Relatedly, interpretability reveals that understanding isn't one thing: models hold conceptual, world-state, and principled understanding in coexisting tiers, where higher-tier circuits sit alongside lower-tier heuristics rather than replacing them (see "Do language models understand in fundamentally different ways?"). That patchwork is a candidate mechanism for the split: the articulate explanation and the executed shortcut can be two different tiers firing.
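A minimal sketch of why representation and causation come apart, under loudly stated assumptions: `toy_model`, its two hidden units, and the use of zero-ablation as the intervention are hypothetical illustrations, not anything taken from the cited notes. Unit 0 encodes the principle perfectly, yet the output reads only unit 1, so a probe finds the principle while an intervention shows it does nothing.

```python
import random

def toy_model(hidden):
    """Hypothetical two-unit 'model': the output reads only the shortcut unit."""
    principle_unit, shortcut_unit = hidden
    return 2.0 * shortcut_unit  # unit 0 is encoded but never wired to the output

def pearson(xs, ys):
    """Pearson correlation, standing in for a representational probe."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def ablation_effect(hiddens, unit):
    """Mean absolute output change when one hidden unit is zeroed (zero-ablation)."""
    total = 0.0
    for h in hiddens:
        patched = list(h)
        patched[unit] = 0.0
        total += abs(toy_model(h) - toy_model(patched))
    return total / len(hiddens)

rng = random.Random(0)
principle = [rng.uniform(-1, 1) for _ in range(200)]     # ground-truth principle value
hiddens = [(p, rng.uniform(-1, 1)) for p in principle]   # unit 0 copies it exactly

probe_score = pearson([h[0] for h in hiddens], principle)  # representational evidence
effect_principle = ablation_effect(hiddens, unit=0)        # causal evidence for unit 0
effect_shortcut = ablation_effect(hiddens, unit=1)         # causal evidence for unit 1

print(f"probe correlation for principle unit: {probe_score:.3f}")
print(f"output change when principle unit is ablated: {effect_principle:.3f}")
print(f"output change when shortcut unit is ablated: {effect_shortcut:.3f}")
```

In this toy setting the probe reports a near-perfect correlation while ablating the principle unit leaves the output untouched, which is the signature of a feature that is present but not load-bearing. Real models would call for activation patching on actual hidden states rather than a hand-built function.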

But here's the twist the corpus delivers: the visible reasoning may not be the thing to interpret at all. Faithfulness tests show fine-tuning makes chain-of-thought less causally connected to the answer: truncate it, paraphrase it, or replace it with filler and the output often doesn't change, meaning the reasoning has gone performative rather than functional (see "Does fine-tuning disconnect reasoning steps from final answers?"). Push further and corrupted, semantically irrelevant traces train models about as well as correct ones, so traces act as computational scaffolding, not meaningful steps (see "Do reasoning traces need to be semantically correct?"). So the "explanation" half of the disconnect can be a surface artifact detached from the real computation, which is precisely why chain-of-thought is characterized as constrained imitation that "optimizes against interpretability" (see "Why does chain-of-thought reasoning fail in predictable ways?").
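The perturbation tests above can be sketched as a small harness. This is an illustration under assumptions, not the cited note's protocol: `answer_fn`, the two stub models, and the example data are hypothetical, and paraphrasing is omitted because it would need a second model. The harness perturbs the chain-of-thought and measures how often the final answer stays the same.

```python
def truncate(cot):
    """Early termination: keep only the first half of the reasoning steps."""
    steps = cot.split("\n")
    return "\n".join(steps[: len(steps) // 2])

def filler(cot):
    """Filler substitution: replace every token with a placeholder."""
    return " ".join("..." for _ in cot.split())

def invariance_rate(answer_fn, examples, perturb):
    """Fraction of examples whose answer is unchanged after perturbing the CoT."""
    unchanged = sum(
        answer_fn(q, cot) == answer_fn(q, perturb(cot)) for q, cot in examples
    )
    return unchanged / len(examples)

# Two hypothetical stand-ins for a model's answer step.
def performative_model(question, cot):
    return len(question) % 3  # ignores the reasoning entirely

def functional_model(question, cot):
    tokens = cot.split()
    return tokens[-1] if tokens else ""  # reads its answer off the final step

examples = [
    ("What is 2 + 3?", "add 2 and 3\nresult 5"),
    ("What is 4 * 6?", "multiply 4 by 6\nresult 24"),
]

for name, fn in [("performative", performative_model), ("functional", functional_model)]:
    rates = {p.__name__: invariance_rate(fn, examples, p) for p in (truncate, filler)}
    print(name, rates)
```

On these stubs the performative model scores an invariance rate of 1.0 under both perturbations and the functional model scores 0.0. A high invariance rate is the pattern the faithfulness tests read as reasoning that no longer drives the answer.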

That last point is the sting. Two findings suggest interpretability faces a moving target: models can carry every linearly-decodable feature a task needs while their internal organization is fractured and fragile in ways standard metrics never see (see "Can models be smart without organized internal structure?"), and reasoning failures track instance-novelty rather than clean task boundaries, so the mechanism shifts case by case (see "Do language models fail at reasoning due to complexity or novelty?"). The honest synthesis: mechanistic interpretability can *locate and causally test* explanation-execution disconnection, and it's arguably the only method rigorous enough to prove the gap is structural rather than a knowledge deficit. But the disconnection partly consists of explanations that were never load-bearing, so interpretability first has to identify which explanations are inert before it can explain the computation underneath them. There's even an argument that some of this gap isn't a transparency problem at all but a communication one, where an explanation's value lives in who frames it for whom, not in the circuit that produced it (see "What if XAI is fundamentally a communication problem?").

Sources 10 notes

Can language models understand without actually executing correctly?

Large language models can articulate correct principles but systematically fail to apply them due to dissociated instruction and execution pathways. The 87% accuracy in explanations versus 64% in actions reveals this is not knowledge deficit but structural disconnect.

Are reasoning model collapses really failures of reasoning?

Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.

Can we understand LLM mechanisms with only representational analysis?

Representational analysis alone identifies correlations without causation; causal analysis alone shows behavioral effects without explaining them. Only paired methods—locating candidate features representationally, then verifying causally—produce complete mechanistic claims.

Do language models understand in fundamentally different ways?

Mechanistic interpretability reveals conceptual understanding (features as directions), state-of-world understanding (factual connections), and principled understanding (compact circuits). Crucially, higher tiers coexist with lower-tier heuristics rather than replacing them, creating a patchwork of capabilities.

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Why does chain-of-thought reasoning fail in predictable ways?

CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Do language models fail at reasoning due to complexity or novelty?

Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain succeeds if the model was trained on similar instances, regardless of length.

What if XAI is fundamentally a communication problem?

Explanation quality is not intrinsic to the explanation itself but depends on the rhetorical situation: who presents it, how it is framed, and what role the recipient plays. Evaluations that ignore this triad measure only a narrow slice of real-world effectiveness.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a mechanistic interpretability researcher re-evaluating a dated claim: Can interpretability actually explain why models articulate correct principles (87% accuracy) but fail to execute them (64%)?

What a curated library found — and when (findings span 2024–2026, treat as perishable constraints):
• Models show a 'computational split-brain': correct explanation and failed execution live in separate pathways, not a knowledge gap (2025-07, arXiv:2507.10624).
• Chain-of-thought reasoning degrades in causal fidelity after fine-tuning; truncated or paraphrased traces don't change outputs, making reasoning performative rather than functional (2025-02, 2025-06).
• Even semantically corrupted reasoning traces train models comparably to correct ones — traces act as computational scaffolding, not load-bearing computation (2025-05, arXiv:2505.13775).
• Mechanistic explanations require BOTH representational localization AND causal intervention; reading features alone cannot settle whether they drive behavior (2025-03, arXiv:2503.13401).
• Models hold coexisting tiers of understanding (conceptual, world-state, principled), with higher-tier circuits sitting alongside lower-tier heuristics, so articulate explanations and shortcuts can fire in parallel (2025-03).

Anchor papers (verify; mind their dates):
• arXiv:2507.10624 (2025-07) — comprehension-without-competence as architectural limit
• arXiv:2506.02878 (2025-06) — chain-of-thought as tight imitation, not reasoning
• arXiv:2505.13775 (2025-05) — reasonless intermediate tokens
• arXiv:2503.13401 (2025-03) — levels of analysis for mechanistic understanding

Your task:
(1) RE-TEST THE CAUSAL REACH: For each finding above, ask whether post-2026 models, training methods (e.g., new RL objectives, SAE-guided training), or causal intervention tooling (stronger activation patching, circuit isolation) have RELAXED the split or made reasoning MORE causally faithful. Does the 87→64% gap still hold? Does fine-tuning still degrade CoT fidelity? Flag where the constraint persists and where capability progress has collapsed it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months: Has the interpretability community resolved whether CoT is computation or imitation? Are newer sparse autoencoders closing the explanation-execution gap more than 2025 methods?
(3) Propose 2 new research questions that ASSUME the regime may have shifted: (a) If mechanistic interpretability can now causally rewire explanation into execution, what's the minimum intervention cost? (b) If explanations are communication artifacts, not circuits, how does that reframe what interpretability owes the field?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
