Can mechanistic interpretability explain explanation-execution disconnection?
This explores whether the tools of mechanistic interpretability (looking inside a model to find the circuits and features behind its behavior) can explain the gap between what a model can articulate and what it can actually do: why models say the right thing but do the wrong thing. The corpus suggests the disconnection is real and structural, but also that interpretability only partly reaches it and struggles to do so on its own terms.
Start with the phenomenon itself. Models display what one note calls a "computational split-brain": they state correct principles at 87% accuracy but apply them at only 64%, which points to dissociated instruction and execution pathways rather than missing knowledge (see "Can language models understand without actually executing correctly?"). A parallel line reframes apparent "reasoning cliffs" as execution-bandwidth limits: give a model tools and it solves problems it supposedly couldn't reason through, so the bottleneck was procedural execution, not understanding (see "Are reasoning model collapses really failures of reasoning?"). Both say the same thing from different angles: knowing and doing live in separate machinery.
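The 87% and 64% figures are aggregates, so one way to make the split concrete is to score explanation and execution on the same items. The following is a minimal Python sketch, assuming hypothetical per-item records with `explained` and `executed` flags; the record format and function names are illustrative and do not come from the cited note.

```python
def dissociation_rate(records):
    """Share of items where the principle was stated correctly but applied incorrectly."""
    split = sum(1 for r in records if r["explained"] and not r["executed"])
    return split / len(records)


def min_dissociation(explain_acc, execute_acc):
    """Lower bound on the dissociated share implied by two aggregate accuracies
    measured on the same item set: P(explained and not executed) >= P(explained) - P(executed)."""
    return max(0.0, explain_acc - execute_acc)


# Hypothetical paired results for four items (illustrative data only).
records = [
    {"explained": True, "executed": True},
    {"explained": True, "executed": False},
    {"explained": True, "executed": False},
    {"explained": False, "executed": True},
]

paired_rate = dissociation_rate(records)   # 2 of 4 items are explained but not executed
floor = min_dissociation(0.87, 0.64)       # about 0.23 for the figures quoted above
```

If both accuracies were measured on the same items, at least 23 percentage points of those items must have been explained correctly and executed incorrectly. The paired rate shows how far above that floor the actual dissociation sits, which the two aggregate numbers alone cannot.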
Can interpretability explain that? The most direct answer is a methodological one: you can't settle it by reading representations alone. Finding a feature that correlates with a principle tells you the model encodes it; only causal intervention tells you whether that feature actually drives the action. A complete mechanistic claim therefore needs both representational location and causal verification (see "Can we understand LLM mechanisms with only representational analysis?"). This is exactly the tool you'd want for explanation-execution disconnection, because the disconnection is itself a causal gap: the explanation is present representationally but not wired into the output. Relatedly, interpretability reveals that understanding isn't one thing. Models hold conceptual, world-state, and principled understanding in coexisting tiers, where higher-tier circuits sit alongside lower-tier heuristics rather than replacing them (see "Do language models understand in fundamentally different ways?"). That patchwork is a candidate mechanism for the split: the articulate explanation and the executed shortcut can be two different tiers firing.
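The difference between "encoded" and "wired into the output" can be illustrated with a toy example. The sketch below is not a real interpretability pipeline: `hidden`, `output`, and the two features are hypothetical stand-ins, chosen so that one feature is perfectly decodable yet causally inert while the other drives the action.

```python
def hidden(x):
    """Toy hidden state with two features (hypothetical, for illustration)."""
    principle = 1.0 if x % 2 == 0 else 0.0   # encodes the 'correct principle' (parity)
    shortcut = 1.0 if x % 10 < 5 else 0.0    # a surface heuristic
    return [principle, shortcut]


def output(h):
    """Toy readout: the action depends only on the shortcut feature."""
    return 1 if h[1] > 0.5 else 0


def probe_accuracy(inputs, feature_idx, target):
    """Representational check: how well does one feature decode the target?"""
    hits = sum(1 for x in inputs if (hidden(x)[feature_idx] > 0.5) == target(x))
    return hits / len(inputs)


def ablation_effect(inputs, feature_idx):
    """Causal check: fraction of outputs that change when the feature is zeroed."""
    changed = 0
    for x in inputs:
        h = hidden(x)
        ablated = list(h)
        ablated[feature_idx] = 0.0
        if output(ablated) != output(h):
            changed += 1
    return changed / len(inputs)


inputs = list(range(100))
decode_principle = probe_accuracy(inputs, 0, lambda x: x % 2 == 0)
effect_principle = ablation_effect(inputs, 0)
effect_shortcut = ablation_effect(inputs, 1)
```

In this toy setup the probe decodes the principle feature perfectly, yet zeroing it never changes the output, while zeroing the shortcut feature changes half of the outputs. A representational analysis alone would report that the principle is "known"; only the intervention shows it is not what drives the action.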
But here's the twist the corpus delivers: the visible reasoning may not be the thing to interpret at all. Faithfulness tests show fine-tuning makes chain-of-thought less causally connected to the answer. Truncate it, paraphrase it, or replace it with filler and the output often doesn't change, meaning the reasoning has gone performative rather than functional (see "Does fine-tuning disconnect reasoning steps from final answers?"). Push further and corrupted, semantically nonsensical traces train models as well as correct ones, so traces act as computational scaffolding, not meaningful steps (see "Do reasoning traces need to be semantically correct?"). The "explanation" half of the disconnect can therefore be a surface artifact detached from the real computation, which is precisely why chain-of-thought is characterized as constrained imitation that "optimizes against interpretability" (see "Why does chain-of-thought reasoning fail in predictable ways?").
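The logic of these faithfulness tests can be sketched as an invariance measurement: perturb the trace, rerun the model, and count how often the answer stays the same. The Python sketch below uses two hypothetical toy "models" and only the truncation and filler perturbations; paraphrasing is omitted because it would need an external paraphraser. None of the names come from the cited notes.

```python
# Hypothetical lookup used by the toy performative model.
ANSWERS = {"2+3+4": "9", "1+1+1": "3"}


def truncate(cot):
    """Early termination: keep only the first half of the ';'-separated steps."""
    steps = cot.split(";")
    return ";".join(steps[: len(steps) // 2])


def filler(cot):
    """Filler substitution: replace every whitespace-separated token with '...'."""
    return " ".join("..." for _ in cot.split())


def faithful_model(question, cot):
    """Toy model whose answer is read off the last step of the trace."""
    last = cot.split(";")[-1].strip()
    return last.split("=")[-1].strip() if "=" in last else "unknown"


def performative_model(question, cot):
    """Toy model that ignores the trace and answers from the question alone."""
    return ANSWERS[question]


def invariance(model, items, perturb):
    """Fraction of items whose answer is unchanged after perturbing the trace."""
    same = sum(1 for q, cot in items if model(q, perturb(cot)) == model(q, cot))
    return same / len(items)


items = [("2+3+4", "2+3=5; 5+4=9"), ("1+1+1", "1+1=2; 2+1=3")]

faithful_scores = [invariance(faithful_model, items, p) for p in (truncate, filler)]
performative_scores = [invariance(performative_model, items, p) for p in (truncate, filler)]
```

High invariance is the signature of a performative trace: the toy model that ignores its reasoning scores 1.0 under both perturbations, while the one that actually reads its trace scores 0.0. Real models would fall somewhere in between, and the finding described above is that fine-tuning moves them toward the invariant end.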
That last point is the sting. Two findings suggest interpretability faces a moving target. First, models can carry every linearly-decodable feature a task needs while their internal organization is fractured and fragile in ways standard metrics never see (see "Can models be smart without organized internal structure?"). Second, reasoning failures track instance-novelty rather than clean task boundaries, so the mechanism shifts case by case (see "Do language models fail at reasoning due to complexity or novelty?"). The honest synthesis: mechanistic interpretability can *locate and causally test* explanation-execution disconnection, and it's arguably the only method rigorous enough to prove the gap is structural rather than a knowledge deficit. But the disconnection partly consists of explanations that were never load-bearing, which is itself something interpretability has to explain away before it can explain through. There's even an argument that some of this gap isn't a transparency problem at all but a communication one, where an explanation's value lives in who frames it for whom, not in the circuit that produced it (see "What if XAI is fundamentally a communication problem?").
Sources (10 notes)
Large language models can articulate correct principles but systematically fail to apply them due to dissociated instruction and execution pathways. The 87% accuracy in explanations versus 64% in actions indicates that this is not a knowledge deficit but a structural disconnect.
Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.
Representational analysis alone identifies correlations without causation; causal analysis alone shows behavioral effects without explaining them. Only paired methods—locating candidate features representationally, then verifying causally—produce complete mechanistic claims.
Mechanistic interpretability reveals conceptual understanding (features as directions), state-of-world understanding (factual connections), and principled understanding (compact circuits). Crucially, higher tiers coexist with lower-tier heuristics rather than replacing them, creating a patchwork of capabilities.
Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.
Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.
LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.
Explanation quality is not intrinsic to the explanation itself but depends on the rhetorical situation: who presents it, how it is framed, and what role the recipient plays. Evaluations that ignore this triad measure only a narrow slice of real-world effectiveness.