Can we understand LLM mechanisms with only representational analysis?
Explores whether mapping what information a model encodes is sufficient for mechanistic understanding, or whether causal verification is equally necessary to claim genuine mechanism.
The implementation-level argument in Levels of Analysis for Large Language Models is that representational analysis and causal analysis are partners, not alternatives. Representational analysis maps what information a model encodes: which features, circuits, and attention heads carry which signals. Causal analysis tests whether that encoded information actually drives behavior, using interventions such as ablations and activation patching. Either method alone produces an incomplete account: a representation that is encoded but causally inert is a curiosity, and a causal effect with no representational characterization is unexplained.
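The distinction between "encoded" and "used" can be made concrete with a minimal sketch. Everything in it is an illustrative assumption rather than anything taken from the paper: a hypothetical two-unit toy model, a Pearson correlation standing in for a probe, and mean-ablation standing in for a causal intervention.

```python
from math import sqrt
from statistics import mean

# Hypothetical toy model (an assumption for illustration): two hidden units
# read the same input, but only unit 0 is wired to the output.
W_OUT = [1.5, 0.0]

def hidden(x):
    # Unit 0 copies the input; unit 1 re-encodes it linearly.
    return [x, 2.0 * x + 1.0]

def output(h):
    return sum(w * a for w, a in zip(W_OUT, h))

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

inputs = [i / 10 for i in range(-20, 21)]
acts = [hidden(x) for x in inputs]

def probe_score(unit):
    """Representational check: |r| between a unit's activation and the input signal."""
    return abs(pearson([a[unit] for a in acts], inputs))

def ablation_effect(unit):
    """Causal check: mean |output change| when the unit is replaced by its mean."""
    mu = mean(a[unit] for a in acts)
    total = 0.0
    for a in acts:
        patched = list(a)
        patched[unit] = mu
        total += abs(output(patched) - output(a))
    return total / len(acts)

for unit in range(2):
    print(unit, round(probe_score(unit), 3), round(ablation_effect(unit), 3))
```

In this toy setup both units look equally informative to the probe, yet only unit 0 changes the output when ablated. Unit 1 is the "encoded but causally inert" case: the probe alone cannot tell the two units apart.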
The pairing matters because each method can mislead when used alone. Representational analysis can identify features that correlate with behavior without showing that they cause it, which is a classic confound. Causal analysis can demonstrate that intervening on a component changes behavior without revealing what that component encodes: the lesion shows damage but not function. In combination, representational analysis locates candidates and causal analysis tests their functional role, and that is what produces mechanistic claims rather than merely descriptive ones.
This has methodological consequences for interpretability research. Studies that report only feature visualizations, or only activation-patching results, contribute evidence but do not close the loop. Convergent evidence comes from pairing the two in either direction: locate a candidate feature representationally, then verify it causally; or identify a causal component, then map what it represents. The literature on attention circuits, induction heads, and feature dictionaries has been moving toward this pairing.
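The two-directional pairing described above can be sketched as a small screening loop. This is a hedged illustration under stated assumptions, not a method from the paper: the three-unit model, the candidate variables `a` and `b`, the sign-flip corruption used for patching, and the thresholds `probe_min` and `effect_min` are all hypothetical.

```python
from math import sqrt
from statistics import mean

# Hypothetical three-unit toy model over two input variables (a, b).
# Weights, thresholds, and names are illustrative assumptions.
W_OUT = [1.0, 0.0, 1.0]

def hidden(a, b):
    # Unit 0 encodes a and is read out; unit 1 encodes a but is never read
    # out; unit 2 computes a*b and is read out.
    return [a, 3.0 * a - 2.0, a * b]

def output(h):
    return sum(w * x for w, x in zip(W_OUT, h))

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denom = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return cov / denom if denom else 0.0

GRID = [-1.0, -0.5, 0.0, 0.5, 1.0]
data = [(a, b) for a in GRID for b in GRID]

def probe_score(unit):
    """Representational step: best |r| between the unit and any candidate variable."""
    acts = [hidden(a, b)[unit] for a, b in data]
    candidates = ([a for a, _ in data], [b for _, b in data])
    return max(abs(pearson(acts, c)) for c in candidates)

def patching_effect(unit):
    """Causal step: mean |output change| when the unit's activation is
    patched in from a corrupted run on (-a, b)."""
    total = 0.0
    for a, b in data:
        clean = hidden(a, b)
        patched = list(clean)
        patched[unit] = hidden(-a, b)[unit]
        total += abs(output(patched) - output(clean))
    return total / len(data)

def verdict(probe, effect, probe_min=0.9, effect_min=0.1):
    if probe >= probe_min and effect >= effect_min:
        return "mechanism candidate"
    if probe >= probe_min:
        return "encoded but causally inert"
    if effect >= effect_min:
        return "causal but uncharacterized"
    return "neither"

verdicts = [verdict(probe_score(u), patching_effect(u)) for u in range(3)]
for u, v in enumerate(verdicts):
    print(f"unit {u}: {v}")
```

Only the unit that passes both checks earns a mechanistic reading. A "causal but uncharacterized" verdict here means only that the chosen linear probes and candidate variables fail to describe the unit, not that it encodes nothing; that is the cue to go back and map its representation with a richer analysis.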
For LLM understanding specifically, this pairing explains why some claimed "mechanisms" have not held up. They were either representational without causal verification (a feature that looked like task encoding but did not drive task behavior) or causal without representational characterization (an intervention that changed behavior but left unexplained what the affected component encodes). The discipline imported from cognitive neuroscience is to demand both.
Inquiring lines that use this note as a source (64)
This note is a source for these synthesized inquiries.
- What distinguishes LLM fabrication from genuine theoretical reasoning?
- What is the mechanistic signature when models chain facts never presented together?
- What distinguishes genuine cultural understanding from exploited surface-level elimination strategies?
- Can mechanistic interpretability reveal how ideologies decompose into simpler features?
- What domain properties determine whether causal rules transfer to new agents?
- How does Peircean Secondness differ from what RLHF actually provides?
- Do causal rules enforce robustness that statistical patterns alone cannot maintain?
- How does the outer loop escape its own LLM's knowledge boundaries when discovering mechanisms?
- What audit techniques best complement each other for detecting hidden model goals?
- What makes causal belief networks more auditable than prompted personas?
- Can causal models be extended to include non-causal cognition?
- How much do mechanistic interpretability findings reflect true reasoning architecture?
- Can models distinguish between truthfulness and honesty mechanistically?
- What inductive bias would force models to learn Newtonian mechanics instead of shortcuts?
- Which hedging markers function as causal pivots versus noise in traces?
- Can steering vectors prove that representations are genuinely organized?
- How do we distinguish knowledge encoding from knowledge usage in models?
- Why does DPO create introspective detection circuits but SFT does not?
- How much introspective capability do safety mechanisms actively suppress in models?
- Are detection and identification of injections truly separable in neural circuits?
- How do world models create indirect causal grounding without physical environment contact?
- What internal mechanisms explain LLM reasoning and representation limits?
- Do LLMs rely on surface statistical patterns instead of causal structure?
- Can LLM semantic representations exist without causally influencing their generation output?
- Are traditional cognitive theories missing interaction effects between mechanisms?
- Do causal histories determine what mental states a system can instantiate?
- Can LLMs have minimal introspection through causal linkage to internal states?
- Can mechanistic interpretability explain explanation-execution disconnection?
- Can functional behavior alone capture what makes something a genuine belief?
- What behavioral markers distinguish realized quasi-states from pretended ones?
- What consumption data would validate the limited-consumption model in production systems?
- Can LLMs reason through semantics without understanding causal mechanisms?
- How does semantic association differ from mechanistic causal reasoning?
- What's the difference between representing world facts and generating world mechanisms?
- How do delayed effects complicate causal attribution in agent systems?
- Does causal intervention alone explain how neural mechanisms implement representations?
- Can a single dominant mechanism replace the combined effect of all five?
- Can a perfect behavioral simulation constitute genuine understanding or experience?
- How does vehicle causality differ from content causality in physical systems?
- What makes attractor-based probing better for third-party model auditing than alternatives?
- Can attractor dynamics compete with input-based probing for characterizing model knowledge?
- Can geometric structure in representations exist without supporting functional mechanisms?
- How does mechanistic interpretability complement learning mechanics in explaining deep learning?
- How do classical mechanics and statistical mechanics provide methodological templates for learning theory?
- Why do attention circuits need causal verification beyond feature visualization?
- What distinguishes a representational feature from a causally inert correlation?
- How do ablation studies reveal function without representational characterization?
- Can interventions on model components prove mechanism without explaining encoding?
- Can mechanistic interpretability tools decode the biases alignment training conceals?
- What distinguishes mechanical generation failures from deliberate behavioral withholding?
- Can we systematically enumerate LLM failure modes from first principles?
- Can spectral eigenvector ordering serve as a model-agnostic interpretability probe?
- Can representation analysis methods detect complex features models compute with?
- What structural framework prevents LLM explanations from becoming just plausible fiction?
- How do mechanistic features compare to natural language for interpretability?
- Can models be trained to hide causal influences in their explanations?
- How do mechanistic interpretability tools help distinguish truthfulness from honesty?
- How should we rethink the symbolism versus connectionism debate in light of LLMs?
- Why do LLMs reason fluently about causality but lack causal rigor?
- What prevents LLM representations from causally influencing generation outputs?
- Can a Reflect mechanism detect and revise failed causal predictions?
- How does causal structure avoid behaviorist limitations in LLM social simulation?
- Why does masking future experts guarantee causal validity without external verification?
- What architectural changes would help LLMs distinguish causal relationships from temporal sequences?
Related concepts in this collection (4)
- Can cognitive science methods unlock how LLMs actually work?
  Does Marr's three-level framework, developed to understand biological minds, offer interpretability researchers the structured methodology they need to decode opaque language models?
  same paper, the framework
- Can we predict where language models will fail?
  Does characterizing the abstract computational problem an LLM solves (as a probability machine over sequences) let us predict which tasks it will struggle with systematically, before running experiments?
  same paper, computational level companion
- Can indirect psychology tests reveal what LLMs conceal about bias?
  Alignment training teaches LLMs to refuse direct questions about bias, but do implicit psychological methods like the IAT expose the underlying associations that remain encoded in their representations?
  same paper, algorithmic level companion
- Do language model reasoning drafts faithfully represent their actual computation?
  If models externalize reasoning in thinking drafts before answering, does the draft accurately reflect their internal process? This matters for AI safety monitoring and error detection.
  adjacent: dual-dimension methodology in CoT
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Levels of Analysis for Large Language Models
- Mechanistic Indicators of Understanding in Large Language Models
- AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders
- Computational structuralism: Toward a formal theory of meaning in the age of digital intelligence
- Language Models’ Hall of Mirrors Problem: Why AI Alignment Requires Peircean Semiosis
- LLM Reasoning Is Latent, Not the Chain of Thought
- Explain-Query-Test: Self-Evaluating LLMs Via Explanation and Comprehension Discrepancy
- Making Reasoning Matter: Measuring and Improving Faithfulness of Chain-of-Thought Reasoning
Original note title
mechanistic understanding of LLMs requires both representational analysis and causal analysis — either alone is insufficient