What distinguishes a representational feature from a causally inert correlation?
This explores how researchers tell whether a pattern found inside a model actually does work in the computation (a feature) versus just happening to be present without driving any behavior (a correlation that's along for the ride).
The difference is between a pattern a model genuinely *uses* and one that merely *appears* when you look, and the corpus is unusually direct that you cannot tell them apart by looking alone. The cleanest answer comes from work arguing that mechanistic understanding needs two separate moves: representational analysis to locate a candidate feature, then causal analysis to confirm it actually changes behavior (Can we understand LLM mechanisms with only representational analysis?). Representation alone gives you correlation; it tells you something is decodable from the activations, not that the model leans on it. The line between feature and inert correlation is drawn by intervention: steer it, ablate it, and see if the output moves.
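The probe-versus-intervention gap can be made concrete with a toy setup. The sketch below is a hand-built illustration, not code from any of the cited notes: the names `make_toy`, `probe_r2`, and `ablation_effect` are hypothetical, the "model" is just a linear readout, and ablation is assumed to mean projecting a direction out of the activations. Two signals are equally decodable, but only one of them is read by the output.

```python
import numpy as np

def make_toy(n=500, dim=8, seed=0):
    """Toy activations carrying two signals; the readout uses only one."""
    rng = np.random.default_rng(seed)
    u = np.zeros(dim); u[0] = 1.0   # direction the readout actually uses
    v = np.zeros(dim); v[1] = 1.0   # direction that is decodable but unused
    used = rng.normal(size=n)
    inert = rng.normal(size=n)
    h = np.outer(used, u) + np.outer(inert, v) + 0.01 * rng.normal(size=(n, dim))
    w_out = 2.0 * u                 # output = w_out . h, blind to v
    return h, used, inert, u, v, w_out

def probe_r2(h, target):
    """R^2 of a least-squares linear probe from activations to a target signal."""
    X = np.column_stack([h, np.ones(len(h))])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coef
    return 1.0 - resid.var() / target.var()

def ablation_effect(h, w_out, direction):
    """Mean absolute change in output after projecting out one direction."""
    d = direction / np.linalg.norm(direction)
    ablated = h - np.outer(h @ d, d)
    return float(np.mean(np.abs(h @ w_out - ablated @ w_out)))

if __name__ == "__main__":
    h, used, inert, u, v, w_out = make_toy()
    print("probe R^2, used signal: ", probe_r2(h, used))
    print("probe R^2, inert signal:", probe_r2(h, inert))
    print("output shift, ablate u: ", ablation_effect(h, w_out, u))
    print("output shift, ablate v: ", ablation_effect(h, w_out, v))
```

Both signals probe almost perfectly, so decodability cannot separate them. Only the ablation does: removing `u` moves the output, removing `v` leaves it untouched.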
Why is looking insufficient? Because the analysis tools themselves are biased toward what's easy to see. Standard methods like PCA, linear regression, and representational similarity analysis (RSA) over-represent simple linear structure and under-represent equally real nonlinear features. Strikingly, a network using homomorphic encryption can compute perfectly while showing *no* interpretable activation structure at all, which shows that representation and computation can fully decouple (Do standard analysis methods hide nonlinear features in neural networks?). So a clean-looking linear direction might be causally inert, and a real computational feature might be invisible to your probe. Relatedly, a model can carry every linearly decodable feature a task needs while its internal organization is fractured: the decodability is real, but the structure that would make it robust isn't there (Can models be smart without organized internal structure?).
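A small example of a feature that a linear probe cannot see, under assumptions chosen purely for illustration: the feature is stored as the *radius* of a 2-D activation vector, with a random angle. The function names `make_radius_code` and `r2_linear_readout` are hypothetical, and the two-level feature is picked so that a quadratic readout recovers it exactly.

```python
import numpy as np

def make_radius_code(n=4000, seed=0):
    """Encode a two-valued feature z as the norm of a 2-D activation."""
    rng = np.random.default_rng(seed)
    z = rng.choice([1.0, 3.0], size=n)             # the feature
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)  # nuisance angle
    h = np.column_stack([z * np.cos(theta), z * np.sin(theta)])
    return h, z

def r2_linear_readout(features, target):
    """R^2 of a least-squares linear readout (with intercept)."""
    X = np.column_stack([features, np.ones(len(features))])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coef
    return 1.0 - resid.var() / target.var()

if __name__ == "__main__":
    h, z = make_radius_code()
    # Linear probe on raw activations: the feature is nearly invisible.
    print("linear probe R^2:   ", r2_linear_readout(h, z))
    # The same readout on squared activations recovers it.
    print("quadratic probe R^2:", r2_linear_readout(h ** 2, z))
```

The feature is fully present in the activations in both cases; what changes is only whether the analysis tool is shaped to find it.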
The positive case, what a confirmed feature looks like, shows up in steering work. A single reasoning feature identified with a sparse autoencoder (SAE), when directly steered, matches or beats chain-of-thought across six model families and activates early enough to override surface instructions (Can we trigger reasoning without explicit chain-of-thought prompts?). That's the gold standard: not 'this correlates with reasoning' but 'push this and reasoning happens.' Sparse-weight training pushes the same idea into the architecture, building circuits where ablation studies confirm specific neurons are necessary and sufficient for a task. Necessity and sufficiency are exactly the causal tests a mere correlation fails (Can sparse weight training make neural networks interpretable by design?).
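Necessity and sufficiency can be phrased as two ablation runs over the same network. The sketch below uses a hand-wired four-neuron ReLU layer, not anything from the sparse-weight work itself; `W_IN`, `W_OUT`, `make_task`, and `accuracy` are hypothetical names, and zero-ablation is an assumption made for simplicity (mean or resample ablation are common alternatives).

```python
import numpy as np

# Hand-wired hidden layer: neurons 0 and 1 form the candidate circuit,
# neurons 2 and 3 are distractors the readout ignores.
W_IN = np.array([
    [ 1.0, -1.0, 0.0, 0.0],   # neuron 0: active when x0 > x1
    [-1.0,  1.0, 0.0, 0.0],   # neuron 1: active when x1 > x0
    [ 0.0,  0.0, 1.0, 0.0],   # neuron 2: distractor
    [ 0.0,  0.0, 0.0, 1.0],   # neuron 3: distractor
])
W_OUT = np.array([1.0, -1.0, 0.0, 0.0])

def make_task(n=2000, seed=0):
    """Inputs and labels for the toy task 'is x0 greater than x1?'."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, 4))
    return x, x[:, 0] > x[:, 1]

def accuracy(x, labels, keep):
    """Task accuracy when every hidden neuron not in `keep` is zero-ablated."""
    hidden = np.maximum(x @ W_IN.T, 0.0)
    mask = np.isin(np.arange(hidden.shape[1]), list(keep))
    logits = (hidden * mask) @ W_OUT
    return float(np.mean((logits > 0) == labels))

if __name__ == "__main__":
    x, labels = make_task()
    circuit = {0, 1}
    everything = {0, 1, 2, 3}
    print("full model:          ", accuracy(x, labels, everything))
    # Sufficiency: keep only the circuit, ablate the rest.
    print("circuit only:        ", accuracy(x, labels, circuit))
    # Necessity: ablate the circuit, keep the rest.
    print("circuit ablated:     ", accuracy(x, labels, everything - circuit))
```

A candidate passes only if both runs come out as expected: performance survives with the circuit alone and collapses toward chance without it.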
The distinction also reaches beyond interpretability into how we model reasoning itself. Causal belief networks are powerful but can't capture associative or analogical links, which means some real cognitive relationships are non-causal by nature: a useful reminder that 'correlation' isn't always a failure to find causation (Can causal models alone capture how humans actually reason?). And LLMs reproduce human causal-reasoning errors such as weak explaining-away, apparently picked up from training-data statistics: a case where a behavioral regularity traces to surface correlation rather than a genuine causal mechanism inside the model (Do large language models make the same causal reasoning mistakes as humans?).
The thing worth walking away with: a representational feature earns its name by surviving intervention. Everything decodable is a candidate; only what moves the output when you perturb it is real. The whole field's anxiety is that its favorite measuring sticks reward the candidates that are easiest to see, not the ones that actually do the work.
Sources (7 notes)
Representational analysis alone identifies correlations without causation; causal analysis alone shows behavioral effects without explaining them. Only paired methods—locating candidate features representationally, then verifying causally—produce complete mechanistic claims.
PCA, linear regression, and RSA over-represent simple linear features while under-representing equally important nonlinear features. Homomorphic encryption demonstrates that networks can compute perfectly well with no interpretable activation structure, proving representation patterns and computation can be entirely decoupled.
Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.
SAE-identified reasoning features can be directly steered to match or exceed chain-of-thought performance across six model families. This reasoning mode activates early in generation and overrides surface-level instructions, suggesting latent reasoning is a fundamental capability independent of explicit prompting.
Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.
Causal belief networks excel at modeling causal reasoning but cannot represent associative links, analogical mappings, or emotion-driven belief shifts. The GenMinds framework itself acknowledges this as a tractable starting point rather than a complete theory.
LLMs show weak explaining away and Markov violations in collider networks, matching human error patterns exactly. This suggests shared mechanisms rooted in training data statistics rather than categorical reasoning inferiority.