What distinguishes a representational feature from a causally inert correlation?

This explores how researchers tell whether a pattern found inside a model actually does work in the computation (a feature) versus just happening to be present without driving any behavior (a correlation that's along for the ride).

The core difference is between a pattern a model genuinely *uses* and one that merely *appears* when you look, and the corpus is unusually direct that you cannot tell them apart by looking alone. The cleanest answer comes from work arguing that mechanistic understanding needs two separate moves: representational analysis to locate a candidate feature, then causal analysis to confirm it actually changes behavior (Can we understand LLM mechanisms with only representational analysis?). Representation alone gives you correlation: it tells you something is decodable from the activations, not that the model relies on it. Causal analysis alone shows that an intervention changes behavior without explaining why. The line between feature and inert correlation is drawn by intervention: steer it, ablate it, and see if the output moves.
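The gap between "decodable" and "used" can be made concrete with a toy model. The sketch below is illustrative only: the two-unit hidden layer, the `readout` weights, and the helper names are all hypothetical and do not come from any of the cited work. Both units carry the same signal, so a linear probe decodes it equally well from either, but only one unit is read by the output.

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.normal(size=1000)          # latent signal the toy model tracks

# Hypothetical hidden layer: unit 0 carries s and feeds the output;
# unit 1 is a copy of s that nothing downstream reads.
hidden = np.stack([s, s], axis=1)
readout = np.array([2.0, 0.0])     # assumed output weights; unit 1 has zero influence

def output(h):
    return h @ readout

def probe_r2(unit):
    """R^2 of a linear probe predicting s from a single hidden unit."""
    slope, intercept = np.polyfit(hidden[:, unit], s, 1)
    resid = s - (slope * hidden[:, unit] + intercept)
    return 1.0 - resid.var() / s.var()

def ablation_effect(unit):
    """Mean absolute change in the output when one unit is zeroed."""
    h = hidden.copy()
    h[:, unit] = 0.0
    return np.abs(output(h) - output(hidden)).mean()

def steering_effect(unit, delta=1.0):
    """Mean change in the output when a constant is added to one unit."""
    h = hidden.copy()
    h[:, unit] += delta
    return (output(h) - output(hidden)).mean()

for unit in (0, 1):
    print(unit, probe_r2(unit), ablation_effect(unit), steering_effect(unit))
```

Under these assumptions the probe scores are indistinguishable, while ablation and steering separate the two units: only unit 0 moves the output. Zero-ablation is used here for brevity; other ablation baselines (for example mean-ablation) are a design choice this sketch does not explore.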

Why is looking insufficient? Because the analysis tools themselves are biased toward what is easy to see. Standard methods like PCA, linear regression, and RSA over-represent simple linear structure and under-represent equally real nonlinear features. Strikingly, a network using homomorphic encryption can compute perfectly while showing *no* interpretable activation structure at all, which shows that representation and computation can fully decouple (Do standard analysis methods hide nonlinear features in neural networks?). So a clean-looking linear direction might be causally inert, and a real computational feature might be invisible to your probe. Relatedly, a model can carry every linearly decodable feature a task needs while its internal organization is fractured: the decodability is real, but the structure that would make it robust to perturbation and distribution shift is missing (Can models be smart without organized internal structure?).
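A minimal sketch of how a linear probe can miss a feature that is fully present in the activations. The setup is an assumption chosen for clarity, not taken from the cited note: two hypothetical activation units on a balanced grid of ±1 values, and an XOR-like feature equal to their product.

```python
import numpy as np

# Hypothetical 2-unit activation space on a balanced grid of +/-1 values.
acts = np.array([[a, b] for a in (-1.0, 1.0) for b in (-1.0, 1.0)] * 25)
feature = acts[:, 0] * acts[:, 1]   # XOR-like feature: a product, not a direction

def linear_r2(X, y):
    """R^2 of an ordinary least-squares probe with an intercept."""
    design = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 1.0 - resid.var() / y.var()

# Probe on the raw activations: the feature is there, but no linear readout finds it.
r2_raw = linear_r2(acts, feature)

# Same probe after adding the nonlinear term by hand: the feature is recovered.
r2_with_product = linear_r2(
    np.column_stack([acts, acts[:, 0] * acts[:, 1]]), feature
)

print(r2_raw, r2_with_product)
```

On this balanced grid the raw linear probe explains essentially none of the variance, while the probe given the product term explains all of it. The point is only that a null result from a linear tool is not evidence of absence.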

The positive case, what a confirmed feature looks like, shows up in steering work. SAE-identified reasoning features, when directly steered, match or beat chain-of-thought across six model families and activate early enough in generation to override surface instructions (Can we trigger reasoning without explicit chain-of-thought prompts?). That is the gold standard: not 'this correlates with reasoning' but 'push this and reasoning happens.' Sparse-weight training pushes the same idea into the architecture, building compact circuits that ablation studies confirm are necessary and sufficient for a task. Necessity and sufficiency are exactly the causal tests a mere correlation fails (Can sparse weight training make neural networks interpretable by design?).
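Necessity and sufficiency can be phrased as two ablation checks: remove the candidate circuit and see whether performance drops (necessity), then remove everything except the circuit and see whether performance survives (sufficiency). The following is a sketch on a toy linear model; the six-unit layer, the `readout` weights, and the function names are hypothetical, and zero-ablation is an assumed choice rather than the method of the cited work.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 6))       # inputs routed one-to-one into 6 hidden units
readout = np.array([1.0, -1.0, 0.0, 0.0, 0.0, 0.0])  # assumed output weights
target = x[:, 0] - x[:, 1]          # the toy task the model solves

def task_error(keep_mask):
    """Mean squared error when units outside keep_mask are zero-ablated."""
    hidden = x * keep_mask
    return np.mean((hidden @ readout - target) ** 2)

def necessity_and_sufficiency(circuit, n_units=6, tol=1e-9):
    """Return (necessary, sufficient) for a candidate set of unit indices."""
    circuit_mask = np.zeros(n_units)
    circuit_mask[list(circuit)] = 1.0
    baseline = task_error(np.ones(n_units))
    # Necessary: ablating the circuit makes the task error worse.
    necessary = task_error(1.0 - circuit_mask) > baseline + tol
    # Sufficient: keeping only the circuit preserves baseline error.
    sufficient = task_error(circuit_mask) <= baseline + tol
    return bool(necessary), bool(sufficient)

print(necessity_and_sufficiency([0, 1]))   # the units the readout actually uses
print(necessity_and_sufficiency([2, 3]))   # units the readout ignores
print(necessity_and_sufficiency([0]))      # half of the real circuit
```

In this toy, units 0 and 1 together pass both tests, units 2 and 3 pass neither, and unit 0 alone is necessary but not sufficient. A superset such as `[0, 1, 2]` would also pass both tests, so these checks alone do not establish that a circuit is minimal.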

The distinction also reaches beyond interpretability into how we model reasoning itself. Causal belief networks are powerful but cannot capture associative or analogical links, which means some real cognitive relationships are non-causal by nature: a useful reminder that 'correlation' is not always a failure to find causation (Can causal models alone capture how humans actually reason?). And LLMs reproduce human causal-reasoning errors such as weak explaining-away and Markov violations in collider networks. The source attributes this to training-data statistics rather than to a categorical inferiority in reasoning, which makes it a case where a behavioral regularity may trace to statistical patterns in the data rather than to a dedicated causal mechanism inside the model (Do large language models make the same causal reasoning mistakes as humans?).
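For readers unfamiliar with the term, explaining-away is the normative drop in belief in one cause once a competing cause of the same effect is known to be present. The sketch below computes it by enumeration in a collider network C1 → E ← C2. The priors, the noisy-OR form, and the strength value are illustrative assumptions, not parameters from the cited study.

```python
from itertools import product

P_C1, P_C2 = 0.5, 0.5   # assumed priors on the two causes
STRENGTH = 0.8          # assumed noisy-OR strength of each cause, no leak term

def p_effect(c1, c2):
    """P(E=1 | C1=c1, C2=c2) under a noisy-OR combination."""
    return 1.0 - (1.0 - STRENGTH * c1) * (1.0 - STRENGTH * c2)

def joint(c1, c2, e):
    """Joint probability P(C1=c1, C2=c2, E=e)."""
    prior = (P_C1 if c1 else 1 - P_C1) * (P_C2 if c2 else 1 - P_C2)
    pe = p_effect(c1, c2)
    return prior * (pe if e else 1 - pe)

def posterior_c1(e, c2=None):
    """P(C1=1 | E=e), or P(C1=1 | E=e, C2=c2) when c2 is given."""
    c2_values = (0, 1) if c2 is None else (c2,)
    num = sum(joint(1, v, e) for v in c2_values)
    den = sum(joint(c, v, e) for c, v in product((0, 1), c2_values))
    return num / den

effect_only = posterior_c1(1)           # belief in C1 after observing the effect
effect_and_c2 = posterior_c1(1, c2=1)   # belief in C1 once C2 is also known
print(effect_only, effect_and_c2)
```

With these assumed numbers the posterior on C1 falls from 0.6875 to about 0.545 once C2 is observed. "Weak explaining-away" means a reasoner's judgments drop by less than this normative amount.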

The thing worth walking away with: a representational feature earns its name by surviving intervention. Everything decodable is a candidate; only what moves the output when you perturb it is real. The whole field's anxiety is that its favorite measuring sticks reward the candidates that are easiest to see, not the ones that actually do the work.

Sources (7 notes)

Can we understand LLM mechanisms with only representational analysis?

Representational analysis alone identifies correlations without causation; causal analysis alone shows behavioral effects without explaining them. Only paired methods—locating candidate features representationally, then verifying causally—produce complete mechanistic claims.

Do standard analysis methods hide nonlinear features in neural networks?

PCA, linear regression, and RSA over-represent simple linear features while under-representing equally important nonlinear features. Homomorphic encryption demonstrates that networks can compute perfectly well with no interpretable activation structure, proving representation patterns and computation can be entirely decoupled.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Can we trigger reasoning without explicit chain-of-thought prompts?

SAE-identified reasoning features can be directly steered to match or exceed chain-of-thought performance across six model families. This reasoning mode activates early in generation and overrides surface-level instructions, suggesting latent reasoning is a fundamental capability independent of explicit prompting.

Can sparse weight training make neural networks interpretable by design?

Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.

Can causal models alone capture how humans actually reason?

Causal belief networks excel at modeling causal reasoning but cannot represent associative links, analogical mappings, or emotion-driven belief shifts. The GenMinds framework itself acknowledges this as a tractable starting point rather than a complete theory.

Do large language models make the same causal reasoning mistakes as humans?

LLMs show weak explaining away and Markov violations in collider networks, matching human error patterns exactly. This suggests shared mechanisms rooted in training data statistics rather than categorical reasoning inferiority.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a mechanistic interpretability researcher. The question remains urgent: **What distinguishes a representational feature from a causally inert correlation?** A curated library (spanning 2024–2026) argues the answer is *intervention*, not inspection — but the field's measurement tools are biased toward easy patterns, and newer models may have altered what's visible.

**What a curated library found — and when (dated claims, not current truth):**
- Representational analysis alone yields correlation; causality requires separate intervention (ablation, steering, ablation-and-resynthesis). ~2025.
- Standard probes (PCA, linear regression, RSA) systematically over-weight linear structure and miss nonlinear features; a homomorphic-encrypted net computes perfectly with *no* interpretable activation structure, proving representation and computation decouple. ~2025.
- SAE-identified reasoning features, when steered, match or exceed chain-of-thought across six model families, activating early enough to override surface instructions — the gold standard for confirmed causality. ~2025–2026.
- Sparse-weight training builds circuits where ablation studies confirm necessity and sufficiency; these are the causal tests a mere correlation fails. ~2025.
- LLMs reproduce human causal-reasoning errors (weak explaining-away, Markov violations) traceable to surface statistics rather than internal causal mechanisms. ~2025.

**Anchor papers (verify; mind their dates):**
- arXiv:2507.08017 (2025-07): *Mechanistic Indicators of Understanding in Large Language Models*
- arXiv:2511.13653 (2025-11): *Weight-sparse transformers have interpretable circuits*
- arXiv:2502.10215 (2025-02): *Do Large Language Models Reason Causally Like Us?*
- arXiv:2507.22216 (2025-07): *Representation biases: will we achieve complete understanding by analyzing representations*

**Your task:**
(1) **RE-TEST THE INTERVENTION STANDARD.** Every finding above assumes intervention (steering, ablation) is *feasible and trustworthy*. Has mechanistic work since ~2026-06 shown interventions themselves can be deceptive — e.g., that steering a SAE feature may not isolate its causal role, or that ablation masks distributional shift? Judge separately: Is the *durable question* (feature vs. inert correlation) still open under current methods? Where does the intervention standard still hold? Where has it cracked?

(2) **Surface the strongest tension.** The library claims probes are biased toward linearity, yet steering work succeeds on SAE features (which are *sparse*, often nonlinear in interaction). Does recent work (last 6 mo.) reconcile or sharpen this? Does it show linearity bias is *sometimes* masking, *sometimes* isolating the right thing?

(3) **Propose 2 research questions assuming the regime shifted:** (a) If models have since shifted architecture/training to *intentionally decorrelate* useful features (e.g., to resist extraction), how would you distinguish causality then? (b) If multi-agent or ensemble behavior is now the norm, does the feature/correlation distinction break down across system boundaries?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
