Do feature extraction methods systematically miss computationally important complex features?

This explores whether the tools we use to inspect what a neural network has learned — PCA, linear probes, similarity analysis — are blind to exactly the complex, nonlinear features that do the real computational work.

The corpus says yes, and unusually directly: the methods most analysts reach for are systematically biased toward simple features. The note 'Do standard analysis methods hide nonlinear features in neural networks?' shows that PCA, linear regression, and RSA over-represent linearly decodable structure while under-representing equally important nonlinear features. The sharpest demonstration is an existence proof: a homomorphically encrypted network computes perfectly with no interpretable activation structure at all, which means a representation pattern and the computation it supposedly explains can be completely decoupled. If a network can compute well while showing analysts nothing, then 'nothing visible' tells you nothing about what's being computed.
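A toy version of the linear-decodability bias can be sketched in a few lines. This is an illustrative construction, not an experiment from the cited note: the activations, the XOR label, and the helper `linear_probe_accuracy` are all assumptions made for the example. Two binary factors are stored in a two-dimensional activation, and the task label is their XOR. A least-squares linear probe recovers either factor almost perfectly, yet falls well short on the XOR label that the same activations fully determine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hidden binary factors; the task label is their XOR (a purely nonlinear feature).
a = rng.integers(0, 2, size=2000)
b = rng.integers(0, 2, size=2000)
label = a ^ b

# Toy "activations" (an assumption): the two factors in +/-1 coding plus small noise.
acts = np.stack([2 * a - 1, 2 * b - 1], axis=1) + 0.1 * rng.standard_normal((2000, 2))


def linear_probe_accuracy(features, target):
    """Fit a least-squares linear probe (with bias) and report its training accuracy."""
    design = np.column_stack([features, np.ones(len(features))])
    weights, *_ = np.linalg.lstsq(design, 2.0 * target - 1.0, rcond=None)
    predictions = (design @ weights > 0).astype(int)
    return float((predictions == target).mean())


acc_a = linear_probe_accuracy(acts, a)        # linearly decodable factor
acc_xor = linear_probe_accuracy(acts, label)  # nonlinear feature, same activations

# Handing the probe the interaction term makes the "missing" feature visible.
with_interaction = np.column_stack([acts, acts[:, 0] * acts[:, 1]])
acc_xor_interaction = linear_probe_accuracy(with_interaction, label)

print(f"probe for a:                 {acc_a:.2f}")
print(f"probe for a XOR b:           {acc_xor:.2f}")
print(f"probe with interaction term: {acc_xor_interaction:.2f}")
```

The information needed for the XOR label never left the activations. Only the probe's hypothesis class changed between the second and third measurement, which is the sense in which a low probe score says more about the tool than about the representation.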

The quieter danger is the inverse: a clean-looking representation that is actually broken underneath. The note 'Can models be smart without organized internal structure?' finds models that contain all the linearly decodable features a task needs while their internal organization is fractured, leaving them fragile to perturbation and distribution shift in ways that accuracy and linear-probe scores never reveal. So the bias cuts both ways: simple-feature methods can miss real computation that's there, and can certify organization that isn't. Either way, what's legible to the probe is not what's load-bearing in the model.
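One narrow piece of this claim is easy to reproduce in miniature: clean linear-probe accuracy cannot tell a robustly encoded feature from a fragile one. The sketch below is a deliberately simple stand-in, not the fractured-representation mechanism the note describes. The names `make_representation`, `fit_probe`, and `probe_accuracy`, and all scales, are assumptions chosen for illustration. Both representations carry the task feature along one direction, but in the fragile one that direction is a hundred times smaller than in the robust one.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
label = rng.integers(0, 2, size=n)
signal = 2.0 * label - 1.0  # +/-1 coding of the task feature


def make_representation(signal_scale):
    """Hypothetical activations: one task direction plus three large nuisance directions."""
    nuisance = 5.0 * rng.standard_normal((n, 3))
    return np.column_stack([signal_scale * signal, nuisance])


def fit_probe(features, target):
    """Least-squares linear probe with a bias term."""
    design = np.column_stack([features, np.ones(len(features))])
    weights, *_ = np.linalg.lstsq(design, 2.0 * target - 1.0, rcond=None)
    return weights


def probe_accuracy(weights, features, target):
    """Accuracy of a fitted probe on the given features."""
    design = np.column_stack([features, np.ones(len(features))])
    return float(((design @ weights > 0).astype(int) == target).mean())


robust = make_representation(signal_scale=1.0)
fragile = make_representation(signal_scale=0.01)

results = {}
for name, rep in [("robust", robust), ("fragile", fragile)]:
    weights = fit_probe(rep, label)
    # Apply a small perturbation of the same scale to both representations.
    perturbed = rep + 0.1 * rng.standard_normal(rep.shape)
    results[name] = (
        probe_accuracy(weights, rep, label),
        probe_accuracy(weights, perturbed, label),
    )
    print(f"{name}: clean={results[name][0]:.2f} perturbed={results[name][1]:.2f}")
```

On clean data the two probes are indistinguishable. The difference appears only under perturbation, which a standard probe evaluation never applies.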

Where does the missing complexity actually live? Two notes suggest it hides in interactions rather than in any single direction. 'Can verification separate structural near-misses from topical matches?' shows that a verifier reading full token-to-token similarity maps catches structural near-misses that compressed, pooled vectors cannot. The signal exists, but only in the interaction pattern, which is precisely what dimensionality reduction throws away. And 'Which tokens in reasoning chains actually matter most?' shows models internally rank tokens by functional role, preserving symbolic-computation tokens while discarding grammar and filler. The importance structure is real and recoverable, but only if you look at the right granularity, not a global summary.
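To make the pooled-versus-map contrast concrete, here is a minimal sketch with made-up embeddings. It does not implement the Transformer verifier from the note; it only shows what information each view retains. The random token vectors and the helpers `pooled_cosine`, `maxsim`, and `similarity_map` are assumptions for the example. The 'near-miss' is the query's own tokens in reversed order: same content, different structure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical unit-norm token embeddings: 6 tokens, 16 dimensions.
query = rng.standard_normal((6, 16))
query /= np.linalg.norm(query, axis=1, keepdims=True)

# A structural near-miss: the same tokens as the query, in reversed order.
near_miss = query[::-1].copy()


def pooled_cosine(x, y):
    """Cosine similarity between mean-pooled token embeddings."""
    px, py = x.mean(axis=0), y.mean(axis=0)
    return float(px @ py / (np.linalg.norm(px) * np.linalg.norm(py)))


def maxsim(x, y):
    """MaxSim-style late interaction: each token in x takes its best match in y; sum the maxima."""
    return float((x @ y.T).max(axis=1).sum())


def similarity_map(x, y):
    """Full token-to-token cosine similarity map (rows: x tokens, columns: y tokens)."""
    return x @ y.T


# Both compressed scores treat the reordered sequence as a perfect match.
pooled_same = pooled_cosine(query, query)
pooled_miss = pooled_cosine(query, near_miss)
maxsim_same = maxsim(query, query)
maxsim_miss = maxsim(query, near_miss)

# The map keeps the alignment: an exact match lies on the diagonal,
# while the reordered sequence lies on the anti-diagonal.
map_same = similarity_map(query, query)
map_miss = similarity_map(query, near_miss)
diag_same = float(np.trace(map_same)) / len(query)
diag_miss = float(np.trace(map_miss)) / len(query)

print(f"pooled cosine: same={pooled_same:.3f} near-miss={pooled_miss:.3f}")
print(f"MaxSim:        same={maxsim_same:.3f} near-miss={maxsim_miss:.3f}")
print(f"mean diagonal: same={diag_same:.3f} near-miss={diag_miss:.3f}")
```

Mean pooling and MaxSim are both invariant to token order, so neither score can separate the two cases. Any model that reads the full map at least has the positional alignment available to it.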

There's a cross-domain echo worth noticing. 'Why do large language models fail at complex linguistic tasks?' finds that LLM errors get predictably worse as syntactic structure deepens: the model captures surface patterns but not the compositional rule. That mirrors the analysis problem one level up: complexity that arises from composition and nesting is exactly what both the models and the tools that inspect them tend to flatten. Relatedly, 'Why does removing spurious cues sometimes hurt model performance?' reframes a failure as integrating conflicting signals rather than filtering distractors. It is a reminder that the interesting behavior is often a composition of features, not a selection among them, and composition is what simple extraction methods are worst at seeing.

The takeaway you didn't know you wanted: 'we found a clean linear feature' and 'we understand the computation' are nearly independent claims. The corpus suggests that better interpretability may depend less on finding tidier directions and more on learning to read interaction structure — token-token maps, functional rankings, compositional depth — at the granularity where the hard features actually live.

Sources (6 notes)

Do standard analysis methods hide nonlinear features in neural networks?

PCA, linear regression, and RSA over-represent simple linear features while under-representing equally important nonlinear features. Homomorphic encryption demonstrates that networks can compute perfectly well with no interpretable activation structure, proving representation patterns and computation can be entirely decoupled.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Can verification separate structural near-misses from topical matches?

A two-stage pipeline—pooled-cosine recall followed by a small Transformer verifier operating on token-token similarity maps—reliably rejects structural near-misses that MaxSim-style late interaction cannot. The verifier succeeds because it operates on full token interaction patterns rather than compressed vectors.

Which tokens in reasoning chains actually matter most?

Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Why does removing spurious cues sometimes hurt model performance?

Removing spurious cues degrades performance in heuristic override tasks, opposite to shortcut learning predictions. The failure mode is integrating conflicting signals rather than ignoring distractors—a frame problem, not feature selection.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an interpretability researcher evaluating whether feature extraction methods structurally miss computationally important complex features. Treat this as an open question.

What a curated library found — and when (findings from 2023–2026, perishable claims):
• PCA, linear regression, and RSA over-represent linearly decodable structure while under-representing nonlinear features; homomorphically encrypted networks compute perfectly with no interpretable activation structure, decoupling representation patterns from computation (2025–2026).
• Models can exhibit all linearly decodable features needed for a task while their internal organization is fragile to perturbation and distribution shift — accuracy and linear-probe scores mask fundamental fragility (2025).
• Token-to-token similarity maps and functional-importance rankings capture real structure that dimensionality reduction discards; complex features hide in interaction patterns, not single directions (2026).
• LLM errors worsen predictably as syntactic structure deepens — models capture surface patterns but not compositional rules, mirroring how inspection tools flatten composition (2025).
• Heuristic override in reasoning is reframed as integration of conflicting signals rather than filtering — interesting behavior arises from feature composition, not selection (2026).

Anchor papers (verify; mind their dates):
• arXiv:2507.22216 (2025): Representation biases and whether we can achieve complete understanding via representation analysis.
• arXiv:2601.03066 (2026): Do LLMs Encode Functional Importance of Reasoning Tokens?
• arXiv:2603.29025 (2026): The Model Says Walk — How Surface Heuristics Override Implicit Constraints in LLM Reasoning.
• arXiv:2503.19260 (2025): Linguistic Blind Spots of Large Language Models.

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding, assess whether newer models, methods (sparse autoencoders, mechanistic interpretability), evaluation harnesses, or multi-agent orchestration have since relaxed or overturned it. Separate the durable question — do our tools systematically miss compositional/interaction structure? — from perishable limitations. Where do linear probes, SAEs, or token-importance ranking now succeed or fail?
(2) Surface the strongest work from the last ~6 months that CONTRADICTS the library's claim that simple methods are structurally blind to complex features — or that offers a competing explanation for why linear probes succeed despite compositional complexity.
(3) Propose 2 research questions that assume the regime may have shifted: e.g., (a) Do current SAE dictionaries capture interaction features, and if not, why? (b) Can functional-importance hierarchies be scaled to explain reasoning in models >100B parameters?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
