INQUIRING LINE

Can representation analysis methods detect complex features models compute with?

This explores whether the standard toolkit for reading a model's internals — probes, PCA, regression — can actually surface the complex computations a model performs, or whether those methods quietly miss what matters.


The underlying question is whether our analysis tools see what the model is doing or only the parts that happen to be easy to see, and the corpus leans hard toward the second answer. The sharpest result is that standard methods are systematically biased toward simple, linear features (Do standard analysis methods hide nonlinear features in neural networks?). PCA, linear regression, and RSA over-represent clean linear structure while under-representing equally important nonlinear features. The striking demonstration: a network can compute a task perfectly using homomorphically encrypted activations that show no interpretable structure at all, which shows that what a model's representations reveal and what the model computes can be fully decoupled. So a probe coming up empty doesn't mean the computation isn't there.
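The linear bias described above can be seen in a toy setting. The sketch below is an illustration only, not taken from the cited note: it assumes two hypothetical "activation" units and an XOR-like product feature, and compares a least-squares linear probe against the same probe given the product term as an extra input. The names `r_squared`, `acts`, and `feature` are invented for this example.

```python
import numpy as np

def r_squared(features, target):
    """Fit ordinary least squares (with a bias term) and return R^2."""
    design = np.column_stack([features, np.ones(len(features))])
    weights, *_ = np.linalg.lstsq(design, target, rcond=None)
    residual = target - design @ weights
    total = target - target.mean()
    return 1.0 - (residual @ residual) / (total @ total)

rng = np.random.default_rng(0)

# Toy "activations" (assumption): two units, each taking the value -1 or +1.
acts = rng.choice([-1.0, 1.0], size=(2000, 2))

# Hypothetical nonlinear feature: the XOR-like product of the two units.
feature = acts[:, 0] * acts[:, 1]

# A plain linear probe sees almost nothing, because the product is
# uncorrelated with either unit taken alone.
linear_r2 = r_squared(acts, feature)

# The same probe recovers the feature once the product term is supplied.
expanded = np.column_stack([acts, acts[:, 0] * acts[:, 1]])
expanded_r2 = r_squared(expanded, feature)

print(f"linear probe R^2:   {linear_r2:.3f}")
print(f"expanded probe R^2: {expanded_r2:.3f}")
```

The feature is fully present in the activations in both cases; only the probe's assumed functional form changes, which is why a near-zero linear score on its own says little about whether the computation exists.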

That decoupling shows up from a second angle, too. Two models can post identical accuracy while one has clean internal organization and the other is internally fractured, and the difference is invisible to standard metrics, surfacing only under perturbation or distribution shift (Can models be smart without organized internal structure?). Linear decodability, the very thing a probe rewards, can sit on top of broken internal structure. Performance tells you the features are usable; it tells you nothing about whether they're organized the way you assume.

The corpus also names the fix. Representational analysis alone only ever finds correlations: it locates candidate features but can't show they're the ones the model uses. Pairing it with causal analysis (intervene, ablate, watch behavior change) is what turns a correlation into a mechanistic claim (Can we understand LLM mechanisms with only representational analysis?). This is the working answer to your question: representation analysis can *propose* complex features, but only causal verification confirms the model computes with them.
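A minimal sketch of that pairing, under loudly stated assumptions: the "model" here is a hypothetical two-unit hidden layer with a fixed linear readout, and the intervention is simple mean-ablation. None of the names (`hidden`, `readout`, `ablate`) come from the cited note. The representational step asks how well each unit correlates with the output; the causal step asks whether ablating the unit changes the output.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.normal(size=1000)

# Hypothetical hidden layer: unit 0 carries the signal,
# unit 1 is a noisy copy of it that the readout never uses.
hidden = np.column_stack([signal, signal + 0.1 * rng.normal(size=1000)])
readout = np.array([1.0, 0.0])  # the toy model reads only unit 0

def model_output(h):
    """Toy model: a fixed linear readout of the hidden layer."""
    return h @ readout

def ablate(h, unit):
    """Mean-ablate one unit: replace its values with its average."""
    out = h.copy()
    out[:, unit] = out[:, unit].mean()
    return out

baseline = model_output(hidden)

# Representational step: how strongly does each unit correlate with the output?
probe_corr = [abs(np.corrcoef(hidden[:, u], baseline)[0, 1]) for u in range(2)]

# Causal step: how much does the output change when each unit is ablated?
effect = [float(np.mean((model_output(ablate(hidden, u)) - baseline) ** 2))
          for u in range(2)]

for u in range(2):
    print(f"unit {u}: probe correlation {probe_corr[u]:.3f}, "
          f"ablation effect {effect[u]:.3f}")
```

Both units look like strong candidates representationally, but only ablating unit 0 changes behavior. In this toy case the correlational step proposes two features and the causal step keeps one.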

Where the methods get smarter, they do find genuinely complex structure, which is the encouraging counterweight. A polar-coordinate probe recovers syntactic type *and* direction from activations, nearly doubling accuracy over distance-only probes precisely because it stopped assuming the geometry was simple (How do language models encode syntactic relations geometrically?). Circuit tracing in Claude models reveals a four-tier feature hierarchy running from token-level inputs to abstract concepts to functional operations to outputs (How do language models organize features across processing layers?), and pruning experiments expose modular subnetworks, each implementing an isolated compositional subroutine (Do neural networks naturally learn modular compositional structure?). The pattern: complexity is detectable, but only when the method is built to expect the right shape.
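Why a readout that uses angle can succeed where a distance-only readout cannot is easy to show on synthetic data. The sketch below is not the Polar Probe itself and makes no claim about its implementation. It assumes two hypothetical relation types whose embedding-difference vectors have the same length but point in different directions, and scores each scalar feature with a simple threshold classifier (`threshold_accuracy` is an invented helper).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
labels = rng.integers(0, 2, size=n)  # two hypothetical relation types

# Assumption: type 0 points near 0 radians, type 1 near pi/2, with small jitter.
angles = np.where(labels == 0, 0.0, np.pi / 2) + 0.1 * rng.normal(size=n)

# Difference vectors between paired embeddings: unit length, varying direction.
diffs = np.column_stack([np.cos(angles), np.sin(angles)])

# Distance-only feature (rounded to discard floating-point noise).
distance = np.round(np.linalg.norm(diffs, axis=1), 6)
# Angular feature.
angle = np.arctan2(diffs[:, 1], diffs[:, 0])

def threshold_accuracy(feature, labels):
    """Best accuracy of a single-threshold classifier on one scalar feature."""
    best = 0.0
    for t in np.unique(feature):
        pred = (feature > t).astype(int)
        # Allow either orientation of the threshold rule.
        acc = max(np.mean(pred == labels), np.mean(pred != labels))
        best = max(best, float(acc))
    return best

distance_acc = threshold_accuracy(distance, labels)
angle_acc = threshold_accuracy(angle, labels)
print(f"distance-only accuracy: {distance_acc:.3f}")
print(f"angle accuracy:         {angle_acc:.3f}")
```

Because every difference vector has the same length here, the distance feature can do no better than guessing the majority class, while the angular feature separates the two types cleanly. Real activations are far messier, so this only illustrates the geometric point.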

The quiet warning underneath all this is that detecting a feature isn't the same as the feature being what you think. Transformers that look like they reason compositionally are often just matching memorized computation subgraphs, collapsing the moment the composition is novel (Do transformers actually learn systematic compositional reasoning?). The lesson worth leaving with: representation analysis is a generator of hypotheses about complex computation, not a verifier of it, and the most confident-looking probe result is exactly the one most worth testing causally before you believe it.


Sources (7 notes)

Do standard analysis methods hide nonlinear features in neural networks?

PCA, linear regression, and RSA over-represent simple linear features while under-representing equally important nonlinear features. Homomorphic encryption demonstrates that networks can compute perfectly well with no interpretable activation structure, proving representation patterns and computation can be entirely decoupled.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Can we understand LLM mechanisms with only representational analysis?

Representational analysis alone identifies correlations without causation; causal analysis alone shows behavioral effects without explaining them. Only paired methods—locating candidate features representationally, then verifying causally—produce complete mechanistic claims.

How do language models encode syntactic relations geometrically?

The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.

How do language models organize features across processing layers?

Circuit tracing in Claude models reveals features progress from token-level inputs to abstract concepts to functional operations to outputs. Larger models develop richer abstract features, suggesting scaling enables higher-level conceptual reasoning rather than pattern memorization.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Do transformers actually learn systematic compositional reasoning?

Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.
