How do ablation studies reveal function without representational characterization?

This explores a methodological split in interpretability: ablation (knock out a part, watch what breaks) tells you what a component *does* without ever telling you what it *represents* — and whether that's a gap or a feature.

You can learn what a network does by deleting pieces of it, without ever decoding what those pieces *mean*. Ablation is a causal test: remove a subnetwork, see which behavior collapses, and you have evidence that the part is necessary for that function. It says nothing about how the function is encoded. The corpus suggests this is less a weakness than a division of labor, and sometimes ablation is the only honest tool you have.

The cleanest case for ablation-as-function comes from work showing that networks quietly carve themselves into modular subnetworks: pruning experiments find that compositional subroutines live in isolated pieces, and ablating one affects only its matching function (Do neural networks naturally learn modular compositional structure?). That is pure functional carving: you have mapped a part to a job without reading a single activation. Weight-sparsity work pushes the same logic further: in transformers trained with sparse weights, ablations confirm that a circuit is both *necessary and sufficient* for a task (Can sparse weight training make neural networks interpretable by design?). GRAM's ablations do the same for a training method rather than a region: by knocking out components, they show that the gains come from amortized variational inference, not from added randomness (Does adding randomness to recursive models actually help reasoning?).
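
The part-to-job mapping described above can be sketched with a toy network. This is a minimal illustration under loud assumptions: the weights are hand-built rather than trained, the two "tasks" (copy the first input, copy the second input) are hypothetical, and names such as `forward` and `task_errors` are invented for the example rather than taken from any of the cited notes.

```python
import random

def relu(v):
    return v if v > 0.0 else 0.0

# Hand-built weights (an illustrative assumption, not a trained model).
# Hidden units 0-1 carry task A (copy x[0]); units 2-3 carry task B (copy x[1]).
W_IN = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]  # hidden <- input
W_OUT = [(1.0, -1.0, 0.0, 0.0), (0.0, 0.0, 1.0, -1.0)]     # output <- hidden

def forward(x, ablated=frozenset()):
    """Run the network, zeroing every hidden unit whose index is in `ablated`."""
    hidden = [relu(w0 * x[0] + w1 * x[1]) for (w0, w1) in W_IN]
    hidden = [0.0 if i in ablated else h for i, h in enumerate(hidden)]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W_OUT]

def task_errors(ablated=frozenset(), n=200, seed=0):
    """Mean absolute error on task A and task B over random inputs."""
    rng = random.Random(seed)
    err = [0.0, 0.0]
    for _ in range(n):
        x = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        y = forward(x, ablated)
        err[0] += abs(y[0] - x[0])
        err[1] += abs(y[1] - x[1])
    return [e / n for e in err]

baseline = task_errors()                  # intact network
knock_a = task_errors(frozenset({0, 1}))  # ablate the task-A units
knock_b = task_errors(frozenset({2, 3}))  # ablate the task-B units

print("baseline:", baseline)
print("ablate A:", knock_a)
print("ablate B:", knock_b)
```

In this sketch each knock-out raises the error on exactly one task and leaves the other untouched. The analysis never inspects what any hidden unit encodes; it only compares behavior before and after removal.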

The deeper reason representational characterization lags behind is that representation and computation can come apart entirely. Standard analysis tools (PCA, RSA, linear regression) are systematically biased toward simple linear features and miss nonlinear ones. Homomorphic encryption makes the point brutally: a network can compute perfectly with *no* interpretable activation structure at all (Do standard analysis methods hide nonlinear features in neural networks?). If the representation can be unreadable while the function is intact, then a behavioral knock-out test is sometimes the *only* test that still works. This is why the field's answer is not 'pick one' but 'pair them': representational analysis finds correlations without causation, causal/ablation analysis shows effects without explaining them, and only locating a candidate feature representationally *then* verifying it causally yields a complete mechanistic claim (Can we understand LLM mechanisms with only representational analysis?).
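
A small sketch of how a linear tool can miss a feature that ablation still catches. It is a toy, not the homomorphic-encryption case: the "model" is assumed to compute the product of two ±1 activations (a parity-style nonlinear feature), and `pearson` and `model` are names made up for this example.

```python
from itertools import product

def pearson(xs, ys):
    """Pearson correlation, written out to avoid version-specific helpers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def model(a, b, ablate_a=False):
    """Toy computation: output is the product of two +/-1 activations."""
    if ablate_a:
        a = 0.0  # mean-ablation: replace unit a with its dataset mean
    return a * b

inputs = list(product([-1.0, 1.0], repeat=2))
targets = [model(a, b) for a, b in inputs]

# Linear view: how well does each unit, on its own, track the output?
corr_a = pearson([a for a, _ in inputs], targets)
corr_b = pearson([b for _, b in inputs], targets)

# Causal view: how far does the output move when unit a is knocked out?
ablated_error = sum(
    abs(model(a, b, ablate_a=True) - t) for (a, b), t in zip(inputs, targets)
) / len(inputs)

print("correlation of unit a with output:", corr_a)
print("correlation of unit b with output:", corr_b)
print("mean output error after ablating a:", ablated_error)
```

Here both correlations come out at zero, so a purely linear readout would report that neither unit carries the feature, yet ablating one unit changes every output. The pairing the text describes would start from a richer (nonlinear) probe and then confirm the candidate with exactly this kind of knock-out.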

There's a sharp cautionary flip side. The 'fractured entangled representation' work shows that two networks can produce identical outputs across every input while having radically different, tangled internal structure, and that weight perturbations expose the fracture that benchmarks can't (Can identical outputs hide broken internal representations?). A model can pass every test and still be internally incoherent (Can AI pass every test while understanding nothing?). That cuts both ways for ablation: a clean knock-out result tells you a part is load-bearing, but it can't tell you whether the underlying representation is clean or a fragile tangle that will shatter on transfer. Notably, where representational structure *is* legible, it can predict function directly: linear decodability of task constituents from hidden states reliably forecasts compositional success (Can neural networks learn compositional skills without symbolic mechanisms?), which is exactly the complementary move ablation can't make.
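
The idea that output-identical networks can differ internally, and that perturbing weights exposes the difference, can be sketched as follows. This is a deliberately crude stand-in and not the cited work's setup (which compares SGD-trained and evolved networks): the two one-line "networks", the cancelling weights 1000 and -999, and the 1% perturbation are all assumptions chosen for illustration.

```python
def net_clean(x, w=(1.0,)):
    """Computes y = x with a single unit weight."""
    return w[0] * x

def net_tangled(x, w=(1000.0, -999.0)):
    """Also computes y = x, but via two large weights that nearly cancel."""
    return w[0] * x + w[1] * x

def perturb(w, rel=0.01):
    """Scale the first weight by (1 + rel); leave the others unchanged."""
    return (w[0] * (1.0 + rel),) + tuple(w[1:])

# Small integer-valued inputs keep the float arithmetic exact.
inputs = [float(i) for i in range(-4, 5)]

# Behavioral test: the two networks agree on every input.
same_outputs = all(net_clean(x) == net_tangled(x) for x in inputs)

def sensitivity(net, w):
    """Largest output change on `inputs` after perturbing the weights."""
    return max(abs(net(x, perturb(w)) - net(x, w)) for x in inputs)

sens_clean = sensitivity(net_clean, (1.0,))
sens_tangled = sensitivity(net_tangled, (1000.0, -999.0))

print("identical outputs:", same_outputs)
print("sensitivity (clean):  ", sens_clean)
print("sensitivity (tangled):", sens_tangled)
```

An output-only benchmark cannot tell these two apart, while the same small weight perturbation moves the tangled network's output far more than the clean one's. An ablation would likewise report that the weights in both networks are load-bearing without distinguishing the clean arrangement from the fragile one.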

So the honest synthesis: ablation reveals function precisely *because* it sidesteps representation — it asks 'does behavior survive removal?' rather than 'what is stored here?' That makes it robust when representations are unreadable, but blind to whether a working circuit is well-formed or merely well-behaved. Function and structure are separable axes, and the most reliable claims pin down both.

Sources (8 notes)

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Can sparse weight training make neural networks interpretable by design?

Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.

Does adding randomness to recursive models actually help reasoning?

GRAM's ablations show naive stochasticity added to existing recursive models yields no improvement. Gains come specifically from amortized variational inference, which couples sampling to a principled generative objective and learns where to branch rather than injecting undirected noise.

Do standard analysis methods hide nonlinear features in neural networks?

PCA, linear regression, and RSA over-represent simple linear features while under-representing equally important nonlinear features. Homomorphic encryption demonstrates that networks can compute perfectly well with no interpretable activation structure, proving representation patterns and computation can be entirely decoupled.

Can we understand LLM mechanisms with only representational analysis?

Representational analysis alone identifies correlations without causation; causal analysis alone shows behavioral effects without explaining them. Only paired methods—locating candidate features representationally, then verifying causally—produce complete mechanistic claims.

Can identical outputs hide broken internal representations?

Networks trained with SGD reproduce outputs perfectly while having radically different internal structure than evolved networks, with weight perturbations revealing fractured, entangled representations that prevent transfer to novel contexts or creative recombination.

Can AI pass every test while understanding nothing?

The Fractured Entangled Representation hypothesis shows that SGD-trained networks can produce identical outputs across all inputs while maintaining radically different internal representations. Standard benchmarks cannot detect this structural difference.

Can neural networks learn compositional skills without symbolic mechanisms?

Standard MLPs achieve compositional generalization through data and model scaling alone, without architectural modifications, provided the training distribution sufficiently covers combinations of task modules. Linear decodability of constituents from hidden activations reliably predicts success.
