What makes linear decodability a reliable signal of compositionality?
This explores why being able to read a task's building blocks off a model's hidden activations with a simple linear probe tends to predict that the model can recombine those blocks into novel wholes — and where that signal breaks down.
The question is when 'linear decodability' (recovering a task's component pieces from a model's internal activations with nothing fancier than a linear readout) actually tracks compositionality, the capacity to recombine known parts in new ways. The cleanest version of the claim comes from work showing that plain MLPs reach compositional generalization through data and model scaling alone, with no symbolic machinery, provided the training distribution covers enough combinations of the task's modules. In that work, linear decodability of the constituents from hidden activations reliably predicts whether the network will succeed (see: Can neural networks learn compositional skills without symbolic mechanisms?). The intuition: if each ingredient sits in the representation as its own separable direction, the model has factored the problem rather than memorized it whole, and factored parts are the prerequisite for recombining them.
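To make "a simple linear readout" concrete, here is a minimal sketch of a linear probe. It assumes synthetic activations built as the sum of one random direction per constituent plus noise. This is a hypothetical, idealized "factored" model, not data from the cited work. The probe is ordinary logistic regression fit by gradient descent. The random-label control task is a common sanity check added here, not something the source describes: if a probe scores well on labels the activations cannot encode, the probe itself is doing the work.

```python
import math
import random

def sigmoid(z):
    # Clamp to avoid math.exp overflow for large |z|.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def make_directions(dim, seed):
    # One random direction per constituent (hypothetical factored model).
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(2)]

def sample(directions, n, rng, noise=0.3):
    # Each activation is the sum of its constituents' signed directions plus noise.
    u_a, u_b = directions
    rows = []
    for _ in range(n):
        a, b = rng.randint(0, 1), rng.randint(0, 1)
        h = [(2 * a - 1) * x + (2 * b - 1) * y + rng.gauss(0, noise)
             for x, y in zip(u_a, u_b)]
        rows.append((h, a, b))
    return rows

def train_probe(xs, ys, lr=0.1, epochs=200):
    # Plain logistic regression fit by full-batch gradient descent.
    dim, n = len(xs[0]), len(xs)
    w, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + bias) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        bias -= lr * grad_b / n
    return w, bias

def accuracy(probe, xs, ys):
    # Fraction of examples where the sign of the linear score matches the label.
    w, bias = probe
    hits = sum((sum(wi * xi for wi, xi in zip(w, x)) + bias > 0) == (y == 1)
               for x, y in zip(xs, ys))
    return hits / len(xs)

rng = random.Random(1)
dirs = make_directions(dim=8, seed=0)
train, test = sample(dirs, 300, rng), sample(dirs, 200, rng)
X_tr, X_te = [r[0] for r in train], [r[0] for r in test]

results = {}
for name, col in (("a", 1), ("b", 2)):
    y_tr, y_te = [r[col] for r in train], [r[col] for r in test]
    results[name] = accuracy(train_probe(X_tr, y_tr), X_te, y_te)

# Control task: random labels the activations cannot predict; should sit near 0.5.
y_rand_tr = [rng.randint(0, 1) for _ in train]
y_rand_te = [rng.randint(0, 1) for _ in test]
results["control"] = accuracy(train_probe(X_tr, y_rand_tr), X_te, y_rand_te)
print(results)
```

High held-out accuracy for both constituents, with the control near chance, is what "linearly decodable" means operationally. As the rest of this note argues, it shows the parts are present, not that they recombine.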
What makes the signal trustworthy is that it lines up with independent structural evidence pointing the same way. Pruning studies find that networks spontaneously route compositional subroutines into isolated subnetworks, so ablating one subnetwork affects only its own function. That is a mechanistic correlate of 'the parts are separable,' and pretraining makes this modular structure more consistent across architectures and domains (see: Do neural networks naturally learn modular compositional structure?). Probing work also shows that models don't scatter these features randomly: they encode syntactic type and direction in a structured polar geometry, using both the distance and the angle between embeddings, which is the kind of organized, symbol-compatible layout that linear probes can read cleanly (see: How do language models encode syntactic relations geometrically?). When constituents are both modular and geometrically organized, linear decodability is measuring something real about the internal factorization, not an artifact of the probe.
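The ablation logic behind the pruning result can be sketched on a toy network. This is a minimal illustration under loudly stated assumptions: the two-head network, its weights, the choice of which hidden units "implement" subroutine A, and the mean-absolute-change metric are all hypothetical. The cited studies discover subnetworks by pruning trained models. This sketch skips discovery and only shows the test: zero A's units and check which task outputs move.

```python
# Hypothetical toy weights: 2 inputs -> 4 ReLU hidden units -> 2 task heads.
INPUTS = [[1.0, 0.5], [0.2, 1.0], [0.7, 0.7], [1.0, 0.0]]
W_IN = [[1.0, 0.2], [0.5, 0.5], [0.1, 1.0], [0.3, 0.8]]
W_OUT_MODULAR = [[1.0, 0.8, 0.0, 0.0],    # head A reads units 0-1 only
                 [0.0, 0.0, 0.9, 1.1]]    # head B reads units 2-3 only
W_OUT_ENTANGLED = [[1.0, 0.8, 0.5, 0.0],  # head A also leans on unit 2
                   [0.4, 0.0, 0.9, 1.1]]  # head B also leans on unit 0
UNITS_FOR_A = {0, 1}  # hypothetical: units assumed to implement subroutine A

def forward(x, w_out, mask):
    # ReLU hidden layer; mask[i] == 0.0 ablates hidden unit i.
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) * m
              for row, m in zip(W_IN, mask)]
    return [sum(w * h for w, h in zip(head, hidden)) for head in w_out]

def ablation_effects(w_out, units):
    """Mean absolute change in each task head when `units` are zeroed."""
    n_hidden = len(W_IN)
    keep = [1.0] * n_hidden
    cut = [0.0 if i in units else 1.0 for i in range(n_hidden)]
    effects = [0.0] * len(w_out)
    for x in INPUTS:
        before = forward(x, w_out, keep)
        after = forward(x, w_out, cut)
        for t, (b, a) in enumerate(zip(before, after)):
            effects[t] += abs(b - a) / len(INPUTS)
    return effects

modular = ablation_effects(W_OUT_MODULAR, UNITS_FOR_A)
entangled = ablation_effects(W_OUT_ENTANGLED, UNITS_FOR_A)
print("modular   [A, B]:", modular)    # head B is untouched
print("entangled [A, B]:", entangled)  # ablating A's units also moves head B
```

In the modular case the damage stays confined to task A. This is the "ablating one piece touches only its function" pattern. In the entangled case, the same ablation leaks into task B.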
But the same corpus warns sharply against treating decodability as a guarantee. Models trained with SGD can carry every linearly decodable feature a task needs while their broader internal organization stays fractured and brittle. Standard accuracy metrics miss this weakness, but perturbation and distribution shift expose it (see: Can models be smart without organized internal structure?). So decodability tells you the parts are *present*; it doesn't tell you they're *bound* together robustly. That binding gap is exactly the failure the compositionality literature keeps circling: networks struggle to dynamically bind separated pieces into reusable structure (see: Why do neural networks fail at compositional generalization?). Transformers in particular often 'succeed' by matching computation subgraphs memorized from training rather than applying systematic rules, then fail drastically when they meet a genuinely novel combination (see: Do transformers actually learn systematic compositional reasoning?).
Put together, the corpus suggests linear decodability is a reliable signal of compositionality precisely when it co-occurs with coverage and structure. The training distribution has to actually span the combinations of modules, and the parts have to live in a separable, organized representation rather than a tangled one. It's a strong *necessary* indicator, since a model rarely recombines parts it can't even locate. But it is a weak *sufficient* one, because the same readable features can sit inside an organization that doesn't bind or transfer. The honest read is that decodability is a doorway, not a verdict: it earns its reliability only alongside modularity evidence and out-of-distribution tests. Grammatical competence shows a consistent pattern. LLMs handle simple sentences well, but their performance declines systematically as syntactic depth and embedding increase (see: Does LLM grammatical performance decline with structural complexity?). That pattern fits surface heuristics standing in for structural rules that would recombine reliably.
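This paragraph's reading can be turned into a small evaluation protocol. The sketch below makes several assumptions. The two constituent vocabularies (colors and shapes) are hypothetical. The idea is to hold out specific pairings while keeping every individual value in training, so the held-out test probes recombination rather than unseen parts. A probe score is then read together with the gap between in-distribution and held-out accuracy. The function names and the thresholds `probe_min` and `max_gap` are illustrative assumptions; the corpus gives no numeric cutoffs.

```python
from itertools import product

def compositional_split(values_a, values_b, held_out):
    """Split every (a, b) combination into training pairs and held-out pairs.

    Each individual value must still occur in training (the parts are covered);
    only the specific held-out pairings are novel at test time."""
    combos = set(product(values_a, values_b))
    held = set(held_out)
    if not held <= combos:
        raise ValueError("held-out pairs must be valid combinations")
    train = combos - held
    if {a for a, _ in train} != set(values_a) or {b for _, b in train} != set(values_b):
        raise ValueError("every constituent value must appear in some training pair")
    return sorted(train), sorted(held)

def read_signal(probe_acc, in_dist_acc, held_out_acc, probe_min=0.9, max_gap=0.1):
    """Combine a probe score with a held-out test; thresholds are hypothetical."""
    if probe_acc < probe_min:
        return "parts not linearly decodable: recombination unlikely"
    if in_dist_acc - held_out_acc > max_gap:
        return "parts decodable but not bound: fails on novel combinations"
    return "decodability and held-out performance agree"

colors = ["red", "green", "blue"]
shapes = ["circle", "square", "triangle"]
train_pairs, test_pairs = compositional_split(
    colors, shapes, [("red", "square"), ("green", "triangle"), ("blue", "circle")])
print(len(train_pairs), len(test_pairs))  # 6 3
# Illustrative numbers, not measured results:
print(read_signal(probe_acc=0.97, in_dist_acc=0.95, held_out_acc=0.55))
```

The ordering of the checks mirrors the necessary-versus-sufficient argument. A failed probe is treated as strong evidence against recombination. A passed probe only counts once the held-out combinations also succeed.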
Sources (7 notes)
Standard MLPs achieve compositional generalization through data and model scaling alone, without architectural modifications, provided the training distribution sufficiently covers combinations of task modules. Linear decodability of constituents from hidden activations reliably predicts success.
Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.
The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.
Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This leaves them vulnerable to perturbation and distribution shift in ways that standard evaluation metrics do not reveal.
Greff et al. argue that neural networks cannot dynamically bind distributed information into compositional structures due to three failures: segregating entities from inputs, maintaining representational separation, and reusing learned structure in novel combinations. Scaling can partially overcome this by enabling compositional representations to emerge.
Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.
LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.