What test distinguishes genuine compositionality from fractured feature presence?
This explores how you'd actually tell whether a model has *built* compositional structure versus merely having the right pieces sitting around in a disorganized heap — and why the obvious test (can I read the features out?) turns out to be the wrong one.
This question is really about a trap: a model can hold every feature a task needs and still not be composing them. The tempting test (can I linearly decode the constituent features from the hidden activations?) is exactly the one that fails. One line of work shows that models trained with ordinary gradient descent can carry *all* the linearly decodable features for a task while their internal organization is fundamentally broken; the breakage is invisible to accuracy and only shows up under perturbation or distribution shift [Can models be smart without organized internal structure?]. So 'the feature is present and readable' is the fractured-presence signature, not the compositionality signature. Worse, a separate result finds that linear decodability of constituents *reliably predicts* compositional success, but only when training data already covers the combinations [Can neural networks learn compositional skills without symbolic mechanisms?]. Read together, those two say the quiet part out loud: decodability tracks whether the model has *seen* the pieces combined, not whether it can combine them itself.
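To make the decodability trap concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption rather than anything taken from the cited notes: the two encoders `factored` and `tangled` are hypothetical hidden codes for two binary features, and a perceptron stands in for a linear probe. The point is only that the probe scores the same on both codes, so probe accuracy alone cannot tell an organized representation from a disorganized one.

```python
from itertools import product

def factored(a, b):
    # Hypothetical "organized" code: one hidden unit per feature.
    return [float(a), float(b)]

def tangled(a, b):
    # Hypothetical "disorganized" code: every unit mixes both features.
    return [a + 2.0 * b, 2.0 * a - b]

def train_probe(xs, ys, epochs=100):
    """Fit a perceptron with bias, used here as a minimal linear probe."""
    w, bias = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            score = sum(wi * xi for wi, xi in zip(w, x)) + bias
            err = y - (1 if score > 0 else 0)  # 0 when the probe is right
            w = [wi + err * xi for wi, xi in zip(w, x)]
            bias += err
    return w, bias

def probe_accuracy(encode, feature_index):
    """Train a probe for one input feature and score it on the same points."""
    inputs = list(product([0, 1], repeat=2))
    xs = [encode(a, b) for a, b in inputs]
    ys = [pair[feature_index] for pair in inputs]
    w, bias = train_probe(xs, ys)
    preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + bias > 0 else 0
             for x in xs]
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)

for name, encode in [("factored", factored), ("tangled", tangled)]:
    print(name, [probe_accuracy(encode, i) for i in (0, 1)])
```

Both codes are linearly separable for both features, so the probe reads each feature out perfectly in either case. Whatever separates the two codes has to be measured some other way.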
The test that actually discriminates is novelty under load. Transformers that look compositional often turn out to be memorizing computation subgraphs from training and stitching them together by pattern-matching; push them to *novel* compositions and they fail sharply, with errors compounding step by step [Do transformers actually learn systematic compositional reasoning?]. The same shape appears in language: grammatical competence degrades predictably as syntactic depth and embedding increase, which is what you'd expect from surface heuristics rather than a recursive rule [Does LLM grammatical performance decline with structural complexity?] [Why do large language models fail at complex linguistic tasks?]. Genuine compositionality should hold accuracy roughly flat as you recombine and nest; fractured presence falls off a cliff exactly where recombination starts.
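A held-out-composition check can be sketched as follows. This is a toy under stated assumptions, not the evaluation used in the cited notes: the primitives in `PRIMITIVES`, the lookup-table `make_memorizer` model with its "apply only the last primitive" fallback, and the particular held-out split are all hypothetical. The split keeps every primitive visible in both positions during training, so any failure on the held-out programs is a failure to recombine, not a missing piece.

```python
from itertools import product

# Hypothetical primitive operations on digits 0..9.
PRIMITIVES = {
    "inc": lambda x: (x + 1) % 10,
    "dbl": lambda x: (2 * x) % 10,
    "neg": lambda x: (-x) % 10,
}

def compose(program, x):
    """Ground truth: apply each named primitive in order."""
    for name in program:
        x = PRIMITIVES[name](x)
    return x

def make_memorizer(train_programs):
    """Model that stores seen (program, input) pairs instead of a rule."""
    table = {(p, x): compose(p, x) for p in train_programs for x in range(10)}
    def model(program, x):
        # Unseen program: fall back to a surface heuristic (last step only).
        return table.get((program, x), PRIMITIVES[program[-1]](x))
    return model

def composer(program, x):
    """Model that actually composes the primitives."""
    return compose(program, x)

def accuracy(model, programs):
    cases = [(p, x) for p in programs for x in range(10)]
    return sum(model(p, x) == compose(p, x) for p, x in cases) / len(cases)

all_programs = list(product(PRIMITIVES, repeat=2))
held_out = [("inc", "dbl"), ("neg", "inc"), ("dbl", "neg")]
train = [p for p in all_programs if p not in held_out]

memorizer = make_memorizer(train)
for name, model in [("memorizer", memorizer), ("composer", composer)]:
    print(name, "seen:", accuracy(model, train),
          "novel:", accuracy(model, held_out))
```

The quantity to watch is the gap between the "seen" and "novel" columns. The same harness extends to depth by raising `repeat` and plotting accuracy against program length.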
The stronger, less foolable test is causal rather than behavioral: cut a piece out and see if only the matching function breaks. Pruning and ablation experiments show that networks which truly compose implement subroutines in *isolated* subnetworks (ablate one and only its corresponding function degrades), and that pretraining makes this modular separation more consistent [Do neural networks naturally learn modular compositional structure?]. That's the cleanest distinguisher in the corpus: not 'is the feature readable' but 'is the feature *separable and intervenable*.' A fractured model has the features entangled, so ablations smear across functions; a compositional one has them factored, so ablations are surgical.
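The surgical-versus-smeared contrast can be illustrated with a two-unit toy. This is a sketch under assumptions, not the pruning procedure from the cited note: `FACTORED` and `TANGLED` are hypothetical networks that both compute the two outputs exactly when intact, and `selectivity` is an illustrative score (the share of a unit's ablation damage that lands on its most-affected function), not an established metric.

```python
from itertools import product

INPUTS = list(product([0, 1], repeat=2))

# Each toy "network" is an (encode, readout) pair; both recover (a, b) exactly.
FACTORED = (lambda a, b: [float(a), float(b)],
            lambda h: (h[0], h[1]))
TANGLED = (lambda a, b: [a + 2.0 * b, 2.0 * a - b],
           lambda h: ((h[0] + 2.0 * h[1]) / 5.0,
                      (2.0 * h[0] - h[1]) / 5.0))

def ablation_damage(network, unit):
    """Mean absolute error per output function after zeroing one hidden unit."""
    encode, readout = network
    damage = [0.0, 0.0]
    for a, b in INPUTS:
        h = encode(a, b)
        h[unit] = 0.0  # the ablation
        out = readout(h)
        damage[0] += abs(out[0] - a) / len(INPUTS)
        damage[1] += abs(out[1] - b) / len(INPUTS)
    return damage

def selectivity(network):
    """Share of each unit's damage that lands on its most-affected function."""
    scores = []
    for unit in (0, 1):
        damage = ablation_damage(network, unit)
        scores.append(max(damage) / sum(damage))
    return scores

print("factored:", selectivity(FACTORED))
print("tangled: ", selectivity(TANGLED))
```

A score of 1.0 means an ablation hit exactly one function; lower scores mean the damage spread across functions. In the factored toy each ablation is fully selective, while in the tangled toy both outputs degrade whichever unit is removed.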
There's a useful cross-domain echo here. The same recall-vs-structure gap shows up in verification, where pooled cosine similarity will happily accept a 'structural near-miss' that has all the right tokens in the wrong arrangement; the fix is a verifier that operates on the full token-token interaction map rather than a compressed vector [Can verification separate structural near-misses from topical matches?]. That's the same lesson in miniature: presence of the right parts (high recall, decodable features) is a different and weaker thing than correct relational structure, and you only catch the difference by looking at *interactions*, not at compressed summaries.
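A small worked example shows why a compressed summary cannot see arrangement. The one-hot embeddings in `EMB` and the `aligned_score` function are assumptions made for illustration; in particular, `aligned_score` is a hand-written stand-in that reads the diagonal of the interaction map, whereas the cited note describes a small learned Transformer verifier. The sketch compares pooled cosine, a MaxSim-style score, and the diagonal score on a sentence and its reordered near-miss.

```python
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

def pooled_cosine(a_tokens, b_tokens):
    """Compress each sequence to one summed vector, then compare."""
    pool = lambda tokens: [sum(column) for column in zip(*tokens)]
    return cosine(pool(a_tokens), pool(b_tokens))

def interaction_map(a_tokens, b_tokens):
    """Full token-token cosine similarity matrix."""
    return [[cosine(u, v) for v in b_tokens] for u in a_tokens]

def maxsim(a_tokens, b_tokens):
    """MaxSim-style score: best match per query token, averaged."""
    sim = interaction_map(a_tokens, b_tokens)
    return sum(max(row) for row in sim) / len(sim)

def aligned_score(a_tokens, b_tokens):
    """Stand-in verifier: mean of the map's diagonal (equal lengths assumed)."""
    sim = interaction_map(a_tokens, b_tokens)
    return sum(sim[i][i] for i in range(len(sim))) / len(sim)

# Hypothetical one-hot token embeddings.
EMB = {"dog": [1.0, 0.0, 0.0], "bites": [0.0, 1.0, 0.0], "man": [0.0, 0.0, 1.0]}
query = [EMB[w] for w in "dog bites man".split()]
near_miss = [EMB[w] for w in "man bites dog".split()]

print("pooled cosine:", pooled_cosine(query, near_miss))
print("maxsim:       ", maxsim(query, near_miss))
print("aligned score:", aligned_score(query, near_miss))
```

Pooling and MaxSim both discard position, so the reordered sentence looks like a perfect match to them. Only the score that reads the structure of the interaction map drops.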
The thing you didn't know you wanted to know: there's a hopeful flip side. Scaling can produce real compositional generalization with no architectural tricks at all, *provided the training distribution covers the combinations* [Can neural networks learn compositional skills without symbolic mechanisms?]. So 'fractured feature presence' isn't a permanent verdict on a model; it's often a verdict on its training coverage. The discriminating test (held-out novel compositions plus surgical ablation) is therefore also a diagnostic for *what to feed the model next*, not just a pass/fail stamp.
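Read as a diagnostic, the test amounts to cross-referencing failures against training coverage. The helper below is a hypothetical sketch: `coverage_report`, its field names, and the example module names are all invented for illustration. It separates failures on combinations the training set never covered (candidates for more data) from failures on combinations it did cover (where coverage is not the explanation).

```python
from itertools import product

def coverage_report(modules, train_combos, failed_combos):
    """Split failed compositions by whether training ever covered them."""
    all_combos = set(product(modules, repeat=2))
    seen = set(train_combos) & all_combos
    unseen = all_combos - seen
    failed = set(failed_combos)
    return {
        "coverage": len(seen) / len(all_combos),
        "feed_next": sorted(failed & unseen),  # coverage gap: add data here
        "suspect": sorted(failed & seen),      # covered but still failing
    }

# Hypothetical modules and evaluation results.
modules = ["parse", "sort", "sum"]
train_combos = [("parse", "sort"), ("sort", "sum"), ("parse", "parse")]
failed_combos = [("sum", "sort"), ("parse", "sort")]

report = coverage_report(modules, train_combos, failed_combos)
print(report)
```

Entries under `feed_next` point at training data to add; entries under `suspect` are the ones worth probing with ablation, since coverage alone does not explain them.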
Sources (7 notes)
Can models be smart without organized internal structure? Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift in ways that are invisible to standard evaluation metrics.
Can neural networks learn compositional skills without symbolic mechanisms? Standard MLPs achieve compositional generalization through data and model scaling alone, without architectural modifications, provided the training distribution sufficiently covers combinations of task modules. Linear decodability of constituents from hidden activations reliably predicts success.
Do transformers actually learn systematic compositional reasoning? Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.
Does LLM grammatical performance decline with structural complexity? LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.
Why do large language models fail at complex linguistic tasks? Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
Do neural networks naturally learn modular compositional structure? Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.
Can verification separate structural near-misses from topical matches? A two-stage pipeline (pooled-cosine recall followed by a small Transformer verifier operating on token-token similarity maps) reliably rejects structural near-misses that MaxSim-style late interaction cannot. The verifier succeeds because it operates on full token interaction patterns rather than compressed vectors.