What test distinguishes genuine compositionality from fractured feature presence?

This explores how you'd actually tell whether a model has *built* compositional structure versus merely having the right pieces sitting around in a disorganized heap — and why the obvious test (can I read the features out?) turns out to be the wrong one.

The question is really about a trap: a model can hold every feature a task needs and still not be composing them. The tempting test (can I linearly decode the constituent features from the hidden activations?) is exactly the one that fails. One line of work shows that models trained with ordinary gradient descent can carry *all* the linearly decodable features for a task while their internal organization is fundamentally broken; the breakage is invisible to accuracy and only shows up under perturbation or distribution shift (see "Can models be smart without organized internal structure?"). So 'the feature is present and readable' is the fractured-presence signature, not the compositionality signature. Worse, a separate result finds that linear decodability of constituents *reliably predicts* compositional success, but only when training data already covers the combinations (see "Can neural networks learn compositional skills without symbolic mechanisms?"). Read together, those two say the quiet part out loud: decodability tracks whether the model has *seen* the pieces combined, not whether it can combine them itself.
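
A toy sketch may make the trap concrete. Everything here is an assumption made for illustration: the "hidden layer" is a fixed random linear embedding of two binary features, the "model" is a least-squares readout, and the names `hidden`, `fit_linear`, and `accuracy` are hypothetical. It is not the setup of either cited result. Training covers three of the four feature combinations; the probes still read both features off the held-out combination, while the readout trained on the composed target (here, a AND b) has never seen a positive example and misses it.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8
# Hypothetical "hidden layer": a fixed random linear embedding of (a, b, bias).
PROJ = rng.standard_normal((3, DIM))

def hidden(a, b, noise=0.01):
    """Toy activations for binary feature arrays a and b, plus small noise."""
    feats = np.stack([a, b, np.ones_like(a)], axis=1).astype(float)
    return feats @ PROJ + noise * rng.standard_normal((len(a), DIM))

def fit_linear(H, y):
    """Least-squares linear map (with bias) from activations to a target."""
    Hb = np.hstack([H, np.ones((len(H), 1))])
    w, *_ = np.linalg.lstsq(Hb, y.astype(float), rcond=None)
    return w

def accuracy(H, w, y):
    """Fraction of examples where the thresholded linear output matches y."""
    Hb = np.hstack([H, np.ones((len(H), 1))])
    return float(np.mean((Hb @ w > 0.5) == (y == 1)))

# Training covers (0,0), (0,1), (1,0); the combination (1,1) is held out.
a_tr = np.repeat([0, 0, 1], 100)
b_tr = np.repeat([0, 1, 0], 100)
a_te = np.ones(100, dtype=int)
b_te = np.ones(100, dtype=int)
H_tr, H_te = hidden(a_tr, b_tr), hidden(a_te, b_te)

# Constituent probes: is each feature linearly readable on the unseen combination?
probe_a = accuracy(H_te, fit_linear(H_tr, a_tr), a_te)
probe_b = accuracy(H_te, fit_linear(H_tr, b_tr), b_te)
# Composed readout: target is a AND b, which is never 1 in the training set.
composed = accuracy(H_te, fit_linear(H_tr, a_tr & b_tr), a_te & b_te)
print(probe_a, probe_b, composed)
```

The point of the sketch is only that probe accuracy and composed accuracy are separate measurements, so a decodability score on its own says nothing about the second number.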

The test that actually discriminates is novelty under load. Transformers that look compositional often turn out to be memorizing computation subgraphs from training and stitching them by pattern-match; push them to *novel* compositions and they fail sharply, with errors compounding step by step (see "Do transformers actually learn systematic compositional reasoning?"). The same shape appears in language: grammatical competence degrades predictably as syntactic depth and embedding increase, which is what you'd expect from surface heuristics rather than a recursive rule (see "Does LLM grammatical performance decline with structural complexity?" and "Why do large language models fail at complex linguistic tasks?"). Genuine compositionality should be roughly flat as you recombine and nest; fractured presence falls off a cliff exactly where recombination starts.
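
One way to operationalize "roughly flat versus cliff" is to bucket evaluation results by composition depth and by whether the composition was seen in training, then look at the gap. The sketch below assumes a hypothetical record format of `(depth, is_novel, correct)` tuples, and the numbers are made up purely to show the shape of the computation; they are not results from any cited study.

```python
from collections import defaultdict

def accuracy_by_depth(records):
    """records: iterable of (depth, is_novel, correct) tuples from an eval run."""
    buckets = defaultdict(lambda: [0, 0])  # (depth, is_novel) -> [hits, total]
    for depth, is_novel, correct in records:
        buckets[(depth, is_novel)][0] += int(correct)
        buckets[(depth, is_novel)][1] += 1
    return {key: hits / total for key, (hits, total) in buckets.items()}

def novelty_gap(acc, depth):
    """Seen-composition accuracy minus novel-composition accuracy at one depth."""
    return acc[(depth, False)] - acc[(depth, True)]

# Made-up results, for illustration only: 10 seen and 10 novel items per depth.
records = []
for depth, novel_hits in [(1, 9), (2, 6), (3, 2)]:
    records += [(depth, False, True)] * 10
    records += [(depth, True, True)] * novel_hits
    records += [(depth, True, False)] * (10 - novel_hits)

acc = accuracy_by_depth(records)
gaps = [novelty_gap(acc, d) for d in (1, 2, 3)]
print(gaps)  # a gap that widens with depth is the "cliff" signature
```

A gap that stays near zero across depths would be the flat profile; a gap that opens as soon as compositions become novel or deeper is the pattern the paragraph above describes.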

The stronger, less foolable test is causal rather than behavioral: cut a piece out and see if only the matching function breaks. Pruning and ablation experiments show that networks which truly compose implement subroutines in *isolated* subnetworks (ablate one and only its corresponding function degrades), and that pretraining makes this modular separation more consistent (see "Do neural networks naturally learn modular compositional structure?"). That's the cleanest distinguisher in the corpus: not 'is the feature readable' but 'is the feature *separable and intervenable*.' A fractured model has the features entangled, so ablations smear across functions; a compositional one has them factored, so ablations are surgical.
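
The "surgical versus smeared" distinction can be summarized as a number once you have an ablation matrix. The score below is an illustrative choice, not the metric used in the cited pruning work: `ablation_selectivity` is a hypothetical helper, and both example matrices are invented. It assumes subnetwork i is the one hypothesized to implement function i.

```python
import numpy as np

def ablation_selectivity(drop):
    """drop[i][j]: accuracy lost on function j after ablating subnetwork i.

    Returns, per ablation, the share of total damage that landed on the
    matching function (1.0 = perfectly surgical, 1/n = evenly smeared).
    """
    drop = np.clip(np.asarray(drop, dtype=float), 0.0, None)
    total = drop.sum(axis=1)
    on_target = np.diag(drop)
    # Guard against ablations that caused no damage at all.
    return np.divide(on_target, total, out=np.zeros_like(total), where=total > 0)

# Invented accuracy drops for two functions and two candidate subnetworks.
modular = [[0.90, 0.02], [0.03, 0.85]]    # damage stays on the diagonal
entangled = [[0.40, 0.35], [0.30, 0.45]]  # damage spreads across functions

sel_modular = ablation_selectivity(modular)
sel_entangled = ablation_selectivity(entangled)
print(sel_modular, sel_entangled)
```

Normalizing by row keeps the score comparable across ablations of different severity, which matters because a large subnetwork removed from an entangled model can still produce a big on-target drop.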

There's a useful cross-domain echo here. The same recall-vs-structure gap shows up in verification, where pooled cosine similarity will happily accept a 'structural near-miss' that has all the right tokens in the wrong arrangement; the fix is a verifier that operates on the full token-token interaction map rather than a compressed vector (see "Can verification separate structural near-misses from topical matches?"). That's the same lesson in miniature: presence of the right parts (high recall, decodable features) is a different and weaker thing than correct relational structure, and you only catch the difference by looking at *interactions*, not at compressed summaries.
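
A minimal sketch of why compressed summaries miss the near-miss, under loud assumptions: token embeddings are one-hot vectors, and `aligned_similarity` (the mean of the diagonal of the token-token map) is a crude stand-in for the small Transformer verifier the source describes, not a reimplementation of it. The three-token vocabulary is hypothetical.

```python
import numpy as np

tokens = ["dog", "bites", "man"]
# Assumed toy embeddings: one orthogonal unit vector per token.
vocab = dict(zip(tokens, np.eye(len(tokens))))

def embed(seq):
    """Stack per-token embeddings into a (length, dim) matrix."""
    return np.stack([vocab[t] for t in seq])

def pooled_cosine(A, B):
    """Cosine similarity of mean-pooled sequence vectors (order-blind)."""
    a, b = A.mean(axis=0), B.mean(axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def maxsim(A, B):
    """Late-interaction score: best match per query token, summed."""
    return float((A @ B.T).max(axis=1).sum())

def aligned_similarity(A, B):
    """Mean of the diagonal of the token-token map (position-aware stand-in)."""
    S = A @ B.T
    return float(np.trace(S) / len(S))

query = embed(["dog", "bites", "man"])
near_miss = embed(["man", "bites", "dog"])  # same tokens, wrong arrangement

print(pooled_cosine(query, near_miss))       # order-blind: cannot tell them apart
print(maxsim(query, query), maxsim(query, near_miss))
print(aligned_similarity(query, query), aligned_similarity(query, near_miss))
```

Both order-blind scores treat the reversed sentence as a perfect match; only the measure that looks at where similarity sits in the interaction map separates the two.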

The thing you didn't know you wanted to know: there's a hopeful flip side. Scaling can produce real compositional generalization with no architectural tricks at all, *provided the training distribution covers the combinations* (see "Can neural networks learn compositional skills without symbolic mechanisms?"). So 'fractured feature presence' isn't a permanent verdict on a model; it's often a verdict on its training coverage. The discriminating test (held-out novel compositions plus surgical ablation) is therefore also a diagnostic for *what to feed the model next*, not just a pass/fail stamp.
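
If coverage is the lever, a first diagnostic is simply listing which combinations the training set never shows. The sketch below assumes each training example can be tagged with one value per "slot"; the slot names and the helper `uncovered_combinations` are hypothetical, and real tasks will rarely factor this cleanly.

```python
from itertools import product

def uncovered_combinations(train_examples, slots):
    """Return every slot-value combination absent from the training set.

    slots: dict mapping slot name -> list of possible values.
    train_examples: iterable of tuples, one value per slot, in slot order.
    """
    seen = set(train_examples)
    return [combo for combo in product(*slots.values()) if combo not in seen]

# Hypothetical two-slot task with one combination never shown in training.
slots = {"shape": ["circle", "square"], "color": ["red", "blue"]}
train = [("circle", "red"), ("circle", "blue"), ("square", "red")]

missing = uncovered_combinations(train, slots)
print(missing)  # candidates for held-out tests, or for the next batch of training data
```

The same list serves both purposes named above: it is the pool of novel compositions to test on, and the pool to draw from when extending coverage.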

Sources 7 notes

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Can neural networks learn compositional skills without symbolic mechanisms?

Standard MLPs achieve compositional generalization through data and model scaling alone, without architectural modifications, provided the training distribution sufficiently covers combinations of task modules. Linear decodability of constituents from hidden activations reliably predicts success.

Do transformers actually learn systematic compositional reasoning?

Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.

Does LLM grammatical performance decline with structural complexity?

LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Can verification separate structural near-misses from topical matches?

A two-stage pipeline—pooled-cosine recall followed by a small Transformer verifier operating on token-token similarity maps—reliably rejects structural near-misses that MaxSim-style late interaction cannot. The verifier succeeds because it operates on full token interaction patterns rather than compressed vectors.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher testing whether a curated library's claims about compositionality testing have aged or been superseded. The precise question: what behavioral or mechanistic test reliably separates genuine compositional reasoning from fractured feature presence—and does that test still hold as models scale?

What a curated library found — and when (dated claims, not current truth): These findings span 2023–2026.
• Linear decodability of constituent features predicts compositional success only when training data already covers the combinations; presence alone is a fractured-presence signature, not compositionality (2023–2024).
• Transformers exhibiting compositional appearance often memorize computation subgraphs and stitch them by pattern-match; novel compositions fail sharply with compounding errors, whereas genuine compositionality should degrade only gradually under recombination (2023–2025).
• Grammatical competence degrades predictably with syntactic depth and embedding; this failure signature suggests surface heuristics rather than recursive rules (2025).
• Causal/ablation tests—surgical pruning of isolated subnetworks—discriminate compositionality from entanglement: modular networks show function-specific ablation effects, fractured ones smear (2023).
• Scaling with coverage of training combinations *can* produce real compositional generalization without architectural tricks (2025).

Anchor papers (verify; mind their dates):
• arXiv:2301.10884 (2023-01): Break It Down—neural network structural compositionality via ablation.
• arXiv:2305.18654 (2023-05): Faith and Fate—Transformer limits on compositionality.
• arXiv:2507.07207 (2025-07): Scaling can lead to compositional generalization.
• arXiv:2603.29025 (2026-03): The Model Says Walk—surface heuristics overriding implicit constraints.

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding above—especially linear decodability as a red herring, subgraph-matching as a failure mode, and depth-dependent degradation—determine whether recent scaling, instruction-tuning, mechanistic-interp tooling (SAEs, circuit analysis), multi-step reasoning architectures (chain-of-thought, process rewards), or evaluation regimes have since relaxed or overturned it. Flag which constraints *still hold* and which have been dissolved; cite the mechanism.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. If recent models (o1, Claude 3.5, Llama 3.2) exhibit compositional scaling without the predicted failures, or if new ablation/probing methods reveal structure the old tests missed, name and ground it.
(3) Propose 2 research questions that assume the testing regime itself has shifted—e.g., does genuine compositionality exist at all scales, or do we only see it under specific curricula? Does the modular/entangled distinction even apply to mechanistically-diverse architectures?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
