Where do neural networks still fail at compositional generalization despite scaling?

This explores the specific places where bigger models and more data still don't deliver true compositional generalization — combining known pieces in new ways — and why scaling papers over the gap rather than closing it.

The corpus has a sharp answer: scaling buys coverage, not systematicity. The most direct finding is that transformers often succeed by memorizing the computation subgraphs they saw in training and stitching them together, which one analysis calls linearized subgraph matching, rather than by learning the underlying rule. On in-distribution combinations they look fluent; on genuinely novel compositions they fail hard, with errors compounding step by step across a reasoning chain (Do transformers actually learn systematic compositional reasoning?). So the failure isn't random: it is concentrated exactly where a new combination falls outside the coverage of the training distribution.
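The compounding of errors across a reasoning chain can be made concrete with a back-of-envelope sketch. It assumes every step succeeds independently with the same probability, which is a simplification introduced here and not a result from the cited analysis; `chain_accuracy` and the 0.95 figure are hypothetical.

```python
def chain_accuracy(per_step_accuracy: float, num_steps: int) -> float:
    """Probability that every step in a chain is correct.

    Assumes independent steps with identical accuracy, which is a
    simplifying assumption made for illustration only.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_accuracy ** num_steps


if __name__ == "__main__":
    # 0.95 is a hypothetical per-step accuracy, not a measured value.
    for steps in (1, 5, 10, 20):
        print(steps, round(chain_accuracy(0.95, steps), 3))
```

Under these assumptions, end-to-end accuracy decays geometrically with chain length, so a model that looks reliable on short chains can still fail on most long ones.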

That reframes what scaling actually does. One line of work shows plain MLPs *can* generalize compositionally with enough data and size, but only when the training distribution sufficiently covers the combinations of task pieces (Can neural networks learn compositional skills without symbolic mechanisms?). Read alongside the subgraph-matching result, these agree more than they disagree: scaling works by densely tiling the space of combinations, so 'novel' compositions become rare. Push to combinations the data never spanned and the gap reopens. The deeper diagnosis is the binding problem: networks struggle to dynamically bind distributed features into reusable structures, to keep entities separate, and to recombine learned parts in new arrangements (Why do neural networks fail at compositional generalization?). Scaling can let compositional representations *emerge*, but it doesn't install the binding mechanism that would make recombination reliable by construction.
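One way to picture "coverage" is to treat a task as a pair of primitives and ask which pairs the training set contains. The sketch below is illustrative only: the two-slot task structure, the function names, and the toy primitives are assumptions made here, not the setup used in the cited work.

```python
from itertools import product


def combination_coverage(train_pairs, primitives_a, primitives_b):
    """Fraction of all (a, b) combinations that appear in the training set."""
    all_pairs = set(product(primitives_a, primitives_b))
    seen = set(train_pairs) & all_pairs
    return len(seen) / len(all_pairs)


def held_out_compositions(train_pairs, primitives_a, primitives_b):
    """Combinations whose parts were each seen in training, but never together."""
    seen = set(train_pairs)
    seen_a = {a for a, _ in seen}
    seen_b = {b for _, b in seen}
    return sorted(
        (a, b)
        for a, b in product(primitives_a, primitives_b)
        if a in seen_a and b in seen_b and (a, b) not in seen
    )


if __name__ == "__main__":
    # Hypothetical primitives and training pairs, chosen for illustration.
    verbs = ["jump", "walk", "look"]
    modifiers = ["twice", "left", "around"]
    train = [
        ("jump", "twice"), ("walk", "twice"), ("walk", "left"),
        ("look", "left"), ("look", "around"),
    ]
    print(combination_coverage(train, verbs, modifiers))
    print(held_out_compositions(train, verbs, modifiers))
```

The held-out list is the interesting test set: every part is familiar, only the pairing is new. Adding data raises the coverage fraction and shrinks that list, which is the sense in which scaling makes novel compositions rare without making them safe.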

The optimistic counterweight is real and worth holding: modern networks do exhibit genuine compositional behavior in complex syntax, multi-step logic, and original code, which retires the old claim that connectionism simply can't compose (Can neural networks actually achieve compositional generalization?). Networks even self-organize: pruning reveals that they decompose tasks into isolated modular subnetworks, with pretraining making that modularity more consistent (Do neural networks naturally learn modular compositional structure?). The honest synthesis is that the question has shifted from *whether* they compose to *how robustly*, and the robustness still tracks coverage, not principle.

Here's the part you might not expect: identical performance can hide broken internals. Networks trained by gradient descent can reproduce outputs perfectly while carrying fractured, entangled representations, internal structure so tangled that it can't transfer to new contexts or recombine creatively, unlike cleaner evolved representations (Can identical outputs hide broken internal representations?). This is the mechanism beneath the behavioral failure: a model can ace the benchmark and still lack the clean, factored parts that compositional generalization requires. There is also a predictive lens for *where* it will fail: treating LLMs as autoregressive probability machines correctly forecasts that logically trivial tasks become hard when the target is low-probability, as in counting letters or reversing the alphabet (Can we predict where language models will fail?).
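The low-probability-target effect can be illustrated with a deliberately tiny stand-in for an autoregressive model. The sketch below trains a character bigram model on a toy corpus and scores the alphabet forwards and backwards. Everything here (the bigram model, the corpus, the add-one smoothing, the function names) is an assumption chosen for illustration and is far simpler than an LLM.

```python
import math
import string
from collections import Counter


def train_bigram(corpus: str):
    """Count character bigrams and how often each character acts as left context."""
    pair_counts = Counter(zip(corpus, corpus[1:]))
    context_counts = Counter(corpus[:-1])
    return pair_counts, context_counts


def sequence_log_prob(text: str, model, vocab_size: int) -> float:
    """Sum of log P(next char | previous char) with add-one smoothing.

    The first character is treated as given, so only transitions are scored.
    """
    pair_counts, context_counts = model
    total = 0.0
    for prev, nxt in zip(text, text[1:]):
        numerator = pair_counts[(prev, nxt)] + 1
        denominator = context_counts[prev] + vocab_size
        total += math.log(numerator / denominator)
    return total


if __name__ == "__main__":
    # Toy corpus: the alphabet in its usual order, repeated.
    corpus = (string.ascii_lowercase + " ") * 50
    model = train_bigram(corpus)
    vocab_size = len(set(corpus))
    forward = "abcdefg"
    backward = forward[::-1]
    print("forward ", sequence_log_prob(forward, model, vocab_size))
    print("backward", sequence_log_prob(backward, model, vocab_size))
```

Both strings are equally simple to specify, yet the reversed one receives a much lower score because its transitions never occurred in the corpus. That gap between logical simplicity and model probability is what the predictive lens exploits.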

The most pointed challenge to scaling-as-the-answer: a 7M-parameter two-layer network that *recurses on its own latent reasoning state* beats DeepSeek R1, o3-mini, and Gemini 2.5 Pro on ARC-AGI puzzles, which are abstraction-and-composition benchmarks, using 0.01% of their parameters (Can tiny recursive networks outperform massive language models?). If recursion on latent state, not scale, drives the generalization gain, then the places scaling still fails may be precisely the places where the missing ingredient is architectural (a mechanism for reusing structure) rather than more parameters and more data.
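The general idea of recursing on a latent state with shared weights can be sketched in a few lines. This is not the architecture from the cited work: the update rule, the untrained random weights, and all names below are hypothetical, and the only point is that computation depth can grow while the parameter count stays fixed.

```python
import math
import random


def make_weights(dim: int, seed: int = 0):
    """Small random square matrix; a stand-in for a learned layer."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim)
    return [[rng.uniform(-scale, scale) for _ in range(dim)] for _ in range(dim)]


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def recurse_latent(x, w_input, w_state, num_steps: int):
    """Apply the shared update z <- tanh(W_state z + W_input x) num_steps times."""
    z = [0.0] * len(x)
    from_input = matvec(w_input, x)  # the input term is the same at every step
    for _ in range(num_steps):
        from_state = matvec(w_state, z)
        z = [math.tanh(s + i) for s, i in zip(from_state, from_input)]
    return z


def parameter_count(*matrices) -> int:
    """Total number of scalar weights across the given matrices."""
    return sum(len(row) for matrix in matrices for row in matrix)


if __name__ == "__main__":
    dim = 4
    w_input = make_weights(dim, seed=1)
    w_state = make_weights(dim, seed=2)
    x = [1.0, 0.0, -1.0, 0.5]  # hypothetical input embedding
    for steps in (1, 4, 16):
        z = recurse_latent(x, w_input, w_state, steps)
        print(steps, parameter_count(w_input, w_state), [round(v, 3) for v in z])
```

The same two matrices are reused at every step, so running 16 steps costs more compute than running one but adds no parameters. That is the sense in which recursion reuses structure instead of adding capacity.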

Sources (8 notes)

Do transformers actually learn systematic compositional reasoning?

Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.

Can neural networks learn compositional skills without symbolic mechanisms?

Standard MLPs achieve compositional generalization through data and model scaling alone, without architectural modifications, provided the training distribution sufficiently covers combinations of task modules. Linear decodability of constituents from hidden activations reliably predicts success.

Why do neural networks fail at compositional generalization?

Greff et al. argue that neural networks cannot dynamically bind distributed information into compositional structures due to three failures: segregating entities from inputs, maintaining representational separation, and reusing learned structure in novel combinations. Scaling can partially overcome this by enabling compositional representations to emerge.

Can neural networks actually achieve compositional generalization?

DNNs and LLMs now demonstrate sophisticated compositional processing—complex syntax, logical reasoning chains, original code generation—challenging the classical Fodor-Pylyshyn argument that connectionism cannot support compositionality. The debate shifts from whether neural nets can compose to how they do so without explicit constituent structure.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Can identical outputs hide broken internal representations?

Networks trained with SGD reproduce outputs perfectly while having radically different internal structure than evolved networks, with weight perturbations revealing fractured, entangled representations that prevent transfer to novel contexts or creative recombination.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.

Can tiny recursive networks outperform massive language models?

A single 7M-parameter two-layer network recursing on its latent reasoning state achieves 45% on ARC-AGI-1 and 8% on ARC-AGI-2, beating DeepSeek R1, o3-mini, and Gemini 2.5 Pro with 0.01% of their parameters. Recursion on latent state, not scale or hierarchy, drives the generalization gain.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a compositional generalization researcher re-testing constraints on neural network capability. The precise question: Where do neural networks still fail at compositional generalization despite scaling?

What a curated library found — and when (dated claims, not current truth):
Findings span 2020–2026; treat as perishable:
- Transformers succeed by memorizing and stitching linearized subgraph patterns from training, not learning underlying rules; failures concentrate exactly on out-of-distribution combinations (~2023).
- Plain MLPs *can* generalize compositionally with sufficient data and model size, but only when training distribution densely covers task-piece combinations (~2023–2024).
- The binding problem — networks' struggle to dynamically bind distributed features, keep entities separate, and reliably recombine learned parts — persists even at scale; scaling makes novel compositions rare rather than reusable (~2020–2024).
- Identical behavioral performance can mask fractured, entangled internal representations that don't transfer or recombine, unlike clean evolved or hand-designed factored representations (~2025).
- A 7M-parameter two-layer network recursing on its latent reasoning state outperforms DeepSeek R1, o3-mini, and Gemini 2.5 Pro on ARC-AGI benchmarks using 0.01% of their parameters, suggesting architecture (reusable structure) may matter more than scale for compositional robustness (~2025).

Anchor papers (verify; mind their dates):
- arXiv:2012.05208 (2020): On the Binding Problem in Artificial Neural Networks
- arXiv:2305.18654 (2023): Faith and Fate: Limits of Transformers on Compositionality
- arXiv:2510.04871 (2025): Less is More: Recursive Reasoning with Tiny Networks
- arXiv:2505.11581 (2025): Fractured Entangled Representations

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding above, judge whether newer models, methods (attention variants, recursion, latent-state reuse), training (curriculum, synthetic composition), tooling (retrieval-augmented reasoning), or evaluation have since relaxed or overturned it. Separate the durable question (do networks reliably compose on truly novel combinations?) from perishable limitations (e.g., does scaling alone install compositionality?). Cite what resolved each constraint; plainly state where binding and OOD composition still appear to hold.

(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Does any 2026+ paper show scaling *does* solve compositional generalization under realistic conditions, or reveal a different architectural or training axis that dissolves the problem entirely?

(3) Propose 2 research questions that ASSUME the regime may have moved: (a) If recursion on latent reasoning state is the key, not parameter count, what minimal recursion depth and state dimensionality suffice for AGI-level compositional generalization, and can that be achieved with fewer than a billion parameters? (b) Can latent-thought posterior inference or test-time memorization + recursion replace dense data coverage as the path to systematic recombination?

Closing guardrail: Cite arXiv IDs; flag anything you cannot ground in a real paper.
