How do scaling and training data enable compositional behavior without symbolic mechanisms?

This explores how neural networks pull off compositional behavior — combining known parts into new wholes — without explicit symbolic rules, relying instead on scale and what the training data covers. The corpus offers a surprisingly clean answer: composition can emerge from scaling alone, but only inside the territory the training distribution already maps, and the mechanism underneath is more like pattern-matching than rule-following.

The optimistic case is direct. Plain MLPs reach compositional generalization through data and model scaling with no architectural tricks, *provided* the training data sufficiently covers the combinations of task modules, and success can be predicted by checking whether the constituent parts are linearly decodable from the hidden activations (see "Can neural networks learn compositional skills without symbolic mechanisms?"). This isn't accidental: networks tend to decompose compositional tasks into isolated modular subnetworks on their own, and pretraining makes that modular structure more consistent and reliable (see "Do neural networks naturally learn modular compositional structure?"). So the substrate for composition self-organizes under gradient descent, and modern systems now demonstrably handle complex syntax, logical chains, and original code, directly challenging the old Fodor-Pylyshyn claim that connectionism can't compose at all (see "Can neural networks actually achieve compositional generalization?").
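To make "linearly decodable from the hidden activations" concrete, here is a minimal sketch of a linear probe. It is not the method of any cited note: the function name `linear_probe_accuracy`, the ridge-regression fit, and the synthetic activations are all assumptions chosen for illustration. In practice `hidden` would hold activations recorded from a trained network and `labels` would identify which constituent (for example, which task module) each input contains.

```python
import numpy as np

def linear_probe_accuracy(hidden, labels, train_frac=0.8, ridge=1e-3, seed=0):
    """Fit an affine least-squares probe on one-hot labels; return held-out accuracy.

    Hypothetical helper: the closed-form ridge fit is an illustrative choice,
    not a procedure taken from the source notes.
    """
    rng = np.random.default_rng(seed)
    n = hidden.shape[0]
    order = rng.permutation(n)
    cut = int(train_frac * n)
    train, test = order[:cut], order[cut:]

    # Append a constant column so the probe can learn a bias term.
    X = np.hstack([hidden, np.ones((n, 1))])
    num_classes = int(labels.max()) + 1
    Y = np.eye(num_classes)[labels]

    # Closed-form ridge regression: W = (X^T X + lambda * I)^-1 X^T Y
    A = X[train].T @ X[train] + ridge * np.eye(X.shape[1])
    W = np.linalg.solve(A, X[train].T @ Y[train])

    # Predict the class whose regression output is largest.
    pred = (X[test] @ W).argmax(axis=1)
    return float((pred == labels[test]).mean())

# Synthetic stand-in for hidden activations (assumption): each constituent
# shifts the activation along its own random direction, plus small noise.
rng = np.random.default_rng(0)
n, d, k = 600, 32, 4
labels = rng.integers(0, k, size=n)
directions = rng.normal(size=(k, d))
hidden = directions[labels] + 0.1 * rng.normal(size=(n, d))

probe_acc = linear_probe_accuracy(hidden, labels)
# Control: shuffled labels should be decodable only at roughly chance level.
shuffled_acc = linear_probe_accuracy(hidden, rng.permutation(labels))
print(f"probe accuracy: {probe_acc:.2f}, shuffled-label control: {shuffled_acc:.2f}")
```

The shuffled-label control matters: high probe accuracy only means something when the same probe fails on labels that carry no information about the activations.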

But the same corpus undercuts the word "compositional" itself. When researchers look closely at transformers, the apparent reasoning reduces to *linearized subgraph matching*: models memorize computation subgraphs from training and stitch them together, which works in-distribution but fails drastically on genuinely novel combinations, with errors compounding step by step (see "Do transformers actually learn systematic compositional reasoning?"). Strip the familiar semantics out of a reasoning task and performance collapses even when the correct rules are handed to the model in-context, which is evidence that LLMs lean on semantic associations and parametric commonsense, not formal symbol manipulation (see "Do large language models reason symbolically or semantically?"). In other words, scaling buys you composition that is real but bounded by the training distribution's semantics, not the open-ended systematicity a symbolic system would give.
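The "errors compounding step by step" point can be illustrated with a back-of-the-envelope calculation. The sketch below assumes every reasoning step succeeds independently with the same probability, which is a simplification introduced here and not a model taken from the cited note; the function name `chain_accuracy` and the 0.95 per-step figure are likewise hypothetical.

```python
def chain_accuracy(per_step_accuracy: float, num_steps: int) -> float:
    """Probability that every step in a chain is correct.

    Assumes steps fail independently with identical probability
    (an illustrative simplification, not a claim from the source).
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must lie in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_accuracy ** num_steps

# A hypothetical model that gets each individual step right 95% of the time.
for steps in (1, 5, 10, 20):
    print(f"{steps:>2} steps -> {chain_accuracy(0.95, steps):.3f}")
```

Under these assumptions, end-to-end accuracy decays geometrically with chain length, so a model that looks reliable on single steps can still fail most long compositions.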

Here's the doorway worth opening: the thing that *looks* like it should guarantee compositionality, linear decodability of the parts, turns out to be a treacherous signal. A model can carry all the linearly decodable features a task needs while its internal organization is fundamentally fractured, leaving it brittle to perturbation and distribution shift in ways standard accuracy metrics never reveal (see "Can models be smart without organized internal structure?"). So the same probe that predicts compositional success in one paper masks broken structure in another. Composition without symbols is genuinely emergent, but "it generalizes on the benchmark" and "it composes systematically" are not the same claim.

If you want the deeper why-it-works-at-all framing, two threads reframe the question. One argues LLMs operationalize Saussure's *langue*: they compress purely relational structure from text, showing that fluent generative behavior needs no external referents or grounding at all (see "Can language models learn meaning without engaging the world?"). The other shows a single finite transformer is in principle Turing-complete given the right prompt, so the capacity for arbitrary composition exists, even though ordinary training rarely coaxes a model into actually implementing programs that way (see "Can a single transformer become universally programmable through prompts?"). The capability is latent in the architecture; whether it surfaces is a question of data coverage and training, not symbols.

Sources (8 notes)

Can neural networks learn compositional skills without symbolic mechanisms?

Standard MLPs achieve compositional generalization through data and model scaling alone, without architectural modifications, provided the training distribution sufficiently covers combinations of task modules. Linear decodability of constituents from hidden activations reliably predicts success.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Can neural networks actually achieve compositional generalization?

DNNs and LLMs now demonstrate sophisticated compositional processing—complex syntax, logical reasoning chains, original code generation—challenging the classical Fodor-Pylyshyn argument that connectionism cannot support compositionality. The debate shifts from whether neural nets can compose to how they do so without explicit constituent structure.

Do transformers actually learn systematic compositional reasoning?

Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Can language models learn meaning without engaging the world?

Research shows LLMs learn culturally situated discourse patterns by compressing relational structure from text, demonstrating that fluent language generation requires no external referents or embodied grounding.

Can a single transformer become universally programmable through prompts?

Research proves a single finite-size transformer exists that can compute any computable function given the right prompt, achieving complexity bounds nearly matching unbounded models. However, standard training rarely produces models that learn to implement arbitrary programs this way.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains open: How do neural networks achieve compositional behavior—combining known parts into new wholes—without explicit symbolic rules, relying instead on scale and training data coverage?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2025; treat these as perishable constraints to be re-tested:

• Compositional generalization emerges from scaling alone when training data sufficiently covers task-module combinations; networks self-organize into modular subnetworks, and linear decodability of parts predicts success (2023).
• Transformers' apparent reasoning reduces to linearized subgraph matching—memorizing computation subgraphs from training and stitching them together; this works in-distribution but fails drastically on genuinely novel combinations (2023).
• LLMs are in-context semantic reasoners, not symbolic reasoners; they lean on semantic associations and parametric commonsense, not formal symbol manipulation; stripping familiar semantics causes performance collapse even with correct rules provided (2023).
• Linear decodability of features masks fractured internal organization and brittleness to distribution shift; identical accuracy metrics can hide fundamentally different internal representations (2024).
• A single finite transformer is Turing-complete in principle given the right prompt; composition capacity is latent but rarely surfaced by ordinary training (2024–2025).

Anchor papers (verify; mind their dates):
• arXiv:2305.18654 (Faith and Fate: Limits of Transformers on Compositionality, 2023-05)
• arXiv:2305.14825 (Large Language Models are In-Context Semantic Reasoners rather than Symbolic Reasoners, 2023-05)
• arXiv:2411.01992 (Ask, and it shall be given: Turing completeness of prompting, 2024-11)
• arXiv:2507.07207 (Scaling can lead to compositional generalization, 2025-07)

Your task:
(1) RE-TEST EACH CONSTRAINT. For each claim above, assess whether newer models (post-2025-08), improved training methods, new evaluation harnesses, or orchestration (e.g., multi-agent reasoning, retrieval-augmented generation, adaptive prompting) have RELAXED or OVERTURNED the limitation. Separate the durable question (does composition truly emerge from data + scale?) from the perishable boundary (in-distribution brittleness, reliance on semantics). Cite what resolved each constraint, or report plainly where it still holds.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months that questions the library's framing—e.g., does recent evidence suggest composition *does* require implicit symbolic structure, or that scale alone is sufficient under new conditions?
(3) Propose two research questions that assume the regime may have shifted:
   (a) One assuming composition barriers have been relaxed (what enables distribution shift now?).
   (b) One assuming they have hardened (what architectural or data-curation invariant makes composition robust?).

Cite arXiv IDs; flag anything you cannot ground in a real paper.
