Can neural networks learn compositional skills without symbolic mechanisms?
Do neural networks need explicit symbolic architecture to compose learned concepts, or can scaling alone enable compositional generalization? This asks whether compositionality is an architectural feature or an emergent property of scale.
The short answer: scaling data and model size leads to compositional generalization in standard MLPs, without architectural modifications, but with one critical condition: the training distribution must sufficiently cover the task space. Individual modules need not appear in isolation, but they must appear in enough combinations that the model can extract them.
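To make the coverage condition concrete, here is a minimal Python sketch under stated assumptions. Tasks are modelled as pairs of module indices, which is a deliberate simplification and not the setup of the underlying paper. The names `split_combinations` and `module_coverage`, and the threshold `min_combos`, are hypothetical. No single-module task appears in training, yet every module shows up in several training combinations while some pairs are held out entirely.

```python
from itertools import combinations

def split_combinations(num_modules, combo_size, holdout_every=3):
    """Hold out every `holdout_every`-th combination as an unseen test task."""
    all_combos = list(combinations(range(num_modules), combo_size))
    test = all_combos[::holdout_every]
    train = [c for c in all_combos if c not in test]
    return train, test

def module_coverage(train_combos, num_modules):
    """Count how many distinct training combinations contain each module."""
    counts = [0] * num_modules
    for combo in train_combos:
        for module in combo:
            counts[module] += 1
    return counts

NUM_MODULES = 6
train_tasks, test_tasks = split_combinations(NUM_MODULES, combo_size=2)
coverage = module_coverage(train_tasks, NUM_MODULES)
min_combos = 2  # illustrative threshold, not a value from the source

print("train tasks:", len(train_tasks), "held-out tasks:", len(test_tasks))
print("training combinations per module:", coverage)
print("coverage condition met:", all(c >= min_combos for c in coverage))
```

The held-out pairs are the "novel compositions" in this toy setting: each of their modules was seen during training, but never together.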
Three key contributions:
Proof of representational capacity. MLPs can approximate a general class of compositional task families (hyperteachers) to arbitrary precision using only a linear number of neurons relative to the number of task modules. Memorizing all tasks requires exponential capacity; the compositional solution is fundamentally more efficient.
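The gap between the two solutions can be illustrated with simple counting. The sketch below is an assumption-laden toy, not the hyperteacher construction itself: it treats a task as any non-empty subset of modules, and `units_per_module` is an arbitrary constant chosen only to show linear growth.

```python
def memorization_units(num_modules):
    """One stored entry per task, where a task is any non-empty subset of modules."""
    return 2 ** num_modules - 1

def compositional_units(num_modules, units_per_module=4):
    """Units needed if each module is represented once (constant is arbitrary)."""
    return units_per_module * num_modules

for n in (4, 8, 16, 32):
    print(
        f"modules={n:>2}  "
        f"compositional={compositional_units(n):>4}  "
        f"memorized={memorization_units(n):>13,}"
    )
```

Under these assumptions the memorizing solution overtakes the compositional one almost immediately and then grows without any comparable bound.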
Linear decodability as a compositionality signature. When a network generalizes compositionally, the task constituents can be linearly decoded from its hidden activations. This metric predicts failures in text-to-image models: when concepts cannot be linearly decoded, the model fails to compose them.
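A linear probe is one way to test this signature. The following is a minimal sketch on synthetic data, not the procedure used in the source: the hidden activations, the `DIRECTIONS` matrix, and the "entangled" encoding are all hypothetical, and a plain perceptron stands in for whatever probe one would actually use. In the first setting each module adds its own direction to the hidden state; in the second only the parity of the module bits is encoded, so individual modules are not linearly recoverable.

```python
import random

# Hypothetical module directions: rows of a 4x4 Hadamard matrix, so every
# hidden unit mixes all three modules (a distributed, non-axis-aligned code).
DIRECTIONS = [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1]]

def make_activations(rng, n, entangled, noise=0.05):
    """Return (hidden activations, module bits) for n synthetic samples."""
    acts, bits = [], []
    for _ in range(n):
        z = [rng.randint(0, 1) for _ in DIRECTIONS]
        # Entangled case: only the parity of the bits reaches the hidden state.
        coeffs = [sum(z) % 2, 0, 0] if entangled else z
        h = [
            sum(c * d[i] for c, d in zip(coeffs, DIRECTIONS)) + rng.gauss(0, noise)
            for i in range(len(DIRECTIONS[0]))
        ]
        acts.append(h)
        bits.append(z)
    return acts, bits

def score(weights, x):
    """Linear score; the last weight is the bias."""
    return sum(w * v for w, v in zip(weights, x)) + weights[-1]

def fit_probe(acts, labels, epochs=200):
    """Train a perceptron probe; stops early once an epoch has no mistakes."""
    weights = [0.0] * (len(acts[0]) + 1)
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(acts, labels):
            target = 1 if y else -1
            if target * score(weights, x) <= 0:
                for i, v in enumerate(x):
                    weights[i] += target * v
                weights[-1] += target
                mistakes += 1
        if mistakes == 0:
            break
    return weights

def accuracy(weights, acts, labels):
    hits = sum((score(weights, x) > 0) == bool(y) for x, y in zip(acts, labels))
    return hits / len(labels)

def probe_module(module, entangled, seed=0):
    """Fit a probe for one module bit and report held-out accuracy."""
    rng = random.Random(seed)
    train_x, train_z = make_activations(rng, 300, entangled)
    test_x, test_z = make_activations(rng, 200, entangled)
    weights = fit_probe(train_x, [z[module] for z in train_z])
    return accuracy(weights, test_x, [z[module] for z in test_z])

decodable_acc = probe_module(0, entangled=False)
entangled_acc = probe_module(0, entangled=True)
print("probe accuracy, linearly encoded modules:", decodable_acc)
print("probe accuracy, parity-only encoding:", entangled_acc)
```

In this toy setting the probe should score close to 1.0 on the linearly encoded activations and close to chance on the parity-only ones, which mirrors the claim that a failed linear decode signals a failure to compose.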
Scaling limits. Despite progress, performance deteriorates as the number of composed concepts grows. The multiplicative nature of compositionality means even scaled models hit composition limits: the number of possible combinations grows exponentially with the number of composed concepts and eventually exceeds what any finite training distribution can cover.
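The arithmetic behind this limit is easy to sketch. The numbers below are hypothetical (a fixed budget of one million training examples and ten choices per composed concept); the function only computes an upper bound on how much of the combinatorial space such a budget could touch, assuming every training example is a distinct composition.

```python
def coverage_upper_bound(num_composed, choices_per_slot, train_examples):
    """Best-case fraction of all compositions that the training set can contain."""
    total = choices_per_slot ** num_composed
    return min(1.0, train_examples / total)

TRAIN_EXAMPLES = 1_000_000  # hypothetical fixed data budget
CHOICES_PER_SLOT = 10       # hypothetical number of options per composed concept

for k in range(1, 11):
    bound = coverage_upper_bound(k, CHOICES_PER_SLOT, TRAIN_EXAMPLES)
    print(f"composed concepts={k:>2}  coverage <= {bound:.6f}")
```

With these assumed numbers the space is fully coverable up to six composed concepts, and every additional concept after that divides the best-case coverage by ten.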
This directly addresses the related note "Why do neural networks fail at compositional generalization?": the binding problem is solvable through scaling when training covers the task space, but remains unsolved for arbitrary novel compositions. The failure mode is not an inability to learn compositional structure but insufficient exposure to the combinatorial space.
The practical implication for LLMs: compositional generalization in language (novel sentence structures, new concept combinations) should improve with scale — but the tails of the combinatorial space will always remain sparsely covered, predicting continued failures on truly novel compositions.
SKiC prompting: unlocking compositional generalization with few examples. Skills-in-Context (SKiC) prompting shows that compositional generalization can be unlocked with as few as two exemplars when the prompt structure explicitly grounds each reasoning step on foundational skills. The SKiC prompt has three blocks: (1) skills with instructions, (2) compositional examples showing how to combine skills, and (3) the problem. This one-stage approach achieves near-perfect systematic generalization and is more general than decomposition-based methods, because it handles complex computation graphs that cannot be linearly decomposed. SKiC also unlocks "latent potential": pre-existing internal skills from pretraining that standard prompting fails to activate. This supports the training-coverage condition from a different angle: the model has compositional capacity from pretraining, but prompting must explicitly invoke the skill-grounding structure to surface it. Source: Prompts Prompting.
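The three-block layout can be sketched as a small prompt builder. This is only an illustration of the structure described above: the function name `build_skic_prompt`, the block labels, and the toy last-letter skills are assumptions, not the wording used in the SKiC paper.

```python
def build_skic_prompt(skills, examples, problem):
    """Assemble a three-block prompt: skills, compositional examples, problem."""
    skill_lines = [f"Skill <{name}>: {instruction}" for name, instruction in skills]
    example_lines = [f"Example: {q}\nAnswer: {a}" for q, a in examples]
    blocks = [
        "Skills:\n" + "\n".join(skill_lines),
        "Compositional examples:\n" + "\n\n".join(example_lines),
        "Problem:\n" + problem + "\nAnswer:",
    ]
    return "\n\n".join(blocks)

# Hypothetical toy skills and two exemplars that ground each step on a skill.
skills = [
    ("last_letter", "Return the last letter of a word."),
    ("concatenate", "Join letters in order into one string."),
]
examples = [
    (
        "Concatenate the last letters of 'apple, banana'.",
        "Using <last_letter>: 'apple' -> 'e', 'banana' -> 'a'. "
        "Using <concatenate>: 'e' + 'a' = 'ea'.",
    ),
    (
        "Concatenate the last letters of 'cat, dog'.",
        "Using <last_letter>: 'cat' -> 't', 'dog' -> 'g'. "
        "Using <concatenate>: 't' + 'g' = 'tg'.",
    ),
]
prompt = build_skic_prompt(
    skills, examples, "Concatenate the last letters of 'red, blue, green'."
)
print(prompt)
```

The point of the sketch is that the whole prompt is built in one stage: the exemplars name the skill used at every step, and the final problem is longer than any exemplar.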
Inquiring lines that use this note as a source (33)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Can neural networks represent symbolic structures without explicit mechanisms?
- Why do human-designed neural architectures eventually get replaced by learned ones?
- Why are polysemantic features concentrated in early neural network layers?
- What makes linear decodability a reliable signal of compositionality?
- Does scaling model size solve compositional generalization problems?
- Can symbolic mechanisms improve transformer compositional abilities?
- Why do scaling laws fail to predict optimal architectures at small parameter counts?
- Does compositional generalization emerge suddenly or improve smoothly with scale?
- Can we detect and measure circuit formation before generalization emerges?
- Does scaling data automatically produce compositional reasoning or just better feature encoding?
- What test distinguishes genuine compositionality from fractured feature presence?
- What makes recursive structure different from other forms of compositional generalization?
- Do substitute networks converge differently than complement networks?
- Can scaling alone create compositional generalization without explicit binding mechanisms?
- How do neural networks decompose complex tasks into modular subnetworks?
- Can granular function calling tasks learn composition from graph-sampled data?
- What other behavioral properties exist as linear directions in activation space?
- Can sub-task handlers be swapped between neural and symbolic systems?
- Can geometric structure in representations exist without supporting functional mechanisms?
- What role does query-level exposure play in enabling compositional generalization?
- Why does scaling data and model size improve compositional generalization?
- How do neural networks decompose tasks into modular subnetworks that transfer?
- Which hyperparameter theories best explain universal behaviors across neural networks?
- How do classical mechanics and statistical mechanics provide methodological templates for learning theory?
- How do ablation studies reveal function without representational characterization?
- Does sparsity enforce compositional structure or merely amplify existing modularity?
- Why does gradient descent discover compositional structure without explicit pressure?
- What architectural alternatives can capture compositional structure beyond pooled cosine?
- How can neural networks be interpretable by design rather than post-hoc?
- What makes recurrent depth enable compositional generalization across tasks?
- What makes a feature abstract versus concrete in neural network activations?
- How does scaling and training data enable compositional behavior without symbolic mechanisms?
- Where do neural networks still fail at compositional generalization despite scaling?
Related concepts in this collection (6)
- Why do neural networks fail at compositional generalization?
Exploring whether the binding problem from neuroscience explains neural networks' inability to systematically generalize. The binding problem has three aspects—segregation, representation, and composition—each creating distinct failure modes in how networks handle structured information.
binding failure is solvable through scaling but only with sufficient training coverage; explains both successes and persistent failures
- Do foundation models learn world models or task-specific shortcuts?
When transformer models predict sequences accurately, are they building genuine world models that capture underlying physics and logic? Or are they exploiting narrow patterns that fail under distribution shift?
FER tension: are scaled compositions genuine generalizations or scaled heuristics?
- Can identical outputs hide broken internal representations?
Can neural networks produce correct outputs while having fundamentally fractured internal structure that prevents generalization and creativity? This challenges our assumptions about what performance benchmarks actually measure.
tension: scaling may produce compositionality in outputs while FER persists in representations
- How do transformers learn to reason across multiple steps?
Does multi-hop reasoning in transformers emerge through distinct learning phases, and what geometric patterns in hidden representations explain when reasoning succeeds or fails?
mechanistic detail for the training-coverage condition: second-hop generalization requires query-level compositional exposure, confirming that compositional generalization depends on the training distribution covering the specific compositional structure, not just individual components
- Can agents learn new skills without forgetting old ones?
Explores whether externalized skill libraries—storing learned behaviors as retrievable code rather than parameter updates—can solve the catastrophic forgetting problem that plagues continual learning systems.
VOYAGER's skill library is an external implementation of compositional generalization: complex skills are synthesized from primitives, achieving the efficient linear-scaling solution rather than exponential memorization; the ever-growing library progressively covers the combinatorial task space that the training-coverage condition requires
- Can language help agents imagine goals they've never seen?
How might compositional language enable artificial agents to target outcomes beyond their training experience? This matters because it could unlock open-ended exploration without hand-coded reward functions.
IMAGINE leverages the compositionality that this note documents: familiar words recombine to describe unfamiliar outcomes, enabling agents to target goals outside their training distribution; this is compositional generalization applied to goal specification rather than task execution
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Scaling can lead to compositional generalization
- Break It Down: Evidence for Structural Compositionality in Neural Networks
- From Frege to chatGPT: Compositionality in language, cognition, and deep neural networks
- How do Transformers Learn Implicit Reasoning?
- Faith and Fate: Limits of Transformers on Compositionality
- Skills-in-Context Prompting: Unlocking Compositionality in Large Language Models
- Bigger is not always better: The importance of human-scale language modeling for psycholinguistics
- Nested Learning: The Illusion of Deep Learning Architectures
Original note title
compositional generalization emerges from scaling data and model size without explicit symbolic mechanisms