What makes recursive structure different from other forms of compositional generalization?
This explores why recursion — structure that nests inside itself, like a clause within a clause — seems to resist the scale-it-up cure that works for ordinary 'mix-and-match' compositional generalization.
This reads the question as asking what's special about *recursive* structure (a thing made of smaller copies of itself — a clause embedded in a clause, an operation applied to its own output) compared to the broader family of compositional generalization (recombining known parts in novel arrangements). The corpus draws a surprisingly clean line between the two, and it runs along the axis of *depth*.
For garden-variety compositional generalization, the encouraging news is that you may not need anything clever. Plain networks reach it through data and model scale alone, as long as the training distribution covers enough combinations of the underlying modules Can neural networks learn compositional skills without symbolic mechanisms? — and networks even tend to spontaneously sort compositional tasks into isolated, reusable subnetworks Do neural networks naturally learn modular compositional structure?. The catch is that much of this 'generalization' turns out to be memorized computation reused — transformers often succeed by matching linearized subgraphs they saw in training, then fail hard on genuinely novel compositions Do transformers actually learn systematic compositional reasoning?. Greff and colleagues frame the deeper obstacle as the *binding problem*: networks struggle to dynamically bind distributed pieces into fresh structures and reuse that structure in new combinations Why do neural networks fail at compositional generalization?.
Recursion is where this story breaks in a specific, telling way. The single sharpest data point in the corpus: Pushdown Layers add an explicit stack to attention and win 3–5x on *syntactic* generalization — and the authors note this matters precisely because recursive structure benefits from a built-in architectural bias even though general compositional generalization emerges from scale Can explicit stack tracking improve how transformers learn recursive syntax?. In other words: scale buys you recombination; it does not buy you nesting. The symptom shows up everywhere recursion appears — LLMs degrade *predictably* as syntactic depth and embedding increase, handling simple sentences fine while consistently botching deeply embedded clauses Does LLM grammatical performance decline with structural complexity?, misidentifying embedded clauses and complex nominals as structure stacks up Why do large language models fail at complex linguistic tasks?.
The reason recursion is different has a name in complexity theory. A fixed-depth transformer lives under the AC0/TC0 ceiling — it has a bounded number of sequential computation steps no matter how wide you make it. Recursion needs *re-application*: feeding an operation its own output, an unbounded number of times. The fixes that work are the ones that restore depth rather than width. Looped, parameter-shared 'recurrent-depth' transformers achieve systematic generalization and can extrapolate to deeper inputs than they trained on, emerging through a sharp three-phase grokking process Can looped transformers generalize to unseen knowledge combinations?. The Hierarchical Reasoning Model couples slow and fast recurrent timescales to escape that exact AC0/TC0 ceiling, nailing Sudoku and mazes where chain-of-thought collapses — with only 27M parameters Can recurrent hierarchies achieve reasoning that transformers cannot?.
So the thing you might not have known you wanted to know: compositional generalization and recursion fail for *different reasons*, and they want *different cures*. Recombination is a coverage problem — show the model enough of the space and breadth scaling closes the gap. Recursion is a *depth* problem — and you can't paper over it with more parameters or more data; you have to give the architecture a way to apply itself to its own output, whether that's an explicit stack Can explicit stack tracking improve how transformers learn recursive syntax? or genuine recurrence Can looped transformers generalize to unseen knowledge combinations? Can recurrent hierarchies achieve reasoning that transformers cannot?.
Sources 9 notes
Standard MLPs achieve compositional generalization through data and model scaling alone, without architectural modifications, provided the training distribution sufficiently covers combinations of task modules. Linear decodability of constituents from hidden activations reliably predicts success.
Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.
Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.
Greff et al. argue that neural networks cannot dynamically bind distributed information into compositional structures due to three failures: segregating entities from inputs, maintaining representational separation, and reusing learned structure in novel combinations. Scaling can partially overcome this by enabling compositional representations to emerge.
Pushdown Layers—a drop-in self-attention replacement with explicit stack tracking—achieve 3-5x more sample-efficient syntactic generalization while maintaining perplexity. The improvement shows that recursive structure specifically benefits from architectural inductive bias despite general compositional generalization emerging from scale.
LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.
Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
Recurrent-depth transformers with shared parameters across iterations enable systematic generalization and depth extrapolation that vanilla transformers cannot achieve. This emerges through a sharp three-phase process: memorization, in-distribution, then out-of-distribution generalization.
The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.