Why does recursion on latent state drive generalization better than hierarchy?
This explores why a model that loops over its own internal 'thinking state' generalizes to hard problems better than one that stacks specialized layers or modules in a fixed hierarchy.
The corpus's sharpest data point on why recursion on latent state beats hierarchy is almost absurd: a single 7-million-parameter, two-layer network that simply *re-runs itself on its own evolving latent reasoning state* scores 45% on ARC-AGI-1, beating DeepSeek R1, o3-mini, and Gemini 2.5 Pro at roughly 0.01% of their size (see: Can tiny recursive networks outperform massive language models?). The headline isn't 'small is enough'; it's *what* did the work. Scale didn't, and a deeper fixed hierarchy of distinct layers didn't either. Iterating on a compressed internal state did.
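To make "re-runs itself on its own latent state" concrete, here is a minimal sketch in plain Python. It is not the network from the cited note: the tanh update, the random weights, and the names `step`, `recurse`, `w_z`, and `w_x` are all assumptions chosen for illustration. The only point it shows is that the same two weight matrices are reused on every pass, so running longer adds computation but no parameters.

```python
import math
import random


def make_weights(rows, cols, rng):
    """Small random matrix as a list of rows (hypothetical initialisation)."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(w, v):
    """Multiply matrix w (list of rows) by vector v."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]


def step(z, x, w_z, w_x):
    """One shared transition: the new latent depends on the old latent and the input."""
    pre = [a + b for a, b in zip(matvec(w_z, z), matvec(w_x, x))]
    return [math.tanh(p) for p in pre]


def recurse(x, w_z, w_x, n_steps):
    """Re-apply the same transition n_steps times; no new parameters per step."""
    z = [0.0] * len(w_z)  # latent state starts at zero (an assumption)
    for _ in range(n_steps):
        z = step(z, x, w_z, w_x)
    return z


rng = random.Random(0)
dim = 4
w_z = make_weights(dim, dim, rng)  # latent-to-latent weights, shared across passes
w_x = make_weights(dim, dim, rng)  # input-to-latent weights, shared across passes
x = [1.0, -0.5, 0.25, 0.0]

z_short = recurse(x, w_z, w_x, n_steps=2)   # shallow computation
z_long = recurse(x, w_z, w_x, n_steps=16)   # deeper computation, same weights
```

A trained model would learn `w_z` and `w_x` and decode an answer from the final latent; both of those parts are omitted here.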
Why would looping outperform stacking? A clue comes from the sample-complexity side: predicting your own latents recovers compositional, hierarchical structure with a number of examples that stays *constant* in the depth of that hierarchy, while predicting raw tokens needs exponentially many, because latents at the same level are far more correlated than surface tokens (see: Why is predicting latents more sample-efficient than tokens?). A hierarchy hard-codes how many levels of abstraction you get and bakes them into separate parameters. Recursion instead lets one shared transformation *climb* abstraction levels by reapplying itself, so the model isn't committing in advance to a fixed depth of structure; it discovers how many passes a given problem needs. Depth-as-composition rather than width is the same lesson at architecture scale: thin-and-deep sub-billion LLMs beat balanced ones, by 2.7–4.3% accuracy at 125M–350M parameters, precisely because composing concepts through repeated transformation generalizes better than spreading capacity sideways (see: Does depth matter more than width for tiny language models?).
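One way to picture "discovers how many passes a given problem needs" is a loop that stops once the latent state stops changing. The sketch below is an assumption on two counts: the notes do not say how the number of passes is chosen, and the toy update `tanh(0.5 * z + x)` was picked only because it is guaranteed to settle. The names `recurse_until_stable`, `tol`, and `max_steps` are hypothetical.

```python
import math


def step(z, x):
    """Toy shared transition (an assumption); it contracts, so iteration settles."""
    return [math.tanh(0.5 * zi + xi) for zi, xi in zip(z, x)]


def recurse_until_stable(x, tol=1e-6, max_steps=100):
    """Apply the same step until the latent moves less than tol.

    Returns the final latent and the number of passes actually used.
    """
    z = [0.0] * len(x)
    for n in range(1, max_steps + 1):
        z_next = step(z, x)
        delta = max(abs(a - b) for a, b in zip(z_next, z))
        z = z_next
        if delta < tol:
            return z, n
    return z, max_steps


# An input that leaves the latent at its starting point needs one pass;
# an input that moves it needs several passes of the same step.
z_easy, n_easy = recurse_until_stable([0.0, 0.0])
z_hard, n_hard = recurse_until_stable([1.0, -2.0])
```

A fixed stack of layers has no equivalent of `n`: its depth is set when the architecture is defined, whatever the input.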
There's a deeper reason hierarchy is the weaker bet: real generalization seems to want modularity that the network *finds*, not modularity you impose. Pruning studies show networks spontaneously route compositional subtasks into isolated subnetworks, and pretraining makes that emergent structure more reliable (see: Do neural networks naturally learn modular compositional structure?). Meanwhile the long-running Fodor-Pylyshyn debate has flipped from 'can connectionist models compose at all?' to 'how do they compose without explicit symbolic constituents?' (see: Can neural networks actually achieve compositional generalization?). A fixed hierarchy is an *assumed* decomposition; recursion on latent state lets the decomposition be learned and re-entered as needed.
The newest moves extend the recursive trick rather than retreating to hierarchy. Making the latent transition *stochastic* lets a recursive reasoner hold genuine uncertainty and represent a distribution over solutions instead of one guess (see: Can stochastic latent reasoning help models explore multiple solutions?). That same stochasticity unlocks scaling in *width*: sampling many parallel latent trajectories explores the solution space without paying the serial latency of going ever deeper (see: Can reasoning systems scale wider instead of only deeper?). Separately, treating latent 'thought vectors' as their own scaling axis, decoupled from parameter count, buys sample and few-shot efficiency that a bigger decoder alone wouldn't (see: Can latent thought vectors scale language models beyond parameters?).
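The width idea can be sketched as sampling several independent latent trajectories from one stochastic transition. This is a toy under stated assumptions, not GRAM's actual update: the additive Gaussian noise, the `tanh` step, and the names `noisy_step`, `sample_trajectory`, and `sample_width` are all hypothetical.

```python
import math
import random


def noisy_step(z, x, rng, noise_scale):
    """Toy stochastic transition: a deterministic update plus Gaussian noise."""
    return [
        math.tanh(0.5 * zi + xi) + rng.gauss(0.0, noise_scale)
        for zi, xi in zip(z, x)
    ]


def sample_trajectory(x, n_steps, seed, noise_scale=0.1):
    """Run one latent trajectory with its own random stream."""
    rng = random.Random(seed)
    z = [0.0] * len(x)
    for _ in range(n_steps):
        z = noisy_step(z, x, rng, noise_scale)
    return z


def sample_width(x, n_steps, n_paths, noise_scale=0.1):
    """Draw n_paths trajectories; none depends on another, so they could run in parallel."""
    return [sample_trajectory(x, n_steps, seed, noise_scale) for seed in range(n_paths)]


x = [0.3, -0.2, 0.1]
paths = sample_width(x, n_steps=4, n_paths=8)
```

Each trajectory pays for only `n_steps` serial passes; adding more paths increases how much of the solution space is sampled without lengthening that serial chain. Setting `noise_scale=0.0` collapses every path onto the same deterministic answer, which is the single-guess case the stochastic transition is meant to escape.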
The thing you might not have expected to learn: 'depth' here is doing two jobs we usually conflate. A hierarchy gives you depth in *parameters*: more distinct layers, more weights, fixed structure. Recursion gives you depth in *computation*: the same small transformation applied as many times as the problem demands. Generalization tracks the second, not the first. That's why a two-layer loop can outscore frontier models on ARC-AGI-1, and why the field's frontier is now about scaling the loop (stochastically, in parallel) rather than building taller ladders.
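The two meanings of depth can be separated with simple arithmetic. The sketch below assumes a hypothetical dense layer with `d * d` weights plus `d` biases; the function names are illustrative and not taken from any of the cited work.

```python
def layer_params(d):
    """Parameters in one dense d-by-d layer with a bias (an assumed layer shape)."""
    return d * d + d


def stacked_params(d, depth):
    """Fixed hierarchy: every level of depth has its own distinct layer."""
    return depth * layer_params(d)


def looped_params(d, depth):
    """Recursion: one shared layer, reapplied `depth` times."""
    return layer_params(d)  # independent of depth


def layer_applications(depth):
    """Both designs apply a layer `depth` times in a forward pass."""
    return depth


d = 64
for depth in (2, 8, 32):
    print(depth, layer_applications(depth), stacked_params(d, depth), looped_params(d, depth))
```

The number of layer applications grows identically in both designs, but only the stacked design's parameter count grows with it. In the loop, depth is something you choose at run time rather than something you train more weights for.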
Sources (8 notes)
A single 7M-parameter two-layer network recursing on its latent reasoning state achieves 45% on ARC-AGI-1 and 8% on ARC-AGI-2, beating DeepSeek R1, o3-mini, and Gemini 2.5 Pro with 0.01% of their parameters. Recursion on latent state, not scale or hierarchy, drives the generalization gain.
A formal sample-complexity analysis proves latent-level self-supervision (data2vec/JEPA style) recovers compositional structure with samples constant in hierarchy depth, while token-level learning requires exponential samples—because same-level latents are far more correlated than raw tokens.
MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.
Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.
DNNs and LLMs now demonstrate sophisticated compositional processing—complex syntax, logical reasoning chains, original code generation—challenging the classical Fodor-Pylyshyn argument that connectionism cannot support compositionality. The debate shifts from whether neural nets can compose to how they do so without explicit constituent structure.
GRAM replaces deterministic latent updates with stochastic sampling, enabling models to represent distributions over solutions rather than single predictions. This allows handling of ambiguous problems and multiple valid strategies that deterministic designs cannot represent.
GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.
Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.