Can a single recursive network replace hierarchical dual-network architectures?
This explores a head-to-head architectural debate: whether the gains usually credited to splitting reasoning across two coordinated networks (slow planner + fast computer) actually come from recursion itself — meaning one small network looping on its own latent state could do the same job.
The corpus contains both sides of this argument almost as a direct rebuttal. The Hierarchical Reasoning Model (HRM) couples a slow abstract-planning network with a fast detail-computing network across two timescales. On hard symbolic tasks like Sudoku and mazes it reaches near-perfect performance where chain-of-thought methods fail, with only 27M parameters and 1,000 samples, escaping the fixed-depth complexity ceiling that limits ordinary transformers (Can recurrent hierarchies achieve reasoning that transformers cannot?). The natural reading is that the hierarchy, two networks at two speeds, is what buys the extra reasoning depth.
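A minimal sketch of the two-timescale control flow, not HRM's actual implementation: the scalar states, the update rules, and the names `fast_step`, `slow_step`, and `two_timescale_forward` are all assumptions made for illustration. It shows only the scheduling idea, in which the fast module takes several steps for every single update of the slow module, so the number of sequential updates grows with the loop counts rather than with the number of layers.

```python
import math

def fast_step(z_low, z_high, x):
    # Hypothetical fast "computer" update: mixes its own state,
    # the current planner state, and the input.
    return math.tanh(0.5 * z_low + 0.3 * z_high + 0.2 * x)

def slow_step(z_high, z_low):
    # Hypothetical slow "planner" update: reads the fast module's latest state.
    return math.tanh(0.6 * z_high + 0.4 * z_low)

def two_timescale_forward(x, n_cycles=3, fast_steps_per_cycle=4):
    """Run several fast steps per slow step and count the updates applied."""
    z_high, z_low = 0.0, 0.0
    fast_updates = 0
    slow_updates = 0
    for _ in range(n_cycles):
        for _ in range(fast_steps_per_cycle):
            z_low = fast_step(z_low, z_high, x)
            fast_updates += 1
        z_high = slow_step(z_high, z_low)
        slow_updates += 1
    return z_high, fast_updates, slow_updates

z_high, fast_updates, slow_updates = two_timescale_forward(1.0)
```

With the default arguments the call applies 12 fast updates and 3 slow updates while reusing the same two update rules throughout.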
But the Tiny Recursive Model (TRM) pulls that conclusion apart. A single two-layer, 7M-parameter network that simply recurses on its own latent reasoning state scores 45% on ARC-AGI-1 and 8% on ARC-AGI-2, beating DeepSeek R1, o3-mini, and Gemini 2.5 Pro with roughly 0.01% of their parameters (Can tiny recursive networks outperform massive language models?). The claimed lesson is blunt: recursion on latent state, not scale and not hierarchy, drives the generalization. So the answer to the literal question is "apparently yes," and the more interesting finding is that the dual-network design may have been over-credited for what recursion alone delivers.
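The single-network alternative can be sketched in the same toy style. This is an illustration under stated assumptions, not the Tiny Recursive Model's code: the scalar states, the placeholder weights, and the names `make_tiny_net` and `recursive_forward` are hypothetical. The point it shows is that one shared update function is reused for both the latent state and the answer, so there is no second network at all.

```python
import math

def make_tiny_net(w_a=0.5, w_b=0.3, w_c=0.2):
    """Return one shared update function; the weights are arbitrary placeholders."""
    def net(a, b, c):
        return math.tanh(w_a * a + w_b * b + w_c * c)
    return net

def recursive_forward(net, x, n_improve=3, n_latent=6):
    """Refine a latent state z several times, then refine the answer y, and repeat."""
    y = 0.0  # current answer guess
    z = 0.0  # latent reasoning state
    calls = 0
    for _ in range(n_improve):
        for _ in range(n_latent):
            z = net(z, x, y)   # recurse on the latent state
            calls += 1
        y = net(y, z, 0.0)     # same function updates the answer from the latent
        calls += 1
    return y, calls

answer, calls = recursive_forward(make_tiny_net(), x=1.0)
```

Here the default settings make 21 sequential calls to the same function, which is the sense in which depth comes from the loop rather than from extra parameters.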
There's a deeper pattern worth pulling in here, because "collapse a multi-part system into one recursive process" shows up elsewhere too. The Thread Inference Model (TIM) structures reasoning as recursive subtask trees and explicitly lets a single model do work that previously needed multi-agent systems, by handling the full recursive decomposition internally (Can recursive subtask trees overcome context window limits?). The unifying move in both TRM and TIM: depth-through-recursion substitutes for structural division of labor.
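A toy sketch of recursive decomposition with pruning, assuming a task is either an integer leaf or a list of subtasks whose result is their sum. The `solve` function, the `memory` list, and the `stats` counters are hypothetical stand-ins; in particular, `memory` only loosely mimics a working cache and is not the rule-based KV cache pruning the source note describes. It shows how dropping a finished subtask's intermediate entries keeps the live working set much smaller than the total work generated.

```python
def record(memory, stats, entry):
    # Append one working-memory entry and update the counters.
    memory.append(entry)
    stats["total"] += 1
    stats["peak"] = max(stats["peak"], len(memory))

def solve(task, memory, stats):
    """Solve a nested task recursively, pruning each subtask's scratch entries."""
    if isinstance(task, int):
        record(memory, stats, ("leaf", task))
        return task
    mark = len(memory)            # where this subtask's scratch space starts
    total = 0
    for sub in task:
        total += solve(sub, memory, stats)
    del memory[mark:]             # prune everything the subtasks wrote
    record(memory, stats, ("result", total))  # keep only the summary
    return total

memory = []
stats = {"total": 0, "peak": 0}
result = solve([[1, 2], [3, [4, 5]], 6], memory, stats)
```

For this input, 10 entries are written in total but at most 4 are live at once, and only the final result remains in `memory` when the call returns.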
That said, the corpus also pushes back on treating recursion as a free lunch. Depth-only scaling pays a serial latency cost, and GRAM shows reasoning systems can scale in *width* instead: stochastic latent transitions let it sample independent latent trajectories in parallel, exploring the solution space without that serial cost and without inflating variance (Can reasoning systems scale wider instead of only deeper?). Separating planning from synthesis also genuinely helps in other settings: hierarchical retrieval architectures that split query planning from answer synthesis reduce interference and win on multi-hop queries (Do hierarchical retrieval architectures outperform flat ones on complex queries?). So "separation" isn't worthless; it's just not what was doing the heavy lifting in HRM's reasoning depth.
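The width-versus-depth trade can be sketched as best-of-K sampling over short stochastic rollouts. Everything here is an assumption for illustration rather than GRAM's method: the transition rule in `rollout`, the oracle-style `score` against a known target, and the name `sample_wide` are hypothetical. The sketch runs the rollouts in a plain Python loop, but because they are independent they could be batched, so the serial step count stays at `depth` no matter how large `width` is.

```python
import random

def rollout(x, depth, rng, noise=0.5):
    # One short latent trajectory with a stochastic transition at every step.
    z = 0.0
    for _ in range(depth):
        z = 0.5 * z + 0.5 * x + rng.gauss(0.0, noise)
    return z

def score(z, target):
    # Hypothetical verifier: closer to the target is better.
    return -abs(z - target)

def sample_wide(x, target, width, depth, seed=0):
    """Sample `width` independent trajectories and keep the best-scoring one."""
    rng = random.Random(seed)
    candidates = [rollout(x, depth, rng) for _ in range(width)]
    best = max(candidates, key=lambda z: score(z, target))
    return best, candidates

best_1, _ = sample_wide(1.0, target=1.0, width=1, depth=4)
best_8, candidates = sample_wide(1.0, target=1.0, width=8, depth=4)
```

Because both calls share a seed, the single trajectory from the first call is also the first candidate in the second, so the wider sample can only match or improve on it under this scoring rule.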
The less obvious finding: neural networks tend to grow modular, isolated subnetworks for compositional subtasks *on their own*, without anyone designing the split (Do neural networks naturally learn modular compositional structure?). That reframes the whole question. A single recursive network may not be "replacing" the hierarchy so much as internalizing it: discovering the planner/computer division implicitly through training rather than having it hard-wired as two boxes. The architecture debate may matter less than where the modularity lives.
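What an ablation test for modularity measures can be shown with a deliberately hand-wired toy. The modular structure below is built in by construction purely to illustrate the check; it says nothing about how such structure emerges in trained networks, and the names `UNIT_GROUPS`, `forward`, and `ablation_effect` are hypothetical. The test zeroes one group of hidden units and reports which outputs change.

```python
# Hypothetical assignment of hidden units to the function they implement.
UNIT_GROUPS = {"double": (0, 1), "negate": (2, 3)}

def forward(x, mask=(1, 1, 1, 1)):
    """Toy model: units 0-1 compute 2*x, units 2-3 compute -x."""
    hidden = [x, x, -0.5 * x, -0.5 * x]
    hidden = [h * m for h, m in zip(hidden, mask)]  # apply the ablation mask
    return {
        "double": hidden[0] + hidden[1],
        "negate": hidden[2] + hidden[3],
    }

def ablation_effect(x, group):
    """Zero one unit group and return the set of outputs that changed."""
    mask = [1, 1, 1, 1]
    for i in UNIT_GROUPS[group]:
        mask[i] = 0
    base = forward(x)
    ablated = forward(x, mask)
    return {name for name in base if base[name] != ablated[name]}

changed_by_double = ablation_effect(3.0, "double")
changed_by_negate = ablation_effect(3.0, "negate")
```

In a cleanly modular model each ablation changes only its own function, which is the signature the pruning experiments look for.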
Sources (6 notes)
The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.
A single 7M-parameter two-layer network recursing on its latent reasoning state achieves 45% on ARC-AGI-1 and 8% on ARC-AGI-2, beating DeepSeek R1, o3-mini, and Gemini 2.5 Pro with 0.01% of their parameters. Recursion on latent state, not scale or hierarchy, drives the generalization gain.
The Thread Inference Model demonstrates that reasoning structured as recursive subtask trees with rule-based KV cache pruning sustains accurate reasoning beyond context limits, even when manipulating 90% of the cache. This enables single models to replace multi-agent systems by handling full recursive reasoning internally.
GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.
Separating query planning from answer synthesis into distinct components reduces interference and improves multi-hop query performance. This architectural principle mirrors documented benefits of separating planning from execution in agent design.
Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.