Can recurrent hierarchies achieve reasoning that transformers cannot?
Can a dual-timescale recurrent architecture escape the computational limitations of standard transformers and solve complex reasoning tasks without explicit chain-of-thought (CoT)? This note explores whether architectural design, rather than scale, enables true algorithmic reasoning.
The Hierarchical Reasoning Model (HRM) is a recurrent architecture with two coupled modules: a high-level (H) module for slow, abstract planning and a low-level (L) module for fast, detailed computation. The key mechanism is "hierarchical convergence": the fast L-module runs multiple computational steps until it reaches a local equilibrium, then the slow H-module advances one step, and L is reset for a new phase. This avoids the premature convergence that plagues standard recurrent models, where the hidden state settles early and later steps contribute little.
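The nested update schedule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not HRM's implementation: the scalar states, the update functions `l_step` and `h_step`, their coefficients, and the parameters `n_cycles` and `t_steps` are all hypothetical. The "reset" of L is modeled here as L restarting its convergence under the updated H context without clearing its state, which is also an assumption of this sketch.

```python
import math


def l_step(z_l, z_h, x):
    """Hypothetical fast update: L sees its own state, the frozen H state, and the input."""
    return math.tanh(0.5 * z_l + 0.8 * z_h + x)


def h_step(z_h, z_l):
    """Hypothetical slow update: H absorbs the state that L settled into."""
    return math.tanh(0.6 * z_h + 0.9 * z_l)


def hierarchical_forward(x, n_cycles=4, t_steps=8):
    """Run n_cycles slow H-updates, each preceded by t_steps fast L-updates."""
    z_h, z_l = 0.0, 0.0
    l_updates = 0
    h_trace = []
    for _ in range(n_cycles):
        # Fast phase: H is held fixed while L iterates toward a local equilibrium.
        for _ in range(t_steps):
            z_l = l_step(z_l, z_h, x)
            l_updates += 1
        # Slow phase: H advances once, using the state L ended the phase with.
        z_h = h_step(z_h, z_l)
        h_trace.append(z_h)
        # Assumption: L is not zeroed. Because z_h has changed, L's next phase
        # converges toward a different equilibrium than the previous one.
    return z_h, l_updates, h_trace


z_h, l_updates, h_trace = hierarchical_forward(0.3)
print(l_updates, len(h_trace))
```

With `n_cycles=4` and `t_steps=8`, the loop performs 32 L-updates and 4 H-updates, so the number of sequential steps grows as the product of the two timescales while the parameter count stays fixed.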
The results are striking. With only 27 million parameters and 1,000 training samples, and with no pre-training or CoT data, HRM achieves near-perfect accuracy on Sudoku-Extreme Full and optimal pathfinding in 30×30 mazes, tasks where state-of-the-art CoT methods achieve 0% accuracy. On the Abstraction and Reasoning Corpus (ARC), a key AGI benchmark, it outperforms much larger models with significantly longer context windows.
The architecture is brain-inspired: the human brain organizes computation hierarchically across cortical regions operating at different timescales. Recurrent feedback loops iteratively refine representations — slow higher-level areas guide, fast lower-level circuits execute. The brain achieves this depth without backpropagation through time.
HRM mirrors this with a gradient approximation that needs only O(1) memory. Because each recurrent module converges to a fixed point, gradients can be computed at equilibrium in a single step rather than by unrolling through time. The gradient path is: output head → final H-state → final L-state → input embedding. There is no backpropagation through time (BPTT) and no O(T) memory cost, where T is the number of recurrent timesteps. This aligns with neuroscience evidence that cortical credit assignment uses short-range, temporally local mechanisms.
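The idea of differentiating only through the final update can be illustrated with a toy scalar recurrence. This is a sketch under stated assumptions, not HRM's training code: the linear map `z <- a*z + b*x`, the loss `0.5 * z**2`, and all function names are hypothetical choices made so that the exact gradient has a closed form to compare against.

```python
def iterate_to_fixed_point(a, b, x, steps=200):
    """Iterate z <- a*z + b*x, keeping only the current state (O(1) memory)."""
    z = 0.0
    for _ in range(steps):
        z = a * z + b * x
    return z


def one_step_grad_b(a, b, x, steps=200):
    """Gradient of loss = 0.5 * z**2 w.r.t. b, taken through the final update only."""
    # All earlier iterations are treated as a constant: nothing is stored for them.
    z_prev = iterate_to_fixed_point(a, b, x, steps - 1)
    z_final = a * z_prev + b * x
    dloss_dz = z_final
    dz_db = x  # z_prev is held constant, so only the direct dependence on b remains
    return dloss_dz * dz_db


def exact_grad_b(a, b, x):
    """Closed-form gradient at the true fixed point z* = b*x / (1 - a), for |a| < 1."""
    z_star = b * x / (1.0 - a)
    return z_star * x / (1.0 - a)


a, b, x = 0.5, 2.0, 3.0
print(one_step_grad_b(a, b, x), exact_grad_b(a, b, x))
```

In this toy case the one-step gradient equals the exact gradient scaled by `(1 - a)`, so it points in the same direction with a smaller magnitude. That corresponds to keeping only the first term of the series 1/(1 - a) = 1 + a + a² + ..., which is why no intermediate states need to be stored.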
The deeper implication: standard Transformers are "paradoxically shallow" despite deep learning's founding principle of stacking layers. Their fixed depth places them in constant-depth circuit complexity classes such as AC0 or TC0, so they are not Turing-complete and cannot execute complex algorithmic reasoning in a purely end-to-end manner. HRM's hierarchical recurrence escapes this constraint by achieving effectively unbounded computational depth.
This extends "Can models reason without generating visible thinking tokens?" with a third distinct architecture beyond depth-recurrent models and Heima: one that introduces hierarchical multi-timescale processing rather than uniform recurrence.
Inquiring lines that use this note as a source (54)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- What is selective resonance and why do transformers not perform it?
- How do transformers perform multi-hop reasoning across distant training documents?
- Can neural networks represent symbolic structures without explicit mechanisms?
- Why do human-designed neural architectures eventually get replaced by learned ones?
- How does error propagation limit transformer performance on complex tasks?
- Can symbolic mechanisms improve transformer compositional abilities?
- How do multimodal AI architectures compare to human brain export pathways?
- Why do hierarchical architectures better implement the deep research definition?
- Why do scaling laws fail to predict optimal architectures at small parameter counts?
- Can sequential computation through depth solve problems that parallel width cannot?
- Why do hybrid paradigms outperform pure autoregressive or pure diffusion approaches?
- How does circuit complexity limit which grammatical structures transformers can acquire?
- Can latent recurrence and energy minimization both escape the same computational depth constraints?
- Can neural networks implement genuine algorithms or only statistical pattern matching?
- How do hierarchical architectures separate planning from retrieval differently than flat ones?
- Why do standard transformers fail on problems requiring serial algorithmic reasoning?
- Does architectural design matter more than model scale for reasoning tasks?
- How do biological brains organize computation across different cortical timescales?
- How does per-token adaptive compute improve efficiency in recurrent reasoning?
- Could graph neural networks fundamentally outperform transformers on structured reasoning?
- Do decoder-only models have inherent architectural limits for non-sequential information?
- Can any architecture fundamentally solve problems that require inherently sequential computation?
- Why do standard transformers fail to encode recursive structure in their hidden states?
- What makes recursive structure different from other forms of compositional generalization?
- What distinguishes hierarchical dual-recurrence from flat parameter-sharing recurrence?
- How does dynamic recurrence during training improve depth extrapolation?
- Can transformers reason beyond fixed architectural depth limits?
- Can bounded-depth transformers solve inherently sequential problems?
- Can a single architecture represent both physical and mental possibility spaces?
- Can sub-task handlers be swapped between neural and symbolic systems?
- Why do hybrid memory systems outperform single-tier AI architectures?
- Can offline recurrent passes replicate sleep-based memory consolidation in AI?
- Can deterministic recurrent depth achieve the computational benefits of stochastic reasoning?
- Can energy-based transformers achieve deep reasoning without supervision?
- Can width-scaling replace depth-scaling on inherently sequential problems?
- What makes deterministic recursive reasoning models underperform on multi-solution tasks?
- What computational structures can actually scale serial reasoning depth?
- Do transformer architectures structurally bias models toward short-term optimization?
- What architectural alternatives can capture compositional structure beyond pooled cosine?
- Do KANs maintain their advantages in deep architectures and large-scale training?
- Can a single recursive network replace hierarchical dual-network architectures?
- What makes recurrent depth enable compositional generalization across tasks?
- How does single-pass generation differ from multi-stage synthesis architecturally?
- Can a two-layer network outgeneralize billion-parameter models through recursion alone?
- Why does looping computation outperform adding more transformer layers?
- Can recurrent transformers learn genuinely new computations beyond inference stages?
- How do fixed recurrent states trade off copying accuracy for filtering ability?
- What task profiles favor recurrent filtering over scaled attention mechanisms?
- Can looping enable reasoning capabilities that fixed-depth transformers fundamentally cannot achieve?
- How does selective looping in diffusion models differ from recurrence in autoregressive architectures?
- What computational stages does a looped block re-enact across multiple iterations?
- Does attention linearity alone explain the efficiency gains over standard transformers?
- Why does architecture matter more than training compute for inference efficiency?
- Can architectural changes reduce representational inequality in unified generators?
Related concepts in this collection (5)
- Can models reason without generating visible thinking tokens?
  Explores whether intermediate reasoning must be verbalized as text tokens, or if models can think in hidden continuous space. Challenges a foundational assumption about how language models scale their reasoning capabilities.
  HRM adds hierarchical dual-module architecture as a third latent reasoning approach
- Can models reason without generating visible thinking steps?
  Do machine reasoning systems actually require verbalized chains of thought, or can they solve complex problems through hidden computation? This challenges how we measure and understand reasoning.
  HRM provides the strongest empirical evidence: near-perfect on tasks where verbalized CoT fails completely
- Can parallel architectures solve inherently sequential problems?
  Complexity theory suggests some problems like reasoning and planning are fundamentally sequential. Can parallel architectures like Transformers overcome this limitation, or do we need fundamentally different computational approaches?
  HRM is an architecture that implements serial scaling through hierarchical recurrence
- Does more thinking time actually improve LLM reasoning?
  The intuition that extended thinking helps LLMs reason better seems obvious, but what does the empirical data actually show when we test it directly?
  HRM challenges from another direction: right architecture > more thinking tokens
- Can energy minimization unlock reasoning without domain-specific training?
  Can a gradient descent-based architecture achieve system 2 thinking across any modality or problem type using only unsupervised learning, without verifiers or reasoning-specific rewards?
  Alternative latent reasoning architecture: HRM uses hierarchical recurrence for serial depth, EBTs use energy minimization for iterative refinement; both escape TC0 limitation without verbalized tokens but via fundamentally different mechanisms
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Hierarchical Reasoning Model
- Generative Recursive Reasoning
- Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
- Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers
- A Mechanistic Analysis of Looped Reasoning Language Models
- Less is More: Recursive Reasoning with Tiny Networks
- Speed Always Wins: A Survey on Efficient Architectures for Large Language Models
- Faith and Fate: Limits of Transformers on Compositionality
Original note title
hierarchical dual-recurrence achieves effective computational depth that standard transformers cannot — enabling latent reasoning without chain of thought