Can recurrent hierarchies achieve reasoning that transformers cannot?

Can a dual-timescale recurrent architecture escape the computational limitations of standard transformers and solve complex reasoning tasks without explicit chain-of-thought (CoT)? This note explores whether architectural design, rather than scale, enables true algorithmic reasoning.

Synthesis note · 2026-02-23 · sourced from Novel Architectures

The Hierarchical Reasoning Model (HRM) is a recurrent architecture with two coupled modules: a high-level (H) module for slow, abstract planning and a low-level (L) module for fast, detailed computation. The key mechanism is "hierarchical convergence" — the fast L-module completes multiple computational steps and reaches local equilibrium, then the slow H-module advances, and L is reset for a new phase. This avoids the rapid premature convergence that plagues standard recurrent models.
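
The nested update schedule described above can be sketched in a few lines of plain Python. This is a toy illustration, not the HRM implementation: the scalar states, the update functions `f_low` and `f_high`, and the weights `w` and `v` are all hypothetical stand-ins chosen so that the fast loop is a contraction. The "reset" is modelled here as the L-module receiving a new H-state, which restarts its convergence toward a different equilibrium; whether the real model also re-initialises the L-state is not specified in this note.

```python
import math

def f_low(z_low, z_high, x, w=0.5):
    # Hypothetical fast update. Its derivative w.r.t. z_low is at most w < 1,
    # so repeated application converges to a local equilibrium.
    return math.tanh(w * z_low + z_high + x)

def f_high(z_high, z_low, v=0.5):
    # Hypothetical slow update, applied once per cycle using the settled L-state.
    return math.tanh(v * z_high + z_low)

def hrm_forward(x, n_cycles=3, t_steps=8):
    z_high, z_low = 0.0, 0.0
    trace = []  # (cycle, step, size of the L-state change at that step)
    for cycle in range(n_cycles):
        # Fast phase: L iterates toward equilibrium while H is held fixed.
        for step in range(t_steps):
            new_low = f_low(z_low, z_high, x)
            trace.append((cycle, step, abs(new_low - z_low)))
            z_low = new_low
        # Slow phase: H advances once, giving L a new context for the next cycle.
        z_high = f_high(z_high, z_low)
    return z_high, z_low, trace

z_high, z_low, trace = hrm_forward(x=0.3)
for cycle, step, delta in trace:
    print(f"cycle {cycle} step {step}: |delta z_low| = {delta:.6f}")
```

In this toy, the size of the L-state update shrinks within each cycle and then jumps again right after H advances, which is the behaviour the term "hierarchical convergence" refers to. The total effective depth is `n_cycles * t_steps` sequential updates.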

The results are striking. With only 27 million parameters and 1,000 training samples, and with no pre-training or CoT data, HRM achieves near-perfect accuracy on Sudoku-Extreme Full and finds optimal paths in 30×30 mazes, tasks where state-of-the-art CoT methods achieve 0% accuracy. It also outperforms much larger models with significantly longer context windows on ARC, a key AGI benchmark.

The architecture is brain-inspired: the human brain organizes computation hierarchically across cortical regions operating at different timescales. Recurrent feedback loops iteratively refine representations — slow higher-level areas guide, fast lower-level circuits execute. The brain achieves this depth without backpropagation through time.

HRM mirrors this with an O(1)-memory gradient approximation. Because each recurrent module converges to a fixed point, gradients can be computed at equilibrium in a single step rather than by unrolling through time. The gradient path is: output head → final H-state → final L-state → input embedding. This requires no backpropagation through time (BPTT) and no O(T) memory, where T is the number of recurrent steps. It also aligns with neuroscience evidence that cortical credit assignment uses short-range, temporally local mechanisms.
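
To make the single-step gradient concrete, here is a minimal scalar sketch. It assumes a hypothetical one-parameter recurrence z = tanh(w·z + x) with a squared-error loss, neither of which comes from the HRM source. It compares the one-step gradient, which differentiates only through the last update and treats the incoming state as a constant, with the exact implicit gradient at the fixed point. The forward loop stores no history, so its memory use does not grow with the number of iterations.

```python
import math

def fixed_point(w, x, n_iters=100):
    # Iterate to equilibrium without storing intermediate states (O(1) memory).
    z = 0.0
    for _ in range(n_iters):
        z = math.tanh(w * z + x)
    return z

def gradients(w, x, y):
    z = fixed_point(w, x)
    dloss_dz = z - y            # loss = 0.5 * (z - y) ** 2
    sech2 = 1.0 - z * z         # tanh'(pre-activation), valid because z = tanh(w*z + x)
    df_dw = sech2 * z           # partial derivative of the update w.r.t. w
    df_dz = sech2 * w           # Jacobian of the update w.r.t. the state
    one_step = dloss_dz * df_dw                  # last step only, state held constant
    exact = dloss_dz * df_dw / (1.0 - df_dz)     # implicit function theorem
    return one_step, exact, df_dz

one_step, exact, jac = gradients(w=0.5, x=0.3, y=1.0)
print("one-step gradient:", one_step)
print("exact gradient:   ", exact)
print("state Jacobian:   ", jac)
```

In this toy, the one-step value equals the exact gradient scaled by (1 − J), where J is the state Jacobian at equilibrium. The two share a sign, and they get closer as J gets smaller. This is only an illustration of why a truncated gradient can still point in a useful direction; it says nothing about how well the approximation holds in the full model.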

The deeper implication: standard Transformers are "paradoxically shallow" despite deep learning's founding principle of stacking layers. Their fixed depth places them in AC0/TC0 complexity classes — they are not Turing-complete and cannot execute complex algorithmic reasoning in a purely end-to-end manner. HRM's hierarchical recurrence escapes this constraint by achieving effectively unbounded computational depth.

This extends the note "Can models reason without generating visible thinking tokens?" with a third distinct architecture beyond depth-recurrent models and Heima: one that introduces hierarchical multi-timescale processing rather than uniform recurrence.

Related concepts in this collection 5

Can models reason without generating visible thinking tokens?
Explores whether intermediate reasoning must be verbalized as text tokens, or if models can think in hidden continuous space. Challenges a foundational assumption about how language models scale their reasoning capabilities.
Relation: HRM adds a hierarchical dual-module architecture as a third latent reasoning approach.

Can models reason without generating visible thinking steps?
Do machine reasoning systems actually require verbalized chains of thought, or can they solve complex problems through hidden computation? This challenges how we measure and understand reasoning.
Relation: HRM provides the strongest empirical evidence: near-perfect accuracy on tasks where verbalized CoT fails completely.

Can parallel architectures solve inherently sequential problems?
Complexity theory suggests some problems like reasoning and planning are fundamentally sequential. Can parallel architectures like Transformers overcome this limitation, or do we need fundamentally different computational approaches?
Relation: HRM is an architecture that implements serial scaling through hierarchical recurrence.

Does more thinking time actually improve LLM reasoning?
The intuition that extended thinking helps LLMs reason better seems obvious, but what does the empirical data actually show when we test it directly?
Relation: HRM challenges this from another direction: the right architecture can matter more than additional thinking tokens.

Can energy minimization unlock reasoning without domain-specific training?
Can a gradient descent-based architecture achieve system 2 thinking across any modality or problem type using only unsupervised learning, without verifiers or reasoning-specific rewards?
Relation: An alternative latent reasoning architecture. HRM uses hierarchical recurrence for serial depth, while EBTs use energy minimization for iterative refinement. Both escape the TC0 limitation without verbalized tokens, but via fundamentally different mechanisms.
