Can tiny recursive networks outperform massive language models?
Does a small network that refines its reasoning through recursion on a latent state actually generalize better than billion-parameter LLMs on hard puzzles like ARC-AGI? What makes recursion more powerful than scale?
Autoregressive LLMs are fragile on hard puzzles because a single wrong token can invalidate an answer, and the usual patches — chain-of-thought and test-time compute — are expensive, data-hungry, and brittle. The Tiny Recursive Model (TRM) takes the opposite bet: a single 2-layer network with only 7M parameters that recurses on its own latent reasoning feature and progressively improves its final answer. It reaches 45% on ARC-AGI-1 and 8% on ARC-AGI-2 — higher than most LLMs including DeepSeek R1, o3-mini, and Gemini 2.5 Pro — with less than 0.01% of their parameters.
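As a quick sanity check on the parameter claim, the arithmetic below works out how large a comparison model must be for 7M parameters to fall under 0.01% of it. Only the 7M count and the 0.01% bound come from the note; the function names and the 100B example size are illustrative assumptions.

```python
from fractions import Fraction


def share_percent(small_params: int, large_params: int) -> Fraction:
    """Return small_params as an exact percentage of large_params."""
    return 100 * Fraction(small_params, large_params)


def min_large_params(small_params: int, percent_limit: str) -> Fraction:
    """Model size at which small_params equals exactly percent_limit percent.

    Any larger model pushes the share strictly below the limit.
    """
    return Fraction(small_params) * 100 / Fraction(percent_limit)


if __name__ == "__main__":
    trm_params = 7_000_000  # from the note
    # Size at which 7M parameters is exactly 0.01% of the larger model.
    print(min_large_params(trm_params, "0.01"))
    # Share against a hypothetical 100B-parameter model (assumed size).
    print(float(share_percent(trm_params, 100_000_000_000)))
```

Under this reading, the "less than 0.01%" bound holds against any model larger than about 70B parameters.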
The key takeaway is what TRM removes relative to its predecessor HRM: no fixed-point theorem, no biological hierarchy, no two interacting networks, no extra halting forward pass. A single tiny network recursing on its own beats the hierarchical version, which isolates recursion on a latent state, rather than scale or hierarchy, as the source of generalization. (The authors are candid that no single choice is universally optimal: replacing self-attention with an MLP helped on Sudoku but hurt on other tasks, so the architecture still needs per-problem tuning, and scaling laws have yet to be established.)
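The idea of one small network recursing on a latent state can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the `TinyNet` class, the `recursive_refine` function, the tanh MLP layers, the step counts, and the zeroed input slot during the answer update are all assumptions made for the example, and the weights are random and untrained, so the outputs carry no meaning beyond showing the control flow.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Random weight matrix with n_out rows of n_in weights each."""
    scale = 1.0 / math.sqrt(n_in)
    return [[rng.uniform(-scale, scale) for _ in range(n_in)]
            for _ in range(n_out)]


def apply_layer(weights, vec):
    """Dense layer followed by tanh."""
    return [math.tanh(sum(w * v for w, v in zip(row, vec)))
            for row in weights]


class TinyNet:
    """Hypothetical 2-layer network reused at every recursion step."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.w1 = make_layer(3 * dim, dim, rng)  # reads x, y and z together
        self.w2 = make_layer(dim, dim, rng)

    def __call__(self, x, y, z):
        hidden = apply_layer(self.w1, x + y + z)  # list concatenation
        return apply_layer(self.w2, hidden)

    def n_params(self):
        return sum(len(r) for r in self.w1) + sum(len(r) for r in self.w2)


def recursive_refine(net, x, n_latent=6, n_cycles=3):
    """Alternate latent updates with answer updates using one network."""
    y = [0.0] * net.dim      # current answer
    z = [0.0] * net.dim      # latent reasoning state
    blank = [0.0] * net.dim  # assumed: input slot is zeroed for answer updates
    answers = []
    for _ in range(n_cycles):
        for _ in range(n_latent):
            z = net(x, y, z)   # refine the latent state
        y = net(blank, y, z)   # refine the answer from the latent state
        answers.append(list(y))
    return answers


if __name__ == "__main__":
    net = TinyNet(dim=4)
    for step, answer in enumerate(recursive_refine(net, [0.1, 0.2, 0.3, 0.4])):
        print(step, [round(v, 3) for v in answer])
```

The point of the sketch is structural: the same weights serve both the latent update and the answer update, so effective depth comes from repetition rather than from extra parameters, a second network, or a separate halting pass.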
This sharpens the vault's recurrence cluster. TRM directly simplifies HRM (see "Can recurrent hierarchies achieve reasoning that transformers cannot?"), and it agrees mechanistically with "How do looped transformer layers actually behave during inference?": recursion re-applies computation on a latent state, and that reuse, even at tiny scale, is what generalizes.
Inquiring lines that use this note as a source (6)
This note is a source for these synthesized inquiries.
- Do KANs maintain their advantages in deep architectures and large-scale training?
- Why does recursion on latent state drive generalization better than hierarchy?
- Can a single recursive network replace hierarchical dual-network architectures?
- How does upward distillation transfer knowledge from smaller to larger networks?
- Can a two-layer network outgeneralize billion-parameter models through recursion alone?
- Where do neural networks still fail at compositional generalization despite scaling?
Related concepts in this collection (3)
- Can recurrent hierarchies achieve reasoning that transformers cannot?
  Can a dual-timescale recurrent architecture escape the computational limitations of standard transformers and solve complex reasoning tasks without explicit chain-of-thought? This explores whether architectural design, not scale, enables true algorithmic reasoning.
  TRM strips HRM to one tiny network and generalizes better, isolating recursion from hierarchy
- How do looped transformer layers actually behave during inference?
  When language models loop their layers to improve reasoning, do they discover new computations or repeat existing ones? Understanding the internal dynamics could explain why recurrent architectures outperform simple depth scaling.
  mechanistic agreement: recursion reuses computation on a latent state
- Can looped transformers generalize to unseen knowledge combinations?
  Do transformers that reuse layers across iterations succeed where standard transformers fail at composing facts in novel ways? This matters because systematic generalization is a hallmark of human reasoning.
  both show recurrent depth buys generalization vanilla fixed-depth models lack
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Less is More: Recursive Reasoning with Tiny Networks
- Recursive Language Models
- MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
- Hierarchical Reasoning Model
- Beyond Scaling Law: A Data-Efficient Distillation Framework for Reasoning
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
- A Mechanistic Analysis of Looped Reasoning Language Models
- ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning
Original note title
a tiny two-layer network recursing on its latent reasoning state out-generalizes billion-parameter LLMs on hard puzzles — recursion not scale or hierarchy drives the gain