Can reasoning systems scale wider instead of only deeper?

Explores whether sampling multiple parallel latent trajectories offers a faster scaling path than recursive refinement alone. Matters because it could unlock latency-efficient reasoning at test time.

Synthesis note · 2026-05-28 · sourced from Reasoning Architectures

Recursive Reasoning Models (RRMs) increase reasoning capability by iterating a shared transition function over a latent state — more iterations means more "thinking" without extending the output sequence. This is depth scaling, and it decouples reasoning depth from both parameter count and output length. GRAM (Generative Recursive reAsoning Models) argues this is only half the story: depth alone is insufficient because a single refinement path can become trapped in a suboptimal trajectory, and many problems have ambiguity or multiple valid solutions that a single converging path cannot represent.

The structural claim is that future recursive reasoners should be not only deep (repeated refinement) but also wide (maintaining and exploring multiple latent trajectories in parallel). GRAM operationalizes width by turning the latent transition stochastic and sampling several trajectories simultaneously. Crucially, width sidesteps the latency penalty that depth-only scaling incurs: sampling N trajectories runs in parallel, whereas adding N refinement steps is serial and accumulates wall-clock time.

This reframes the inference-scaling design space for latent architectures. It mirrors at the latent-state level what parallel-vs-sequential debates established at the token level — since Why does parallel reasoning outperform single chain thinking?, breadth often beats depth under a fixed budget because independent paths sample the solution distribution rather than inflating variance along one path. GRAM brings that lesson inside the recurrent block, where prior work like Can models reason without generating visible thinking tokens? had only scaled depth. The counterpoint to watch: since Can parallel architectures solve inherently sequential problems?, width cannot substitute for depth on inherently serial problems — the two axes are complements, not interchangeable knobs. Why it matters: it gives latent reasoning a second, latency-cheap scaling dimension and explains why deterministic RRMs underperform on multi-solution tasks.

Inquiring lines that use this note as a source 136

This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.

Related concepts in this collection 4

This note in its neighbourhood — explore the map, then jump to a related concept in the list below.

Concept map

13 direct connections · 119 in 2-hop network ·dense cluster Open in graph ↗

Can reasoning systems scale wider instead of onl… Why does parallel reasoning outperform single chai… Can models reason without generating visible think… Can parallel architectures solve inherently sequen… Can we explore multiple reasoning paths without co…

Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph

your link semantically near linked from elsewhere

Why does parallel reasoning outperform single chain thinking? Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth.
the token-level analogue: breadth beats depth under fixed budget because independent paths sample the distribution
Can models reason without generating visible thinking tokens? Explores whether intermediate reasoning must be verbalized as text tokens, or if models can think in hidden continuous space. Challenges a foundational assumption about how language models scale their reasoning capabilities.
the depth-only RRM baseline GRAM extends with a width axis
Can parallel architectures solve inherently sequential problems? Complexity theory suggests some problems like reasoning and planning are fundamentally sequential. Can parallel architectures like Transformers overcome this limitation, or do we need fundamentally different computational approaches?
the limit: width cannot replace depth on inherently sequential problems
Can we explore multiple reasoning paths without committing to one token? Standard language models pick one token at each step, collapsing uncertainty and forcing single reasoning trajectories. Could preserving the full probability distribution across token embeddings enable implicit parallel exploration instead?
a related parallel-exploration mechanism, but in concept-token space rather than recurrent latent space

Can reasoning systems scale wider instead of only deeper?

Related concepts in this collection 4

Related papers in this collection 8

Search by related questions 4