Can reasoning systems scale wider instead of only deeper?
Explores whether sampling multiple parallel latent trajectories offers a faster scaling path than recursive refinement alone. Matters because it could unlock latency-efficient reasoning at test time.
Recursive Reasoning Models (RRMs) increase reasoning capability by iterating a shared transition function over a latent state — more iterations means more "thinking" without extending the output sequence. This is depth scaling, and it decouples reasoning depth from both parameter count and output length. GRAM (Generative Recursive reAsoning Models) argues this is only half the story: depth alone is insufficient because a single refinement path can become trapped in a suboptimal trajectory, and many problems have ambiguity or multiple valid solutions that a single converging path cannot represent.
The structural claim is that future recursive reasoners should be not only deep (repeated refinement) but also wide (maintaining and exploring multiple latent trajectories in parallel). GRAM operationalizes width by making the latent transition stochastic and sampling several trajectories simultaneously. Crucially, width sidesteps the latency penalty that depth-only scaling incurs: N sampled trajectories can run in parallel, whereas N additional refinement steps must run serially and accumulate wall-clock time.
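The depth/width distinction can be sketched with a toy latent state. This is a minimal illustration under stated assumptions, not GRAM's actual architecture: the scalar state, the two "solutions" at -1 and +1, and the names `transition`, `refine`, and `sample_wide` are all hypothetical.

```python
import random

def transition(state, rng, noise=0.0):
    """One shared refinement step (toy stand-in for a learned transition).

    Gaussian noise is added first, then the state moves halfway toward
    the nearer of two toy solutions at -1.0 and +1.0.
    """
    noisy = state + rng.gauss(0.0, noise)
    target = 1.0 if noisy >= 0 else -1.0
    return noisy + 0.5 * (target - noisy)

def refine(state, steps, rng, noise=0.0):
    """Depth scaling: apply the same transition `steps` times in series."""
    for _ in range(steps):
        state = transition(state, rng, noise)
    return state

def sample_wide(state, steps, width, noise, seed=0):
    """Width scaling: `width` independent trajectories of `steps` each.

    The loop below runs them one after another for simplicity; since the
    trajectories do not depend on each other, a real system could batch
    them so the serial path stays `steps` long regardless of `width`.
    """
    return [
        refine(state, steps, random.Random(seed + i), noise)
        for i in range(width)
    ]

# Deterministic depth-only refinement commits to a single solution.
deep = refine(0.0, steps=8, rng=random.Random(0), noise=0.0)

# With zero noise, extra width adds nothing: every trajectory is identical.
wide_deterministic = sample_wide(0.0, steps=8, width=4, noise=0.0)

# With a stochastic transition, trajectories diverge from the same start.
wide_stochastic = sample_wide(0.0, steps=8, width=16, noise=0.3)
```

In this sketch the deterministic run from the ambiguous start `0.0` always settles toward `+1.0`, so the other solution is never represented. Width only becomes useful once the transition is stochastic, which is the reason the note ties width to sampling.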
This reframes the inference-scaling design space for latent architectures. It mirrors, at the latent-state level, what parallel-versus-sequential debates established at the token level: as "Why does parallel reasoning outperform single chain thinking?" argues, breadth often beats depth under a fixed budget because independent paths sample the solution distribution rather than inflating variance along one path. GRAM brings that lesson inside the recurrent block, where prior work such as "Can models reason without generating visible thinking tokens?" had only scaled depth. The counterpoint to watch: as "Can parallel architectures solve inherently sequential problems?" argues, width cannot substitute for depth on inherently serial problems, so the two axes are complements, not interchangeable knobs. Why it matters: this framing gives latent reasoning a second, latency-cheap scaling dimension and explains why deterministic RRMs underperform on multi-solution tasks.
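A short worked formula shows why independent paths can help under a fixed budget. It rests on a simplifying assumption that the note does not state: each sampled trajectory reaches a valid solution independently with the same probability p, and some selector can recognize a valid result.

```latex
P(\text{at least one of } N \text{ trajectories succeeds}) = 1 - (1 - p)^{N}
% Example: p = 0.2, N = 8 gives 1 - 0.8^{8} \approx 0.83
```

The same formula also reflects the counterpoint: if a problem needs more serial steps than each trajectory takes, p stays near zero and adding width changes little.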
Inquiring lines that use this note as a source (136)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Can steering a single latent feature replicate chain-of-thought performance?
- Why does single-model routing beat ensemble and cascade approaches on latency?
- Can this distillation pattern apply beyond e-commerce to other latency-constrained domains?
- How do larger models maintain more parallel tasks than smaller models?
- How does SONAR embedding quality affect downstream reasoning accuracy?
- Can latent reasoning architectures work as retrofits to existing models?
- Why do current RLVR methods fail to expand reasoning capability beyond base model boundaries?
- Can the scaling law for discovery extend beyond architectures to agentic systems?
- Does test-time compute actually substitute for having larger model parameters?
- What is the trade-off between parallel and sequential scaling at test time?
- What advantages emerge from running 13 times more parallel reasoning chains with the same budget?
- How does inference compute substitution affect the training parameter scaling trade-off?
- Does parallel thinking benefit disproportionately from higher inference throughput architectures?
- Does layer-wise prediction stabilization provide a stronger trace quality signal than confidence alone?
- How does open-ended evolver reasoning identify patterns across heterogeneous user trajectories?
- How does the three-component definition apply to test-time scaling laws?
- What scaling behavior do partial systems show without iterative query refinement?
- Can test-time scaling prioritize genuine reasoning over pattern matching?
- Why do scaling laws fail to predict optimal architectures at small parameter counts?
- Can sequential computation through depth solve problems that parallel width cannot?
- What makes diffusion sampling preserve multiple optimal solutions better than alternatives?
- How does latent space diffusion enable evolutionary search in high dimensions?
- Can accelerated sampling techniques from image generation speed up evolutionary search?
- Why is active observation more efficient than passive message passing?
- Does inference-time compute scaling require explicit reasoning traces or verifiable rewards?
- Can latent recurrence and energy minimization both escape the same computational depth constraints?
- Does population-based evolution transcend the parallel versus sequential compute tradeoff?
- Can parallel thinking outperform sequential thinking under the same token budget?
- Can adaptive compute distribution across prompts replace the need for sophisticated reasoning frameworks?
- Does parallel sampling avoid failed-branch contamination more than sequential thinking?
- How does test-time compute substitute for model parameter scaling?
- How should iterative research tasks limit context per reasoning turn?
- Can test-time compute on smaller models replace larger model inference?
- Why does policy entropy collapse limit reasoning and dialogue RL scaling?
- Can latent reasoning in continuous space scale beyond supervised reasoning tasks?
- Why does depth outperform width for sub-billion parameter models?
- What mechanisms drive test-time compute allocation in reasoning tasks?
- Can multi-agent reasoning systems scale beyond current architectures?
- Can parallel independent reasoning outperform sequential iterative refinement?
- Does test-time compute scaling work for agentic deep research tasks?
- How does per-token adaptive compute improve efficiency in recurrent reasoning?
- Does trading model size for inference steps improve overall efficiency scaling?
- Can breadth-first search in continuous space outperform chain-of-thought on logical tasks?
- Do latent sequence vectors outperform per-token latent iterative computation for reasoning?
- Why does iterative refinement amplify rather than correct reasoning errors?
- How does meta-reasoning combine information distributed across multiple chains?
- How does MCTS combine parallel exploration with sequential reasoning depth?
- What makes multi-hypothesis generation better than single-path social reasoning?
- Why do scaling laws show capability saturation at specific thresholds?
- Can depth scaling and breadth scaling unlock independent capability axes?
- Do models excel at reasoning depth or memory breadth when scaling test time compute?
- Can targeted activation steering surface latent reasoning in base models?
- What makes reasoning-specific post-training different from standard parameter scaling?
- How should inference-time token budgets vary across models of different capability levels?
- Can parallel reasoning chains outperform longer sequential chains with the same compute?
- Why does parallel thinking outperform sequential thinking under the same token budget?
- How does shared-memory parallelism compare to independent sampling and turn-based debate?
- When does sequential reasoning provide exponential advantages over parallel voting?
- What makes diverse reasoning sources more valuable than deeper single paths?
- When does sequential chain-of-thought dramatically beat parallel voting approaches?
- How does training data format shape whether models reason in parallel or sequentially?
- Can latent reasoning mechanisms and recursive tracking mechanisms be combined effectively?
- Why do reasoning chains degenerate into undirected exploration at scale?
- What makes parallel thinking more efficient than sequential chains?
- What tree depth is achievable before GPU memory becomes the bottleneck?
- Why does parallel thinking outperform sequential thinking under token limits?
- How do beam search and MCTS traverse reasoning topologies?
- How does scaling reasoning capability actually reduce instruction-following ability?
- Can latent space represent reasoning dimensions that text cannot?
- Do substitute networks converge differently than complement networks?
- What distinguishes hierarchical dual-recurrence from flat parameter-sharing recurrence?
- How does dynamic recurrence during training improve depth extrapolation?
- Can recursive sub-calls decompose reasoning across multiple context chunks?
- How does RL compress reasoning path diversity during training?
- Do higher asymptote recipes unlock genuinely novel reasoning strategies?
- What changes when reasoning models adopt trajectory-response output formats?
- Why does parallel sampling fail on graph connectivity tasks?
- What makes a problem fundamentally sequential versus parallelizable?
- How does precomputing context reasoning reduce latency in stateful applications?
- How does soft thinking compare to sampling multiple independent reasoning paths?
- How do parallel sampling and sequential depth compare as scaling dimensions?
- Why do sequential derivation and parallel agent modeling conflict?
- Does unrestricted reasoning per search step degrade iterative quality over time?
- Why does more inference compute amplify wandering rather than solving it?
- Can historical and batch exploration be implemented with the same algorithmic mechanism?
- How should trajectory-aware PRMs weight backtracking and planning sentences?
- What computational cost does trajectory-bursty inference impose on per-query context requirements?
- Are some problems fundamentally unsolvable by parallel inference methods?
- Does parallel generation outperform sequential revision with equal tokens?
- What makes sparse models inefficient to train and deploy at scale?
- How does decoupling reasoning from tool observations improve parallel execution?
- Can embedding-cluster routing outperform a single frontier model?
- How does directional diversity compare to other forms of parallel planning?
- What happens to iterative search quality when reasoning depth is unconstrained?
- Can abstract placeholders be filled in parallel without breaking reasoning chains?
- Does decoupling reasoning reduce inference cost more than sequential scaling?
- Can memory and test-time compute scale together as a single axis?
- What limits external scaling when a model lacks reasoning foundation?
- Can test-time scaling work through retrieval rather than reasoning?
- How do soft token mixtures enable parallel reasoning exploration without explicit training?
- Can deterministic recurrent depth achieve the computational benefits of stochastic reasoning?
- Why does parallel sampling become more efficient when reasoning branches are memoryless?
- Why do hybrid memory and compute sparsity outperform pure parameter scaling?
- Why does iterative refinement fail when information stays constant?
- Do linearized traces genuinely expand exploration beyond standard chain-of-thought?
- Can group-relative normalization be modified to resist shortcut trajectories?
- When is numeric computation the real bottleneck versus reasoning depth?
- How much training data is truly necessary to unlock latent model reasoning?
- Can reasoning happen in latent space without chain of thought?
- Can width-scaling replace depth-scaling on inherently sequential problems?
- How do continuous concept tokens compare to latent trajectory sampling?
- What makes deterministic recursive reasoning models underperform on multi-solution tasks?
- Can sleep-time compute reduce latency demands during model inference?
- What inference-time scaling benefits emerge from reasoning before each prediction?
- What output distribution properties make smaller models better for wide sampling?
- How do external invocation latencies drive technique convergence?
- Can other posterior approximation schemes match variational inference performance?
- What computational structures can actually scale serial reasoning depth?
- Can latent recurrence overcome the trainability costs of depth?
- What mechanisms activate latent reasoning capabilities already present in base models?
- Why does single-shot learning fail in REVTHINK's multi-source reasoning tasks?
- How do sleep-time and post-completion methods reduce inference latency?
- Should prompt design and inference scaling be optimized together or separately?
- Can test-time compute scaling substitute for larger model parameters?
- What architectural variables most improve inference efficiency today?
- Why does the right structural prior matter more than raw model capacity?
- Can structured workflows unlock latent reasoning abilities that raw models don't show?
- Why does recursion on latent state drive generalization better than hierarchy?
- How does latent reasoning recursion compare to chain-of-thought reasoning?
- Can a single recursive network replace hierarchical dual-network architectures?
- How do compact latent dynamics enable planning without explicit chain of thought?
- Should agents use parallel or sequential scaling during test time?
- What makes looped latent computation more efficient than scaling attention capacity?
- How does single-pass generation differ from multi-stage synthesis architecturally?
- What power-law scaling patterns emerge when consistency models are trained at scale?
- What computational stages does a looped block re-enact across multiple iterations?
Related concepts in this collection (4)
- Why does parallel reasoning outperform single chain thinking?
  Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth.
  the token-level analogue: breadth beats depth under fixed budget because independent paths sample the distribution
- Can models reason without generating visible thinking tokens?
  Explores whether intermediate reasoning must be verbalized as text tokens, or if models can think in hidden continuous space. Challenges a foundational assumption about how language models scale their reasoning capabilities.
  the depth-only RRM baseline GRAM extends with a width axis
- Can parallel architectures solve inherently sequential problems?
  Complexity theory suggests some problems like reasoning and planning are fundamentally sequential. Can parallel architectures like Transformers overcome this limitation, or do we need fundamentally different computational approaches?
  the limit: width cannot replace depth on inherently sequential problems
- Can we explore multiple reasoning paths without committing to one token?
  Standard language models pick one token at each step, collapsing uncertainty and forcing single reasoning trajectories. Could preserving the full probability distribution across token embeddings enable implicit parallel exploration instead?
  a related parallel-exploration mechanism, but in concept-token space rather than recurrent latent space
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
- RLAD: Training LLMs to Discover Abstractions for Solving Reasoning Problems
- Generative Recursive Reasoning
- Learning to Think: Information-Theoretic Reinforcement Fine-Tuning for LLMs
- ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning
- Let Me Think! A Long Chain-of-Thought Can Be Worth Exponentially Many Short Ones
- Retrieval-augmented reasoning with lean language models
- Reasoning Models Are More Easily Gaslighted Than You Think
Original note title
reasoning systems should scale in width by sampling parallel latent trajectories not only in depth