When are multiple independent attempts more valuable than depth?
This explores when sampling many independent reasoning paths (breadth) beats pushing a single chain further (depth) — and what makes depth waste compute.
The question is a breadth-vs-depth tradeoff in how models reason: when does generating several independent attempts and aggregating them beat extending one chain of thought? The corpus lands on a clear pattern: breadth wins precisely where extra depth stops adding information and starts adding error. Under a fixed token budget, running multiple independent reasoning paths and taking a majority vote reaches up to 22% higher accuracy than stretching a single chain (see "Why does parallel reasoning outperform single chain thinking?"). The reason is diagnostic: extending one chain inflates variance without improving correctness, while diversity across paths samples the model's actual capability more faithfully.
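The breadth side of that comparison can be sketched in a few lines. This is a minimal illustration, not the method of any cited note: `sample_path` is a hypothetical callable standing in for one model call that takes a token budget and returns a final answer, and the even split of the budget across paths is an assumption.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def breadth_answer(
    sample_path: Callable[[int], str],  # hypothetical: token budget -> final answer
    total_budget: int,
    n_paths: int,
) -> str:
    """Split a fixed token budget evenly over n independent paths, then vote."""
    per_path = total_budget // n_paths
    answers = [sample_path(per_path) for _ in range(n_paths)]
    return majority_vote(answers)
```

The depth baseline under the same budget would be a single call, `sample_path(total_budget)`, so both strategies spend the same number of tokens.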
Why does depth disappoint? Several notes converge on the same failure. Long chains accumulate error step by step: genuine reasoning exists in CoT, but each additional step compounds noise (see "What three separate factors drive chain-of-thought performance?"). That is why accuracy traces an inverted U, peaking at an intermediate length and then declining, with more capable models preferring shorter chains (see "Why does chain of thought accuracy eventually decline with length?"). Longer isn't even a reliable signal of harder thinking: trace length tracks how close a problem sits to the training distribution, not problem difficulty (see "Does longer reasoning actually mean harder problems?"). And depth-only reasoning has its own pathology: models abandon promising paths mid-stream (underthinking), so penalizing premature thought-switching actually raises accuracy (see "Do reasoning models switch between ideas too frequently?").
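A decoding-time switching penalty of the kind mentioned above might look roughly like the sketch below. Everything specific here is an assumption for illustration: the function name, the dict-of-logits representation, the set `switch_token_ids` (tokens that open a new line of thought), and the `penalty` and `window` defaults. It is not the implementation from the cited note.

```python
from typing import Dict, Set


def penalize_switch_tokens(
    logits: Dict[int, float],        # token id -> raw logit for the next token
    switch_token_ids: Set[int],      # hypothetical ids of thought-transition tokens
    tokens_in_current_thought: int,  # how long the current thought has run so far
    penalty: float = 3.0,            # assumed penalty strength
    window: int = 600,               # assumed number of tokens the penalty stays active
) -> Dict[int, float]:
    """Lower the logits of switch tokens while the current thought is still young."""
    if tokens_in_current_thought >= window:
        # The thought has had room to develop; allow switching freely.
        return dict(logits)
    return {
        tok: (score - penalty if tok in switch_token_ids else score)
        for tok, score in logits.items()
    }
```

Because the adjustment touches only the logits at decoding time, it needs no change to the model's weights, which matches the decoding-only framing in the sources list.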
The deeper insight is that breadth and depth aren't really competitors for the same resource. The exploration-exploitation 'tradeoff' turns out to be a measurement artifact that only appears at the token level; hidden-state analysis shows the two can be enhanced simultaneously (see "Is the exploration-exploitation trade-off actually fundamental?"). That reframes the question: the win isn't 'spend tokens on breadth instead of depth,' it's 'structure the breadth well.' Allocating test-time compute to diverse abstractions beats naive parallel solution sampling at large budgets, because abstractions enforce a structured breadth-first search instead of redundant guesses (see "Can abstractions guide exploration better than depth alone?"). Even within a single trace, you can recover breadth: completing from intermediate reasoning points and taking the mode answer beats the final conclusion by up to 13%, because it mines alternatives before early commitment narrows the space (see "Can intermediate reasoning points yield better answers than final ones?").
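The within-trace version of breadth can be sketched as follows. This assumes, purely for illustration, that each non-empty line of a trace is one subthought and that `complete` is a hypothetical callable that continues from a partial trace and returns a final answer; the cited note's actual segmentation rule may differ.

```python
from collections import Counter
from typing import Callable, List


def prefixes(trace: str) -> List[str]:
    """Cumulative prefixes of a trace, one per non-empty line (assumed subthought unit)."""
    steps = [line for line in trace.splitlines() if line.strip()]
    return ["\n".join(steps[:i]) for i in range(1, len(steps) + 1)]


def mode_answer(trace: str, complete: Callable[[str], str]) -> str:
    """Complete from every intermediate point and return the most frequent answer."""
    answers = [complete(prefix) for prefix in prefixes(trace)]
    return Counter(answers).most_common(1)[0][0]
```

The last prefix is the full trace, so the original final answer still gets one vote; it simply no longer decides the outcome alone.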
What you might not expect: more attempts only help if you keep the good ones. Quality of aggregation matters more than raw quantity: step-level confidence filtering catches breakdowns that global averaging masks and matches majority-voting accuracy with far fewer traces (see "Does step-level confidence outperform global averaging for trace filtering?"). And the very capacity to hold multiple live possibilities can be built into the model: making latent reasoning stochastic lets a recursive reasoner represent a distribution over solutions rather than commit to one, which is breadth pushed inside the architecture itself (see "Can stochastic latent reasoning help models explore multiple solutions?").
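To make the contrast between local and global confidence concrete, here is a small sketch. It assumes each trace comes with a non-empty list of per-step confidence scores in [0, 1]; the sliding-window minimum, the `window` size, and the `threshold` are illustrative choices, not the scoring rule from the cited note.

```python
from collections import Counter
from statistics import mean
from typing import List, Tuple


def lowest_window_confidence(step_conf: List[float], window: int = 3) -> float:
    """Worst average confidence over any run of `window` consecutive steps."""
    if len(step_conf) <= window:
        return mean(step_conf)
    return min(
        mean(step_conf[i:i + window])
        for i in range(len(step_conf) - window + 1)
    )


def filtered_vote(
    traces: List[Tuple[str, List[float]]],  # (final answer, per-step confidences)
    threshold: float,
    window: int = 3,
) -> str:
    """Drop traces with a local confidence collapse, then majority-vote the rest."""
    kept = [
        answer for answer, conf in traces
        if lowest_window_confidence(conf, window) >= threshold
    ]
    pool = kept or [answer for answer, _ in traces]  # fall back if all are filtered
    return Counter(pool).most_common(1)[0][0]
```

A trace whose scores dip sharply for a few steps can still have a respectable global mean, which is the masking effect described above; the windowed minimum exposes the dip.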
So the answer to 'when': independent attempts beat depth when the problem is out-of-distribution (where long traces are just recalled schemas), when a single chain would compound error past the inverted-U peak, and whenever you have a way to select among diverse paths rather than averaging them flat. Depth is worth it only up to the point where the model is still adding signal — past that, the budget is better spent going wide.
Sources (10 notes)
Multiple independent reasoning paths with majority voting achieve up to 22% higher accuracy than extending a single chain under the same token budget. Parallel diversity samples reasoning capability more faithfully than sequential extension, which inflates variance without improving correctness.
A shift cipher study decomposed CoT into three independent factors: output probability alone swings accuracy from 26% to 70%, memorization matches pre-training frequency patterns, and genuine reasoning exists but accumulates error with each step. This resolves the reason-or-memorize debate by showing LLMs do both simultaneously.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.
o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.
Hidden-state analysis using Effective Rank metrics shows near-zero correlation between exploration and exploitation, revealing the trade-off emerges only at token level. VERL demonstrates simultaneous enhancement achieving 21.4% accuracy gains on Gaokao 2024.
RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.
Segmenting reasoning traces into subthoughts and prompting completions from each intermediate point yields mode answers up to 13% more accurate than final answers. This works because it mines alternative paths before early commitment narrows the solution space.
Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves comparable accuracy gains to naive majority voting with far fewer generated traces, proving trace quality matters more than quantity.
GRAM replaces deterministic latent updates with stochastic sampling, enabling models to represent distributions over solutions rather than single predictions. This allows handling of ambiguous problems and multiple valid strategies that deterministic designs cannot represent.