Does parallel generation outperform sequential revision with equal tokens?
This explores whether running several reasoning attempts in parallel (and voting) actually beats having a model write one chain and then revise it — when both spend the same number of tokens.
The corpus answers this surprisingly cleanly: under equal tokens, breadth tends to win over depth, and revision in particular is often a net negative. The most direct evidence is that multiple independent reasoning paths with majority voting reach up to 22% higher accuracy than extending a single chain on the same budget, because parallel diversity samples the model's true capability while stretching one chain mostly inflates variance without adding correctness (see: Why does parallel reasoning outperform single chain thinking?). The revision side of the ledger is worse than just inefficient: in o1-style models, most self-revisions keep a wrong answer, and smaller models frequently flip a correct answer to an incorrect one; longer chains with more revision steps correlate with *lower* accuracy (see: Does self-revision actually improve reasoning in language models?). So the naive read is 'parallel wins, stop revising.'
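To make the voting side concrete, here is a minimal sketch of majority voting over independent samples. It is a toy simulation, not the setup behind the 22% figure: the names `majority_vote` and `simulate_vote_accuracy`, the per-chain accuracy `p_correct`, and the assumption that wrong answers scatter uniformly over a few options are all illustrative. It also assumes every chain is equally accurate, which ignores that splitting a fixed budget makes each chain shorter.

```python
import random
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def simulate_vote_accuracy(p_correct, n_chains, n_wrong_options=4,
                           trials=2000, seed=0):
    """Estimate how often a majority vote over n_chains samples is right.

    Assumption (hypothetical toy model): each chain is independently
    correct with probability p_correct, and a wrong chain picks one of
    n_wrong_options distinct wrong answers uniformly at random.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        answers = []
        for _ in range(n_chains):
            if rng.random() < p_correct:
                answers.append("right")
            else:
                answers.append(f"wrong-{rng.randrange(n_wrong_options)}")
        hits += majority_vote(answers) == "right"
    return hits / trials


if __name__ == "__main__":
    for k in (1, 3, 9):
        print(k, simulate_vote_accuracy(0.6, k))
```

In this toy model the vote helps only because the wrong answers disagree with each other while the right answers agree, which is the 'sampling the model's true capability' intuition in miniature.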
But the interesting part is the boundary condition, because the corpus also contains a clean counterexample. On genuinely compositional problems, where step N literally requires the result of step N-1 (graph connectivity is the example), sequential chain-of-thought has an *exponential* advantage over parallel voting, because short independent chains simply cannot accumulate the intermediate results the problem demands (see: When does sequential reasoning beat parallel voting?). The reconciliation: parallel sampling wins when the bottleneck is *sampling the solution space* (the model can reach the answer some of the time, so you just need enough rolls of the dice), and sequential depth wins when the bottleneck is *accumulating dependent computation* (no amount of parallel rolls substitutes for the chained intermediate state). 'Equal tokens' isn't one question; it's two, decided by whether your task is wide or deep.
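The 'deep' case can be illustrated with a small reachability sketch. This is an analogy rather than the formal construction from the cited note: the function `reachable` and its `max_steps` cap are hypothetical, with the cap standing in for a chain that is too short. Each expansion step needs the frontier produced by the previous step, so a capped run stops short no matter how many times it is repeated.

```python
def reachable(edges, start, max_steps=None):
    """Return the set of nodes reachable from start in an undirected graph.

    max_steps (hypothetical knob) caps the number of frontier expansions,
    standing in for a reasoning chain with a limited number of steps.
    """
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    seen = {start}
    frontier = {start}
    steps = 0
    while frontier and (max_steps is None or steps < max_steps):
        # Step N depends on the frontier computed in step N-1.
        frontier = {n for node in frontier
                    for n in adjacency.get(node, ())} - seen
        seen |= frontier
        steps += 1
    return seen


if __name__ == "__main__":
    path = [(i, i + 1) for i in range(5)]  # 0-1-2-3-4-5
    print(5 in reachable(path, 0))               # full-depth run
    print(5 in reachable(path, 0, max_steps=2))  # truncated run
```

Because the truncated run is deterministic, voting over many copies of it returns the same incomplete set; only more sequential steps reach the far end of the path.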
There's also a third option the question's framing hides: you don't have to choose between many discrete chains and one revised chain. GRAM scales 'width' by sampling parallel *latent* trajectories, matching the benefits of token-level parallelism without the serial latency of depth-only scaling (see: Can reasoning systems scale wider instead of only deeper?). Soft Thinking keeps the whole probability distribution alive as continuous 'concept tokens' so multiple reasoning paths stay in superposition rather than committing to one token, and it does this while *cutting* tokens by about 22% (see: Can we explore multiple reasoning paths without committing to one token?). Diffusion-style models go further and dissolve the parallel-vs-sequential distinction entirely: ICE refines reasoning and answer *simultaneously* in place, with answer confidence converging early enough to early-exit and halve compute (see: Can reasoning and answers be generated separately in language models?). These suggest the real efficiency frontier isn't 'more parallel votes' but 'explore breadth without paying for discrete sampling.'
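The concept-token idea can be sketched in a few lines. This is a rough illustration built only from the description in the source summary (probability-weighted embeddings plus an entropy-based stop), not the method's actual implementation: the helper names, the `threshold` and `patience` parameters, and the rule of stopping after several consecutive low-entropy steps are assumptions.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def concept_token(probs, embeddings):
    """Probability-weighted mix of token embeddings instead of one sampled token."""
    dim = len(embeddings[0])
    return [sum(p * emb[i] for p, emb in zip(probs, embeddings))
            for i in range(dim)]


def should_stop(entropies, threshold, patience):
    """Assumed stopping rule: the last `patience` steps were all low-entropy."""
    if len(entropies) < patience:
        return False
    return all(h < threshold for h in entropies[-patience:])


if __name__ == "__main__":
    vocab_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy vocabulary
    probs = softmax([2.0, 1.0, 0.1])
    print(concept_token(probs, vocab_embeddings), entropy(probs))
```

The design point is that `concept_token` never commits: a near-uniform distribution yields a blend of candidates, while a confident distribution collapses toward a single embedding and pushes the entropy low enough to stop.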
Why is sequential revision so weak in the first place? Two deeper notes hint at it. Autoregressive generation has no retraction primitive (it can't take back an emitted token), which is exactly why it stumbles on constraint problems that depend on discarding bad partial work (see: Why does autoregressive generation fail at constraint satisfaction?). 'Revision' in a left-to-right model isn't real backtracking; it's appending more text and hoping the continuation overrides the earlier mistake, which is why it so often doesn't. And self-improvement through revision is formally bounded anyway: a model can't reliably verify and fix itself without an external signal, so iterating in place hits a ceiling that more parallel samples (each an independent draw) partly sidestep (see: What stops large language models from improving themselves?).
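The missing retraction primitive is easiest to see next to a solver that has one. The sketch below is an analogy, not a model of a transformer: `greedy_color` is a hypothetical stand-in for append-only generation (every choice is final), while `backtrack_color` is a textbook backtracking search whose `del` line is the retraction step. The example graph and visiting order are chosen so the greedy pass paints itself into a corner.

```python
def conflicts(node, color, neighbors, assignment):
    """True if any already-colored neighbor of node has this color."""
    return any(assignment.get(nb) == color for nb in neighbors[node])


def greedy_color(neighbors, colors):
    """Append-only: commit to the first legal color and never undo it."""
    assignment = {}
    for node in neighbors:
        for color in colors:
            if not conflicts(node, color, neighbors, assignment):
                assignment[node] = color
                break
        else:
            return None  # stuck, and earlier choices cannot be retracted
    return assignment


def backtrack_color(neighbors, colors, order=None, assignment=None):
    """Depth-first search that discards invalid partial assignments."""
    if order is None:
        order = list(neighbors)
    if assignment is None:
        assignment = {}
    if len(assignment) == len(order):
        return dict(assignment)
    node = order[len(assignment)]
    for color in colors:
        if not conflicts(node, color, neighbors, assignment):
            assignment[node] = color
            result = backtrack_color(neighbors, colors, order, assignment)
            if result is not None:
                return result
            del assignment[node]  # retraction: take the choice back
    return None


if __name__ == "__main__":
    # Path a - c - d - b, visited in the order a, b, c, d.
    graph = {"a": ["c"], "b": ["d"], "c": ["a", "d"], "d": ["c", "b"]}
    print(greedy_color(graph, [0, 1]))
    print(backtrack_color(graph, [0, 1]))
```

On this graph the greedy pass colors `a` and `b` the same and then has no legal color left for `d`; the backtracking version reaches the same dead end, undoes the choice for `b`, and finishes.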
The thing you didn't know you wanted to know: 'parallel beats sequential at equal tokens' is true on average but is really a statement about *what the architecture can and can't do* — autoregressive models are good at independent re-sampling and bad at retraction, so parallel voting plays to their strength and revision plays to their weakness. The frontier work isn't picking a winner; it's changing the substrate (latent trajectories, concept-token superposition, bidirectional diffusion) so a model can explore widely *and* refine in place without either tax.
Sources (8 notes)
Multiple independent reasoning paths with majority voting achieve up to 22% higher accuracy than extending a single chain under the same token budget. Parallel diversity samples reasoning capability more faithfully than sequential extension, which inflates variance without improving correctness.
Evidence from QwQ, R1, and LIMO shows most revisions retain wrong answers rather than correcting them. Smaller models frequently switch correct answers to incorrect during revision, and longer chains with more revisions correlate with lower accuracy.
On structured tasks requiring sequential multi-step reasoning like graph connectivity, chain-of-thought achieves exponentially higher accuracy than parallel voting. The difference emerges because solutions genuinely require accumulating intermediate results sequentially, which short parallel chains cannot achieve.
GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.
Soft Thinking is a training-free method that replaces discrete token selection with probability-weighted concept embeddings, preserving a superposition of reasoning paths. It improves accuracy by up to 2.48 points while reducing tokens by 22.4% via entropy-based early stopping.
ICE shows that bidirectional attention in diffusion LLMs enables in-place prompting—embedding reasoning directly in masked positions refined alongside answers. Answer confidence converges early while reasoning continues refining, allowing early-exit mechanisms to cut compute by 50% while maintaining accuracy.
The performance ceiling on constraint satisfaction problems is not a model-quality issue but an architectural limitation: autoregressive transformers cannot retract emitted tokens, while CSP solvers fundamentally depend on discarding invalid partial assignments. Symbolic solver integration works because it supplies what the architecture lacks.
Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.