What inference strategy works better than forcing self-revision under token constraints?
This explores whether there's a smarter way to spend a limited token budget than making a model loop back and revise its own answer — and the corpus says yes, with self-revision being one of the weaker bets.
This reads the question as: given a fixed token budget, is forcing a model to second-guess and rewrite its own reasoning actually the best use of those tokens? The corpus suggests it is often among the worst. Self-revision in o1-style models tends to *degrade* accuracy rather than improve it: across QwQ, R1, and LIMO, most revisions keep a wrong answer wrong, and smaller models frequently flip a correct answer to an incorrect one mid-revision. Worse, longer chains with more revision steps correlate with *lower* accuracy, so spending tokens on self-correction can be actively counterproductive (see "Does self-revision actually improve reasoning in language models?").
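One way to check the revision claim above on your own model is to tally what each revision does to answer correctness. The sketch below is illustrative only: the input format (pairs of before/after correctness flags) and the function name `tally_revisions` are assumptions, not something the cited note specifies.

```python
from collections import Counter

# Hypothetical labels for the four possible effects of one revision step.
_LABELS = {
    (False, False): "wrong_to_wrong",
    (False, True): "wrong_to_correct",
    (True, False): "correct_to_wrong",
    (True, True): "correct_to_correct",
}

def tally_revisions(pairs):
    """Count revision outcomes.

    pairs: iterable of (correct_before, correct_after) booleans,
    one pair per observed revision (assumed input format).
    """
    return Counter(_LABELS[(bool(before), bool(after))] for before, after in pairs)
```

If `wrong_to_wrong` and `correct_to_wrong` dominate the counts, revision is costing tokens without buying accuracy, which is the pattern the note reports.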
The more promising direction is to spend tokens on exploring multiple reasoning paths in parallel instead of committing to one and then patching it. Soft Thinking does exactly this: rather than picking a single discrete token at each step (and later having to revise that commitment), it keeps the model's probability distribution alive as a continuous 'concept token,' a probability-weighted mix of token embeddings that preserves a superposition of possible paths. The method is training-free, and the payoff is concrete: accuracy improves by up to 2.48 points *while tokens drop by 22.4%* through entropy-based early stopping. That is the inverted trade-off: better answers for fewer tokens, the opposite of revision's more-tokens-for-worse-answers (see "Can we explore multiple reasoning paths without committing to one token?").
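The two ingredients named above, a probability-weighted concept token and entropy-based early stopping, can be sketched in a few lines. This is a minimal illustration in pure Python, not the Soft Thinking implementation: the stopping rule (entropy below `threshold` for `patience` consecutive steps) and both parameter names are assumptions, since the note only says the stopping is entropy-based.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def concept_token(probs, embedding_table):
    """Probability-weighted average of token embeddings.

    embedding_table[i] is the embedding of vocabulary item i. Instead of
    feeding back the embedding of one sampled token, the next step receives
    this mixture, so no single path is committed to.
    """
    dim = len(embedding_table[0])
    return [
        sum(p * row[d] for p, row in zip(probs, embedding_table))
        for d in range(dim)
    ]

def should_stop(entropies, threshold, patience):
    """Assumed stopping rule: the last `patience` steps were all low-entropy."""
    if len(entropies) < patience:
        return False
    return all(h < threshold for h in entropies[-patience:])
```

With a toy three-word vocabulary, a distribution split evenly between the first two words yields an embedding halfway between them, and a run of confident (low-entropy) steps triggers the stop check so no further tokens are spent.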
There's a deeper reason this works, which is where the corpus gets interesting. Not all tokens carry equal weight. Only about 20% of tokens are high-entropy 'forking points' where the reasoning actually branches, and these carry most of the learning signal: RLVR primarily adjusts them, and training on them alone matches or exceeds full-gradient performance (see "Do high-entropy tokens drive reasoning model improvements?"). Independently, likelihood-preserving pruning shows that models rank tokens by functional importance, preferentially preserving the symbolic-computation steps and discarding grammar and meta-discourse first (see "Which tokens in reasoning chains actually matter most?"). So if a minority of tokens does the real work, a strategy that invests budget at the genuine decision points (like preserving the distribution at forks) beats one that burns tokens re-litigating an already-committed chain.
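To make the idea of forking points concrete, the sketch below ranks generation steps by the entropy of their next-token distribution and returns the highest-entropy fraction. The function names and the default `top_fraction=0.2` are illustrative assumptions that mirror the roughly 20% figure above; the cited note does not prescribe this exact selection procedure.

```python
import math

def token_entropy(probs):
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def forking_positions(step_distributions, top_fraction=0.2):
    """Return indices of the highest-entropy steps, in ascending order.

    step_distributions: one probability distribution per generated token
    (assumed input format). At least one position is always returned.
    """
    entropies = [token_entropy(p) for p in step_distributions]
    k = max(1, round(len(entropies) * top_fraction))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])
```

On a five-step chain where only one step is close to a coin flip, that step alone is returned. A budget-aware strategy would concentrate extra computation at those positions.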
There's also a principled ceiling on why self-revision can't save itself. Self-improvement is formally bounded by the generation-verification gap: a model can't reliably validate its own fixes without something external to check against, so metacognitive looping alone hits a wall (see "What stops large language models from improving themselves?"). This matches the broader finding that reflective fluency doesn't equal competence: DeepSeek-R1 and o1-preview reach only 20-23.6% exact match on 850 constraint-satisfaction problems that demand genuine backtracking, the exact thing revision is supposed to deliver (see "Can reasoning models actually sustain long-chain reflection?").
The thing you may not have known you wanted to know: the corpus reframes the whole 'inference strategy' question. Instead of treating reasoning tokens as meaningful steps that should be checked and corrected, several notes suggest they function more like *computational scaffolding*. Models trained on systematically irrelevant traces maintain solution accuracy and sometimes generalize better out of distribution (see "Do reasoning traces need to be semantically correct?"). If the trace is scaffolding rather than literal logic, then forcing the model to revise the *content* of that scaffolding is aimed at the wrong target, and parallel exploration that preserves uncommitted options is the better place to put your tokens.
Sources (7 notes)
Evidence from QwQ, R1, and LIMO shows most revisions retain wrong answers rather than correcting them. Smaller models frequently switch correct answers to incorrect during revision, and longer chains with more revisions correlate with lower accuracy.
Training-free method replaces discrete token selection with probability-weighted concept embeddings, preserving superposition of reasoning paths. Improves accuracy up to 2.48 points while reducing tokens 22.4% via entropy-based early stopping.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.
Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.
DeepSeek-R1 and o1-preview achieve only 20-23.6% exact match on 850 constraint satisfaction problems requiring genuine backtracking. This ceiling reveals that reflective reasoning fluency does not translate to actual problem-solving competence on unfamiliar instance structures.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.