Why do reasoning models reduce effort despite having token budget remaining?
This explores why reasoning models stop short — cutting their reasoning effort even when they still have tokens to spend — and what that says about how reasoning is learned and structured rather than simply metered out.
The corpus suggests the answer is structural, not arithmetic: a reasoning model's effort isn't a smooth function of available compute, so leftover budget doesn't translate into more useful thinking. The clearest framing comes from work on what looks like wanderlust: models 'explore like tourists, not scientists,' switching away from promising solution paths prematurely. The authors call this failure underthinking and treat it as distinct from running out of room (Why do reasoning models abandon promising solution paths?). Tellingly, the fix isn't more budget but a decoding-level thought-switching penalty that keeps the model on a path long enough to finish it. The viable solution was already reachable; the model bailed early.
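To make the idea of a decoding-level thought-switching penalty concrete, here is a minimal sketch. It is not the authors' implementation: the function name, the toy vocabulary, the `penalty` value, and the `min_steps` window are all illustrative assumptions. It shows only the general mechanism of down-weighting tokens that signal a change of approach while the current thought is still young.

```python
def apply_switch_penalty(logits, switch_token_ids, steps_in_thought,
                         penalty=3.0, min_steps=200):
    """Return a copy of `logits` with switch-cue tokens down-weighted.

    Hypothetical parameters: `penalty` is subtracted from each switch-cue
    logit, and only while the current thought is shorter than `min_steps`.
    """
    adjusted = list(logits)
    if steps_in_thought >= min_steps:
        return adjusted  # the thought has had enough room; do not interfere
    for token_id in switch_token_ids:
        adjusted[token_id] -= penalty
    return adjusted


def argmax(values):
    """Index of the largest value (greedy decoding on a toy vocabulary)."""
    return max(range(len(values)), key=lambda i: values[i])


# Toy vocabulary (assumed): 0 = "so", 1 = "therefore", 2 = "alternatively".
logits = [2.0, 1.0, 2.5]
early = apply_switch_penalty(logits, {2}, steps_in_thought=10)
late = apply_switch_penalty(logits, {2}, steps_in_thought=500)

print(argmax(logits), argmax(early), argmax(late))
```

Without the penalty, greedy decoding picks the switch cue; early in a thought the penalty pushes the choice back to a continuation token, and once the thought passes `min_steps` the original preference is restored.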
A second piece of the picture: more thinking can actively hurt, so a well-calibrated model has reason to quit. Accuracy is non-monotonic in thinking length: pushing thinking tokens from ~1,100 to ~16K dropped benchmark accuracy from 87.3% to 70.3%, with models overthinking easy problems and underthinking hard ones (Does more thinking time always improve reasoning accuracy?). If effort past a threshold degrades answers, reduced effort on an easy prompt is the right move. The pathology is not that the model is lazy with a full tank; it is that its internal sense of 'enough' is miscalibrated against actual difficulty.
This connects to how reasoning effort is distributed in the first place. The learning signal lives in a minority of tokens: only ~20% are high-entropy 'forking points' where the model actually decides something, and training on just those matches or exceeds full-gradient updates (Do high-entropy tokens drive reasoning model improvements?). Relatedly, models internally rank tokens by functional importance, preserving symbolic computation and pruning grammar and meta-discourse first (Which tokens in reasoning chains actually matter most?). If most tokens are scaffolding and the real work is a handful of decisions, then 'effort' isn't measured in budget consumed: a model can resolve a problem in a few pivotal tokens and have nothing productive left to spend the rest on. Budget remaining ≠ reasoning remaining.
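A small sketch of what "high-entropy forking points" means in practice: compute the Shannon entropy of the next-token distribution at each position and keep the top fraction. The function names, the toy distributions, and the exact selection rule (top 20% by entropy) are assumptions for illustration, not the cited work's procedure.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def forking_positions(distributions, fraction=0.2):
    """Indices of the highest-entropy positions, in sequence order.

    `fraction` is the share of positions to keep (assumed 20% here).
    """
    entropies = [token_entropy(d) for d in distributions]
    keep = max(1, round(fraction * len(entropies)))
    ranked = sorted(range(len(entropies)),
                    key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:keep])


# Five toy positions: only the second one is a real decision point.
distributions = [
    [1.00, 0.00, 0.00],
    [0.50, 0.50, 0.00],
    [0.90, 0.10, 0.00],
    [0.98, 0.02, 0.00],
    [0.99, 0.01, 0.00],
]
print(forking_positions(distributions))
```

In this toy sequence four of the five positions are near-deterministic, so a budget counted in tokens would overstate how much deciding is actually going on.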
The deepest version of this: reasoning effort and visible token generation may be decoupled entirely. Models can scale test-time compute in latent space without verbalizing steps (Can models reason without generating visible thinking tokens?), and transformers trained with hidden chain-of-thought tokens have been caught computing the correct answer in early layers, then overwriting it with format-compliant filler (Do transformers hide reasoning before producing filler tokens?). Corrupted or irrelevant traces can teach as well as correct ones, suggesting the visible chain is computational scaffolding more than meaningful reasoning (Do reasoning traces need to be semantically correct?). Under this view, a model reducing visible effort may have already finished thinking; the tokens were never where the reasoning happened.
What ties it together is that effort should be allocated by difficulty, not spent because it's available. Compute-optimal scaling shows that reallocating the same budget, with less for easy prompts and more for hard ones, beats uniform spending (Can we allocate inference compute based on prompt difficulty?), and training models under budgets that start generous and then tighten teaches them to compress effort deliberately (Does gradually tightening token budgets beat fixed budget training?). So the surprising takeaway: a reasoning model that quits with budget to spare may be doing exactly what good training rewards; the problem is only that its difficulty estimate is wrong. That is also why parallel sampling, which spreads the same budget across independent paths, so often beats grinding a single chain longer (Why does parallel reasoning outperform single chain thinking?).
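The two allocation ideas in this paragraph can be sketched in a few lines: split a fixed token budget across prompts in proportion to an estimated difficulty, and combine several independent answers by majority vote. Everything here is an illustrative assumption (the function names, the proportional rule, the difficulty scores); the cited work may use quite different estimators and policies.

```python
from collections import Counter


def allocate_budget(difficulties, total_tokens):
    """Split `total_tokens` across prompts in proportion to difficulty.

    `difficulties` are assumed non-negative scores from some estimator.
    Rounding leftovers go to the prompts with the largest fractional share.
    """
    weight_sum = sum(difficulties)
    if weight_sum <= 0:
        raise ValueError("at least one difficulty must be positive")
    raw = [total_tokens * d / weight_sum for d in difficulties]
    budgets = [int(r) for r in raw]
    leftover = total_tokens - sum(budgets)
    by_fraction = sorted(range(len(raw)),
                         key=lambda i: raw[i] - budgets[i], reverse=True)
    for i in by_fraction[:leftover]:
        budgets[i] += 1
    return budgets


def majority_vote(answers):
    """Most common final answer among independent reasoning paths."""
    return Counter(answers).most_common(1)[0][0]


budgets = allocate_budget([1, 1, 2], total_tokens=400)
winner = majority_vote(["42", "41", "42"])
print(budgets, winner)
```

Note that the allocation is only as good as the difficulty scores fed into it: a wrong estimate sends the budget to the wrong prompts, which is the miscalibration failure described above.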
Sources (10 notes)
Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.
Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
Research shows inference effectiveness varies dramatically by prompt difficulty. Reallocating the same total compute adaptively—giving easy prompts less and hard ones more—substantially outperforms larger models under uniform budgets.
Models trained with progressively tightening token budgets consistently achieve higher accuracy and better token efficiency than fixed-budget baselines. The approach works by separating learning into exploration (discovering strategies with generous budgets) and compression (distilling them under constraints).
Multiple independent reasoning paths with majority voting achieve up to 22% higher accuracy than extending a single chain under the same token budget. Parallel diversity samples reasoning capability more faithfully than sequential extension, which inflates variance without improving correctness.