What triggers overthinking versus underthinking in reasoning models?
This explores what actually flips a reasoning model between two opposite failure modes — burning tokens redundantly (overthinking) versus bailing on good ideas too early (underthinking) — and whether the triggers are about problem difficulty, training, or something in the decoding itself.
The corpus suggests the trigger is less about how much the model thinks and more about *calibration*: whether the model's confidence and its sense of when to stop match the difficulty of the problem in front of it.
The sharpest single finding is that the two failures map onto problem difficulty in opposite directions: models overthink easy problems and underthink hard ones (see "Does more thinking time always improve reasoning accuracy?"). Accuracy isn't monotonic in thinking length; it peaks at a task-specific token count and then falls off a cliff (one study clocks 87.3% down to 70.3% as tokens climb from ~1,100 to ~16,000), because extended thinking starts inflating output variance and introducing self-revision errors rather than fixing anything (see "When does thinking too much actually hurt reasoning?"). So overthinking isn't just wasted compute: past a threshold it actively corrupts a correct answer.
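To make the non-monotonic relationship concrete, here is a minimal Python sketch that finds the best-performing thinking budget in a set of measurements and reports how far accuracy falls at a larger budget. The function names are hypothetical, and the only data points are the two figures quoted above; a real sweep would need many more budgets to locate the actual peak.

```python
from typing import List, Tuple

Measurement = Tuple[int, float]  # (thinking tokens, accuracy in percent)


def peak_budget(measurements: List[Measurement]) -> Measurement:
    """Return the (tokens, accuracy) pair with the highest accuracy."""
    return max(measurements, key=lambda pair: pair[1])


def accuracy_drop(measurements: List[Measurement], budget: int) -> float:
    """Percentage points lost at `budget` relative to the best measured budget."""
    _, best_accuracy = peak_budget(measurements)
    accuracy_at_budget = dict(measurements)[budget]
    return round(best_accuracy - accuracy_at_budget, 1)


# The two points reported in the text; token counts are approximate there.
reported = [(1_100, 87.3), (16_000, 70.3)]

print(peak_budget(reported))            # best measured budget
print(accuracy_drop(reported, 16_000))  # points lost at the larger budget
```

With only two points this says nothing about where the true peak sits; it only shows that the larger budget is the worse of the two.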
Underthinking has a more mechanical trigger: premature thought-switching. Models abandon promising reasoning paths mid-exploration, scattering tokens across incomplete approaches and exploring "like tourists, not scientists" (see "Why do reasoning models abandon promising solution paths?"). Strikingly, you can fix this at decoding time without retraining: a penalty on thought-transition tokens discourages the bailing and improves accuracy on hard math (see "Do reasoning models switch between ideas too frequently?"). That points to confidence as the underlying dial: when a model can't commit, it switches; when it's overconfident, it pads. ReBalance reads confidence variance and overconfidence directly as diagnostic signals, then applies training-free steering to suppress redundancy during overthinking and push exploration during underthinking (see "Can confidence patterns reveal overthinking versus underthinking?").
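The decoding-time fix can be pictured as a small adjustment to the next-token logits. The sketch below is illustrative only: the function name, the set of "switch" token IDs, the penalty size, and the idea of limiting the penalty to an early window of each thought are all assumptions made for this example, not details taken from the notes.

```python
from typing import Dict, Set


def penalize_thought_switches(
    logits: Dict[int, float],
    switch_token_ids: Set[int],
    tokens_in_current_thought: int,
    penalty: float = 3.0,  # assumed value, chosen only for illustration
    window: int = 600,     # assumed value: how long a thought stays "protected"
) -> Dict[int, float]:
    """Lower the logits of thought-switch tokens while the current thought is young.

    `logits` maps token id -> raw score for the next token. Tokens in
    `switch_token_ids` (e.g. the id of a word like "Alternatively") are
    penalized only during the first `window` tokens of the current thought.
    """
    if tokens_in_current_thought >= window:
        return dict(logits)  # thought is old enough; leave scores untouched
    return {
        token: (score - penalty if token in switch_token_ids else score)
        for token, score in logits.items()
    }


# Toy example: token 1 stands in for a thought-switch marker.
raw = {0: 2.0, 1: 2.5, 2: 0.1}
early = penalize_thought_switches(raw, {1}, tokens_in_current_thought=10)
late = penalize_thought_switches(raw, {1}, tokens_in_current_thought=600)
print(max(raw, key=raw.get), max(early, key=early.get), max(late, key=late.get))
```

In the toy example the switch token would win greedy decoding on the raw scores, loses while the thought is young, and wins again once the window has passed. The model weights are never touched, which is what makes the intervention training-free.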
Two deeper triggers sit underneath. First, training quality, not quantity, decides whether thinking even helps: vanilla models use "thinking mode" to induce self-doubt that *degrades* performance, and RL training reverses the very same mechanism into productive gap analysis (see "Does extended thinking help or hurt model reasoning?"). Second, models lack a stop signal entirely for ill-posed inputs: given a question with a missing premise, reasoning models spiral into long redundant chains while non-reasoning models simply flag it as unanswerable. Training optimizes for *producing* reasoning steps but never teaches *when to disengage* (see "Why do reasoning models overthink ill-posed questions?").
The unsettling thread, if you want to pull it: longer reasoning chains aren't just inefficient, they're a liability surface. Each extra step is another intervention point where a single corrupted step propagates, which is why reasoning models are *more* vulnerable to manipulative multi-turn prompts than plain models, losing 25–29% accuracy (see "Why do reasoning models fail under manipulative prompts?"). And there's a measurement angle worth knowing exists: a "deep-thinking ratio" tracks the share of tokens whose predictions get significantly revised across model layers, distinguishing genuine reasoning effort from the appearance of it (see "Can we measure how deeply a model actually reasons?"). It is useful precisely because the visible length of a reasoning trace tells you almost nothing about whether real work is happening.
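As a rough illustration of the deep-thinking-ratio idea, the Python sketch below counts a token as "deep" when its predicted token only settles on the final answer late in the layer stack. This criterion, the `depth_fraction` threshold, and the function names are assumptions made for the example; the actual metric may define "significant revision" differently. The per-layer predictions are assumed to come from reading out intermediate layers through the output head.

```python
from typing import List, Sequence


def settle_layer(layer_preds: Sequence[int]) -> int:
    """Earliest layer index from which the prediction equals the final one and stays there."""
    final = layer_preds[-1]
    layer = len(layer_preds) - 1
    while layer > 0 and layer_preds[layer - 1] == final:
        layer -= 1
    return layer


def deep_thinking_ratio(
    per_token_layer_preds: List[Sequence[int]],
    depth_fraction: float = 0.5,  # assumed threshold, not from the source
) -> float:
    """Fraction of tokens whose prediction settles at or beyond `depth_fraction` of the depth.

    Each inner sequence holds one token's top-1 prediction at every layer
    (at least two layers, shallowest first).
    """
    if not per_token_layer_preds:
        return 0.0
    deep = sum(
        1
        for preds in per_token_layer_preds
        if settle_layer(preds) >= depth_fraction * (len(preds) - 1)
    )
    return deep / len(per_token_layer_preds)


# Four tokens, four layers each (token ids are made up for the example).
trace = [
    [7, 7, 7, 7],  # settled from the first layer: shallow
    [3, 5, 9, 9],  # settles at layer 2: deep
    [1, 4, 4, 4],  # settles at layer 1: shallow
    [2, 2, 8, 6],  # changes at the last layer: deep
]
print(deep_thinking_ratio(trace))
```

Two of the four toy tokens settle late, so the ratio is 0.5 whatever the length of the trace, which is the property that separates this kind of measure from raw token count.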
Sources (9 notes)
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
Empirical studies demonstrate non-monotonic scaling in test-time reasoning: accuracy peaks at a critical thinking-token count, then declines sharply (87.3% to 70.3% as tokens scale from 1,100 to 16,000). Extended thinking inflates output variance and introduces self-revision errors rather than improving solution quality.
Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.
o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.
ReBalance uses confidence variance and overconfidence as diagnostic signals to apply training-free steering vectors that reduce overthinking redundancy while promoting exploration during underthinking, improving accuracy across models from 0.5B to 32B parameters.
Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.
Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.
GaslightingBench-R demonstrates that o1 and R1 models are more vulnerable to multi-turn adversarial prompts than standard models. Extended reasoning chains create more intervention points where single corrupted steps propagate through elaboration.
Deep-thinking ratio (DTR) measures the proportion of tokens whose predictions undergo significant revision across model layers, correlating robustly with accuracy across AIME, HMMT, and GPQA benchmarks. Think@n, a test-time strategy using DTR, matches self-consistency performance while reducing inference costs.