Does more thinking time always improve reasoning accuracy?
Explores whether extending a model's thinking tokens steadily improves performance, or whether there is a point beyond which additional reasoning becomes counterproductive.
The prevailing assumption that "more thinking tokens = better reasoning" is empirically false beyond a critical point. Pushing the average thinking token count from ~1,100 to ~15,980 reduced accuracy from 87.3% to 70.3% on the same benchmark.
This non-monotonic relationship — initial improvement followed by steady decline — is consistent across multiple tasks and datasets. The researchers call the degradation phase "overthinking," and it has been largely invisible in prior work because most studies only reported the improving phase of the curve.
The practical implication: there is a sweet spot, and token budgets above it actively harm performance. The current practice of using "more tokens" as a proxy for "more reasoning" is not just wasteful; it is counterproductive past the threshold. And because extended thinking may widen output variance rather than improve reasoning (see "Does extended thinking actually improve reasoning or just increase variance?"), the gains before the threshold may not be what they appear to be.
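A minimal sketch of how the sweet spot could be read off a token-budget sweep. The function names and the sweep format (pairs of average thinking tokens and accuracy) are assumptions made for illustration. The only data used are the two points quoted in this note, so a real sweep would need many more budgets to locate the peak with any precision.

```python
from typing import List, Tuple

# Each entry: (average thinking tokens, accuracy in percent).
Sweep = List[Tuple[int, float]]


def find_sweet_spot(sweep: Sweep) -> Tuple[int, float]:
    """Return the point with the highest accuracy; ties go to the smaller budget."""
    # sorted() orders by token count, and max() keeps the first maximal item,
    # so the cheapest budget wins when accuracies are equal.
    return max(sorted(sweep), key=lambda point: point[1])


def overthinking_region(sweep: Sweep) -> Sweep:
    """Return budgets above the sweet spot whose accuracy is lower than the peak."""
    best_tokens, best_acc = find_sweet_spot(sweep)
    return [(t, a) for t, a in sorted(sweep) if t > best_tokens and a < best_acc]


# The two measurements quoted in this note.
sweep = [(1100, 87.3), (15980, 70.3)]
print(find_sweet_spot(sweep))
print(overthinking_region(sweep))
```

With only two measured budgets, the sketch can say that the larger budget lies in the overthinking region, but not where the true peak sits.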
The bidirectional calibration failure (Between Underthinking and Overthinking): The relationship is not just non-monotonic; models miscalibrate in both directions. Within the range of easy questions, models often detect increases in difficulty and extend their reasoning accordingly, although their outputs there are still longer than necessary. For hard questions beyond their capability, models underthink: they fail to recognize the difficulty or lack the knowledge to respond effectively, and produce responses shorter than needed. The result: models overthink easy problems (generating unnecessarily long outputs) and underthink hard ones (failing to extend reasoning when it is most needed).
Length-based preference optimization provides a surprising intervention: fine-tuning to prefer shorter responses — using only unlabeled data, without ground-truth labels — maintains relatively strong accuracy while reducing token length. The reduction is disproportionately from incorrect responses (which are significantly longer), but 10-25% reduction on correct responses is also observed. This suggests models have latent ability to calibrate difficulty for easy problems but retain an overthinking tendency that preference optimization can reduce.
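One way the length-based preference data described above might be assembled is sketched below. The pairing rule, the `min_gap` parameter, and the use of whitespace-split word counts as a stand-in for token counts are all assumptions; the note does not specify how the pairs were built or which preference-optimization algorithm consumed them. No correctness labels are used, in line with the unlabeled setting.

```python
from itertools import combinations
from typing import Dict, List


def build_length_preference_pairs(
    prompt: str, responses: List[str], min_gap: int = 1
) -> List[Dict[str, str]]:
    """Pair sampled responses so that the shorter one is 'chosen'.

    Length is measured in whitespace-separated words (an assumption;
    a real pipeline would likely count tokenizer tokens).
    """
    pairs = []
    for a, b in combinations(responses, 2):
        len_a, len_b = len(a.split()), len(b.split())
        # Skip pairs whose lengths are too close to express a preference.
        if abs(len_a - len_b) < min_gap:
            continue
        chosen, rejected = (a, b) if len_a < len_b else (b, a)
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


# Hypothetical sampled responses for a single prompt.
samples = ["2 + 2 = 4", "Let me think. 2 + 2 = 4. Checking again: yes, 4."]
for pair in build_length_preference_pairs("What is 2 + 2?", samples):
    print(len(pair["chosen"].split()), "<", len(pair["rejected"].split()))
```

The resulting prompt/chosen/rejected records follow the layout commonly used by preference-optimization trainers, which is why that shape was picked here.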
PI framework: the attention-level mechanism behind the threshold. The PI (Test-time Prompt Intervention) framework explains why the threshold exists. Visualizing attention maps across reasoning steps reveals that verification and backtracking steps (e.g., steps 7-8 in a typical trace) receive minimal subsequent attention: the model generates them but barely reads them. After the step that produces the correct answer, all following steps attend predominantly to that pivotal step rather than to intermediate verification. Keeping only the critical steps (those whose predecessors all receive high attention) reproduces the reasoning with 75% fewer steps. This turns the behavioral observation (accuracy degrades with more tokens) into a mechanistic explanation: redundant tokens are attention-invisible, contributing neither signal nor structure to the final answer. The overthinking region is precisely where token generation has detached from the attention graph that actually drives outputs. Source: Prompts Prompting.
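The idea of steps that are generated but barely read can be illustrated with a small sketch. It assumes a step-level attention matrix has already been aggregated from token-level attention; that aggregation, the 0.05 threshold, and the toy values are all hypothetical and are not taken from the PI paper.

```python
from typing import List


def incoming_attention(step_attn: List[List[float]]) -> List[float]:
    """Mean attention each step receives from the steps that come after it.

    step_attn[i][j] is the attention mass step i places on earlier step j (j < i).
    """
    n = len(step_attn)
    received = []
    for j in range(n):
        later = [step_attn[i][j] for i in range(j + 1, n)]
        received.append(sum(later) / len(later) if later else 0.0)
    return received


def low_attention_steps(step_attn: List[List[float]], threshold: float = 0.05) -> List[int]:
    """Indices of steps whose mean incoming attention falls below the threshold."""
    received = incoming_attention(step_attn)
    # The last step has no successors, so it is never flagged.
    return [j for j, r in enumerate(received[:-1]) if r < threshold]


# Toy 4-step trace (hypothetical values): step 1 is a verification step
# that later steps almost ignore.
toy_attn = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.98, 0.02, 0.0, 0.0],
    [0.60, 0.02, 0.38, 0.0],
]
print(low_attention_steps(toy_attn))
```

Steps flagged this way are candidates for pruning, in the spirit of the attention-invisible tokens described above.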
An optimal reasoning-token ratio exists, but models cannot reach it. ZebraLogic's analysis of constraint satisfaction problems shows that there is an optimal ratio of reasoning tokens to problem complexity (measured by Z3 solver conflicts). o1-like models scale reasoning tokens with complexity and approach this optimal ratio for moderate problems, but cannot reach it when complexity is extremely high: the reasoning-effort ceiling is below what the problem requires. Self-verification prompting provides only marginal improvement (31.7% → 33.0%, then 32.1% on the second iteration), suggesting the bottleneck is not insufficient verification but insufficient reasoning depth. The optimal-ratio finding quantifies the threshold: the sweet spot is not just "not too many tokens" but a specific relationship between problem difficulty and reasoning budget.
S1-Bench (2025) reveals that LRMs can prejudge question simplicity — especially in Chinese — but thinking length does NOT shorten despite this prejudgment. Models generate unnecessary solution rounds after reaching the correct answer, repeatedly reverifying simple problems already solved. Models with longer thinking processes produce more excessive solution rounds. Furthermore, LRMs sometimes include incorrect intermediate conclusions in their reasoning even when ultimately reaching correct final answers, and sometimes reach the correct answer during reasoning but then deviate to produce incorrect final conclusions. The prejudgment finding is architecturally important: it suggests the overthinking mechanism is not caused by inability to assess difficulty, but by an inability to act on that assessment — the model "knows" the problem is simple but cannot truncate its reasoning accordingly. Source: Arxiv/Evaluations.
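The behaviors described above (redundant rounds after the answer is found, wrong intermediate conclusions, and deviation after a correct answer) can be counted with a short sketch. It assumes the thinking trace has already been split into solution rounds with one extracted answer per round; that segmentation step and the function name are hypothetical and are not S1-Bench's own tooling.

```python
from typing import Dict, List


def analyze_solution_rounds(
    round_answers: List[str], final_answer: str, gold: str
) -> Dict[str, object]:
    """Summarize redundancy and deviation in a single reasoning trace."""
    # Index of the first round that already reaches the gold answer, if any.
    first_correct = next(
        (i for i, answer in enumerate(round_answers) if answer == gold), None
    )
    redundant = 0 if first_correct is None else len(round_answers) - first_correct - 1
    return {
        "first_correct_round": first_correct,
        # Rounds generated after the correct answer was already available.
        "redundant_rounds": redundant,
        # At least one round concluded with something other than the gold answer.
        "had_wrong_intermediate": any(answer != gold for answer in round_answers),
        # The trace reached the gold answer but the final answer is wrong.
        "deviated_after_correct": first_correct is not None and final_answer != gold,
    }


# Hypothetical trace: the model solves a simple question in round 0,
# then re-verifies it twice.
print(analyze_solution_rounds(["4", "4", "4"], final_answer="4", gold="4"))
```

Exact string matching is used for simplicity; a real evaluation would need answer normalization.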
S1-Bench's architectural deepening: difficulty is linearly probeable from hidden states; the failure is action, not perception. The full S1-Bench study (28 LRMs across multi-domain, multilingual, model-simple questions) goes beyond the prejudgment-but-no-truncation observation. Using DS-R1-1.5B and DS-R1-7B as representative cases, a single-layer MLP trained on the final-layer hidden state of the last token in the encoded question predicts question difficulty with monotonically increasing accuracy as difficulty rises. The structure is already there: implicit, linear, and decodable without specialized probes. Yet behaviorally, LRMs still produce redundant solution rounds with higher average token entropy on the same questions the probe correctly classifies as easy. The authors interpret this as architectural self-doubt: the model perceives simplicity, then second-guesses its own perception, leading to exploratory generation that overrides the implicit difficulty signal. This localizes the failure to the perception-to-action interface, not to representational capacity or difficulty assessment. The probe-vs-behavior gap is the diagnostic; it predicts that mechanistic interventions routing generation through the difficulty representation should outperform prompt-engineered "answer briefly" instructions, which target the wrong layer. Source: Reasoning Methods CoT ToT.
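A minimal sketch of what a difficulty probe looks like in principle. It is not the S1-Bench implementation: it uses a logistic-regression probe (a single linear layer with a sigmoid) in pure Python, a binary easy/hard label, and toy 2-dimensional vectors standing in for the final-layer hidden state of the last question token. All names and values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def train_linear_probe(
    states: Sequence[Sequence[float]],
    labels: Sequence[int],
    lr: float = 0.5,
    epochs: int = 200,
) -> Tuple[List[float], float]:
    """Fit weights and bias with plain stochastic gradient descent on log loss."""
    w = [0.0] * len(states[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(states, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of "hard"
            grad = p - y                    # gradient of log loss w.r.t. z
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def predict_hard(w: Sequence[float], b: float, x: Sequence[float]) -> bool:
    """True if the probe classifies the hidden state as a hard question."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b > 0


# Toy stand-ins for hidden states (hypothetical): label 0 = easy, 1 = hard.
toy_states = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = [0, 0, 1, 1]
w, b = train_linear_probe(toy_states, toy_labels)
print([predict_hard(w, b, x) for x in toy_states])
```

In a real setting the inputs would be hidden states extracted from the model, and the probe would be evaluated on held-out questions. The diagnostic in the note is the gap between what such a probe can decode and how the model actually behaves.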
Inquiring lines that use this note as a source 143
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- What other hidden biases might aggregate metrics fail to distinguish from reasoning?
- Why do models commit to answers early on easy versus hard tasks?
- Does higher cognitive load on social media increase engagement?
- Why does scaling reasoning tokens fail to improve unfamiliar tasks?
- What distinguishes genuine reasoning activation from memorization-assisted answer recall?
- What is the critical thinking token threshold beyond which accuracy degrades?
- How do verbose and concise reasoning occupy different regions in activation space?
- When should action deliberation trigger during reasoning steps?
- What makes training-free approaches like Soft Thinking preferable to SoftCoT?
- Do high-influence thoughts align with SAND deliberation triggers?
- How do thinking tokens exhibit diminishing returns beyond a critical threshold?
- Can proactive critical thinking alone enable models to request clarification effectively?
- How does optimizing for accuracy during training degrade downstream reasoning quality?
- Does training for better reasoning reduce an AI system's ability to abstain?
- How does critique fine-tuning on one problem unlock broader reasoning?
- Does parallel thinking benefit disproportionately from higher inference throughput architectures?
- Does the timing of AI feedback relative to user reasoning change its effectiveness?
- How do we measure the cognitive flow cost of different intervention strategies?
- Do explicit reasoning chains improve or harm performance on complex judgment tasks?
- Why do top performers produce shorter chains of thought in their strongest domains?
- Do tokens beyond a critical threshold actually improve reasoning quality?
- Can extended thinking genuinely improve reasoning or just increase variance?
- Why do more capable models prefer shorter chains of thought?
- Can budget-tightening curricula improve reasoning efficiency more than fixed budgets?
- Why do models show performative reasoning on easy tasks but genuine reasoning on hard ones?
- Does explicit reasoning help or hurt tasks requiring continuous nuanced judgment?
- Why does fine-tuning degrade reasoning quality even as accuracy improves?
- Does thinking-token overuse actually degrade reasoning accuracy in practice?
- Why do models automatically adjust reasoning length to problem difficulty?
- What triggers overthinking versus underthinking in reasoning models?
- What determines the optimal thinking token threshold for a given task?
- Can parallel thinking outperform sequential thinking under the same token budget?
- Why does reasoning accuracy degrade beyond a critical thinking token threshold?
- How does difficulty level change whether extended thinking provides genuine reasoning signal?
- How do thinking tokens function as mutual information peaks in reasoning?
- When does explicit reasoning actually degrade performance on a task?
- How much reasoning catalyst data is actually needed for improvement?
- Do self-revision tokens measurably degrade reasoning accuracy in scaled models?
- Can extended reasoning training capture individual strategic thinking styles?
- Why do simple math problems get worse with longer reasoning chains?
- Why does step-by-step reasoning degrade performance on judgment-based tasks?
- How can judges evaluate thinking without seeing the actual thoughts?
- What role does confidence play in balancing overthinking versus underthinking?
- Does reasoning structure match explicit versus implicit task demands?
- How does evaluation format change what we measure about model reasoning?
- What reasoning token threshold marks the accuracy degradation point?
- How does overthinking in early turns degrade later retrieval rounds?
- Does chain-of-thought reasoning amplify bullshit or just make it more visible?
- Why does reasoning effort fail to improve theory of mind performance?
- Does distillation from reasoning models spread overthinking to smaller models?
- How does proactive critical thinking enable models to identify missing information?
- How should inference-time token budgets vary across models of different capability levels?
- Why does extended thinking increase output variance without improving reasoning quality?
- Are difficult tasks more monitorable because reasoning externalization becomes necessary?
- Why does parallel thinking outperform sequential thinking under the same token budget?
- Why does parallel thinking outperform sequential thinking with equal tokens?
- How much of a model's reasoning tokens are unnecessary for reaching the final answer?
- Does deep-thinking ratio measure computational effort better than chain-of-thought length?
- Can extended deliberation in agents become counterproductive like human overthinking?
- Does thought consolidation address the confirmatory reflection problem in reasoning models?
- Why does inference-time thinking hurt proactive critical thinking in vanilla models?
- What happens to reasoning accuracy when models use more thinking tokens?
- Why do reasoning models reduce effort despite having token budget remaining?
- Can models trained on longer contexts develop better fundamental reasoning abilities?
- Why does parallel thinking outperform sequential thinking under token limits?
- What makes reasoning models worse at understanding people?
- What three factors actually drive chain of thought performance improvements?
- How much does test-time compute improve reasoning without more tokens?
- Can training improve reasoning coherence without improving actual correctness?
- Why does revision often make reasoning accuracy worse in frontier models?
- What distinguishes coherent reasoning from inaccurate but plausible predictions?
- How do reasoning training methods sacrifice some thinking skills while improving others?
- How do reward models benefit from extended thinking during evaluation scoring?
- Can activation-space steering vectors replicate thinking model performance without retraining?
- How does chain-of-thought length affect attention to constraint tokens?
- Can a single correct example seed exponential improvement in mathematical reasoning?
- Do longer reasoning traces actually improve theory of mind accuracy?
- Does explicit reasoning help or hurt tasks requiring continuous judgment?
- Why does increasing reasoning not improve AI social reasoning performance?
- How do reasoning improvements suppress a model's ability to abstain?
- How does soft thinking compare to sampling multiple independent reasoning paths?
- Can reasoning evaluation metrics reward actual reasoning instead of theater?
- How does extended thinking affect variance in reasoning model outputs?
- Can intrinsic confidence signals improve both calibration and reasoning performance?
- When should a system choose extended thinking versus quick responses?
- How should timing for reasoning intervention be determined during inference?
- Why do some reasoning steps receive negligible attention from later steps?
- Does internal self-revision actually degrade reasoning accuracy in models?
- Why do benchmark scores rise while reasoning quality declines?
- Does chain-of-thought reasoning help or hurt social reasoning tasks?
- Does the thinking box provide genuine reasoning or just token budget?
- Do reasoning failures stem from strategy or from calculation breakdown?
- How much does extended thinking actually improve model reasoning ability?
- Does penalizing thought transitions improve reasoning without model retraining?
- How does reasoning accuracy degrade when token budgets exceed critical thresholds?
- Why does additional reasoning effort not improve theory of mind performance?
- Can reasoning scaffolds help with nuanced judgment tasks like empathy?
- Does the answer stage perform substantial reasoning beyond the thinking draft?
- Why does representation recycling of MI-peak tokens improve reasoning accuracy?
- Can thinking token density explain reasoning performance beyond total length?
- Can we improve reasoning by amplifying information at mutual information peaks?
- Do shorter correct reasoning traces contain more thought anchors than longer ones?
- Can structured prompts reduce reasoning steps while improving financial accuracy?
- Does reasoning training actively undermine the abstention capacity safety training created?
- Why does concise reasoning maintain accuracy with far fewer tokens?
- Can runtime confidence signals detect when reasoning has crossed the overthinking threshold?
- Does more thinking always improve language model accuracy?
- Does task difficulty alone determine how many thinking tokens a model should use?
- What happens to model reasoning accuracy as thinking token requirements exceed critical thresholds?
- Can operationalizing theory into prompt structure improve reasoning more than theory itself?
- What other internal model decisions beyond attention could be optimized directly?
- Why does reasoning volume fail to improve theory of mind performance?
- How does backward reasoning during training improve forward reasoning capability?
- What causes reasoning quality to degrade during long research tasks?
- Can benchmark improvements hide degradation of deliberative reasoning?
- Can thought quality alone be trusted to guide model training?
- What makes thinking tokens carry more information than other tokens?
- Can models internally identify which tokens matter most for reasoning?
- Can a single model implement fast thinking, slow thinking, and tool use?
- How much can externalized skills improve models before hitting diminishing returns?
- How do timing and search internalization interact during reasoning post-training?
- Can conditioning generation on difficulty probes reduce overthinking on simple tasks?
- What makes token-level reasoning during pretraining different from test-time chain-of-thought?
- Does chain-of-thought accuracy degrade with longer reasoning traces?
- Can format adaptation alone explain why reasoning enrichment improves instruction following?
- Why does uniform averaging across all tokens dilute the reasoning signal?
- Why does parallel thinking outperform sequential thinking under fixed token budgets?
- What inference-time scaling benefits emerge from reasoning before each prediction?
- Can distillation from stronger models create genuinely new reasoning abilities?
- Why do thinking models execute longer tasks than standard language models?
- How does active reasoning through interaction differ from passive single-turn problem solving?
- Can structured questioning prompts improve reasoning beyond standard conversational training?
- How do frontier models maintain agreement scores above 90 percent across reasoning tasks?
- What distinguishes metacognitive regulation from standard chain-of-thought reasoning?
- How do thought actions represent policy improvement steps in practice?
- What role does task structure play in rewarding delayed thinking?
- Can we predict when a model will develop thinking behaviors?
- Why does extended reasoning training improve exploration without adding new capabilities?
- Is premature decision-making a form of underthinking in transformer models?
- Why does reasoning backward enable better forward reasoning performance?
- Can indirect and direct reasoning methods be combined to improve results?
- Do models genuinely reason harder on difficult tasks or just appear to?
- How does early commitment in reasoning differ from early exploitation in planning?
Related concepts in this collection 11
This note in its neighbourhood — explore the map, then jump to a related concept in the list below.
-
Does extended thinking actually improve reasoning or just increase variance?
When models think longer, do they reason better, or do they simply sample from a wider distribution of outputs that happens to cover correct answers more often? This matters because it determines whether test-time compute is genuinely scaling reasoning capability.
the mechanistic explanation for why this threshold exists
-
Why does parallel reasoning outperform single chain thinking?
Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth.
the alternative strategy that avoids the overthinking trap
-
Why do correct reasoning traces contain fewer tokens?
In o1-like models, correct solutions are systematically shorter than incorrect ones for the same questions. This challenges assumptions that longer reasoning traces indicate better reasoning, and raises questions about what length actually signals.
supporting evidence from a different angle
-
Do reasoning models switch between ideas too frequently?
Research explores whether o1-like models abandon promising reasoning paths prematurely by switching to different approaches without sufficient depth, and whether penalizing such transitions could improve accuracy.
the complementary failure mode: insufficient depth per path, not just excessive total tokens
-
Can dialogue planning balance fast responses with strategic depth?
Can a system use quick instinctive responses for familiar conversation contexts while activating deeper planning only when uncertainty demands it? This explores whether adaptive computation improves dialogue goal-reaching.
DPDP naturally avoids the overthinking threshold by restricting deep search (MCTS) to genuinely uncertain contexts via System 1/2 switching
-
Do personality types shape how AI agents make strategic choices?
This research explores whether priming LLM agents with MBTI personality profiles causes them to adopt different strategic behaviors in games. Understanding this matters for designing AI systems optimized for specific tasks.
personality priming modulates reasoning depth: Introversion produces longer, more elaborated rationales, potentially lowering the threshold at which overthinking degrades accuracy; personality conditioning is an unexamined variable in test-time compute allocation
-
When should retrieval happen during model generation?
Explores whether retrieval should occur continuously, at fixed intervals, or only when the model signals uncertainty. Standard RAG retrieves once; long-form generation requires dynamic triggering based on confidence signals.
retrieval-level analog: just as reasoning tokens past the threshold harm accuracy, retrieval at every step regardless of confidence wastes context and introduces noise; both findings argue for uncertainty-gated resource allocation rather than fixed budgets
-
Do large language models use one reasoning style or many?
Explores whether LLMs share a universal strategic reasoning approach or develop distinct styles tailored to specific game types. Understanding this matters for predicting model behavior in competitive versus cooperative scenarios.
cross-domain confirmation: in strategic games, top performers produce shortest CoT in their strongest game types while DeepSeek-R1 exhibits "repeated self-doubt" loops in competitive games that inflate tokens without improvement — the overthinking threshold extends to interactive reasoning
-
Why do reasoning models struggle with theory of mind tasks?
Extended reasoning training helps with math and coding but not social cognition. We explore whether reasoning models can track mental states the way they solve formal problems, and what that reveals about the structure of social reasoning.
the overthinking threshold is categorically worse for social reasoning: reasoning effort shows zero or negative correlation with ToM performance, meaning extended thinking actively degrades social cognition rather than merely plateauing; social tasks may have a near-zero optimal thinking threshold
-
Can we measure how deeply a model actually reasons?
What if reasoning quality isn't about length or confidence, but about how much a model's predictions shift across its internal layers? Can tracking these shifts reveal genuine thinking versus pattern-matching?
DTR explains the threshold mechanistically: tokens past the threshold should have low DTR (early layer stabilization = pattern-matching filler rather than genuine computation); Think@n provides a selection mechanism that avoids the overthinking region
-
Can models recognize question difficulty before they reason?
Do reasoning language models encode implicit knowledge of problem difficulty in their hidden states, even before generating solution steps? And if so, why don't they act on this knowledge?
the architectural mechanism behind S1-Bench's prejudgment-but-no-truncation finding: difficulty is implicitly encoded but generation overrides it
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Does Thinking More always Help? Understanding Test-Time Scaling in Reasoning Models
- Mining Hidden Thoughts from Texts: Evaluating Continual Pretraining with Synthetic Data for LLM Reasoning
- Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
- SSRL: Self-Search Reinforcement Learning
- Demystifying Reasoning Dynamics with Mutual Information: Thinking Tokens are Information Peaks in LLM Reasoning
- Rethinking Thinking Tokens: LLMs as Improvement Operators
- S1-Bench: A Simple Benchmark for Evaluating System 1 Thinking Capability of Large Reasoning Models
Original note title
reasoning accuracy degrades beyond a critical thinking-token threshold