Does more thinking time actually improve LLM reasoning?
The intuition that extended thinking helps LLMs reason better seems obvious, but what does the empirical data actually show when we test it directly?
The "more thinking = better reasoning" assumption drives major product and research decisions: model releases tout extended thinking modes, inference infrastructure is built around longer traces, and researchers benchmark scaling behavior assuming monotonic improvement. But the assumption is directly falsifiable with a controlled experiment, and the data falsifies it.
As thinking length grows from ~1,100 to ~16,000 tokens, accuracy drops from 87.3% to 70.3%, a loss of 17 percentage points. The relationship is non-monotonic: beyond a threshold, more tokens actively hurt.
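To put the two reported endpoints on a common scale, here is a minimal Python sketch of the arithmetic. It uses only the figures quoted above; the token counts are approximate in the source, so the multiplier is approximate too, and the variable names are illustrative.

```python
# Reported endpoints from the note: (thinking tokens, accuracy in %).
# Token counts are approximate ("~") in the source.
short_tokens, short_acc = 1_100, 87.3
long_tokens, long_acc = 16_000, 70.3

# Absolute drop, in percentage points.
drop_pp = short_acc - long_acc

# Drop relative to the accuracy at the shorter trace length.
relative_drop = drop_pp / short_acc

# How many times more tokens the longer trace spends.
token_multiplier = long_tokens / short_tokens

print(f"absolute drop: {drop_pp:.1f} percentage points")
print(f"relative drop: {relative_drop:.1%}")
print(f"token multiplier: {token_multiplier:.1f}x")
```

Read this way, roughly fourteen times the thinking budget buys about a fifth less accuracy between these two points. The sketch says nothing about where the peak of the curve sits, since only the endpoints are given here.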
What makes this a myth rather than just an approximation is not that the assumption fails at the edges. The assumption was never justified by evidence: it was inferred from partial data (the improving phase of the curve, before the critical point) and then treated as a general truth. The full curve was hidden in plain sight.
The myth persists partly because it maps onto how we think about human reasoning: more reflection should produce better answers. But LLM reasoning traces aren't human reflection. They're stochastic sequences where entropy (variance) and quality (correctness) are different dimensions. Conflating them is a category error. The related note "Why do LLMs generate more novel research ideas than experts?" shows the same error running in the opposite direction: the intuition that LLMs fall short on creative originality is also empirically reversed. LLMs generate more novel research ideas than human experts but lack the evaluative capacity to select good ones. The structure is the same: an intuition imported from human cognition meets data, and the intuition loses.
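One way to keep entropy and quality apart is to measure them separately on the same set of sampled answers. The sketch below is a minimal illustration, not the method of any cited study: the function names are made up, and the two answer lists are hypothetical placeholders chosen only to exercise the code.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy (in bits) of the empirical distribution over final answers."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def accuracy(answers, gold):
    """Fraction of sampled answers that equal the reference answer."""
    return sum(a == gold for a in answers) / len(answers)


# Hypothetical samples for one question whose reference answer is "42".
# These lists are placeholders, not experimental data.
short_trace_answers = ["42"] * 9 + ["41"]
long_trace_answers = ["42"] * 6 + ["41", "40", "43", "44"]

for label, answers in [("short", short_trace_answers), ("long", long_trace_answers)]:
    print(
        f"{label}: entropy={answer_entropy(answers):.3f} bits, "
        f"accuracy={accuracy(answers, '42'):.2f}"
    )
```

Because the two quantities are computed independently, a condition can score higher on one and lower on the other. That is the distinction the paragraph above draws.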
Post-worthy angle: the overthinking finding is a case study in how intuitions about human cognition, imported uncritically into AI evaluation, generate systematic errors in how we build and measure these systems.
The NoThinking finding adds a sharper falsification at the model level: even within reasoning models, bypassing the explicit thinking process entirely (NoThinking, which forces the thinking box to be empty) outperforms standard thinking across 7 diverse reasoning datasets when token count is controlled. The performance advantage of reasoning models may come partly from the token budget itself rather than from the structured thinking process. If NoThinking matches or beats Thinking at equal tokens, the thinking box is not doing uniquely valuable work; it may be providing a space to generate tokens that helps the model reach answers, rather than implementing a genuine reasoning process.
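"When token count is controlled" can be made concrete by comparing methods only within the same token budget. The sketch below shows one simple way to do that, by bucketing runs on tokens used; it is an assumption about how such a comparison could be set up, not the protocol of the NoThinking study. The `Run` record, the method labels, and the sample runs are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    method: str    # hypothetical labels: "thinking" or "nothinking"
    tokens: int    # total tokens generated for this attempt
    correct: bool  # whether the final answer matched the reference


def accuracy_by_budget(runs, bucket_size):
    """Return {bucket_start: {method: accuracy}}, grouping runs by tokens used."""
    tally = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [correct, total]
    for run in runs:
        bucket = (run.tokens // bucket_size) * bucket_size
        cell = tally[bucket][run.method]
        cell[0] += int(run.correct)
        cell[1] += 1
    return {
        bucket: {method: c / n for method, (c, n) in methods.items()}
        for bucket, methods in sorted(tally.items())
    }


# Placeholder runs to exercise the function; not experimental data.
runs = [
    Run("thinking", 750, True),
    Run("thinking", 900, False),
    Run("nothinking", 800, True),
    Run("thinking", 1900, True),
    Run("nothinking", 1800, True),
]

for bucket, per_method in accuracy_by_budget(runs, bucket_size=1000).items():
    print(f"{bucket}-{bucket + 999} tokens: {per_method}")
```

Comparing within buckets avoids crediting a method for accuracy that may come from spending more tokens. A real evaluation would need many runs per bucket before any difference is meaningful.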
AbstentionBench adds a third dimension to this falsification: reasoning fine-tuning doesn't just produce diminishing token-level returns; it actively degrades calibration, reducing abstention rates by 24%. The "more thinking" myth therefore operates at two timescales: inference-time (more tokens hurt past a threshold) and training-time (reasoning fine-tuning hurts epistemic calibration). The cost of optimizing for reasoning performance is paid not just in overthinking but in lost capacity to recognize the limits of that reasoning. The related note "Does reasoning fine-tuning make models worse at declining to answer?" documents this training-time dimension.
Inquiring lines that use this note as a source (12)
This note is a source for these synthesized inquiries.
- What distinguishes LLM fabrication from genuine theoretical reasoning?
- Can extended thinking genuinely improve reasoning or just increase variance?
- When should an LLM engage extended reasoning versus responding directly?
- How does difficulty level change whether extended thinking provides genuine reasoning signal?
- Do LLMs actually reason differently than humans about moral dilemmas?
- Why does extended thinking increase output variance without improving reasoning quality?
- Can LLMs reflect on and revise their own ethical contradictions?
- Can LLM judges be trained to think more rigorously during evaluation?
- How does extended thinking affect variance in reasoning model outputs?
- When should a system choose extended thinking versus quick responses?
- How much does extended thinking actually improve model reasoning ability?
- Can extended thinking modes introduce genuine rhetorical exploration to LLMs?
Related concepts in this collection (6)
- Does more thinking time always improve reasoning accuracy?
  Explores whether extending a model's thinking tokens linearly improves performance, or if there's a point beyond which additional reasoning becomes counterproductive.
  the empirical finding
- Does extended thinking actually improve reasoning or just increase variance?
  When models think longer, do they reason better, or do they simply sample from a wider distribution of outputs that happens to cover correct answers more often? This matters because it determines whether test-time compute is genuinely scaling reasoning capability.
  why the myth persists (it's not entirely wrong, just wrong past the threshold)
- Why do LLMs generate more novel research ideas than experts?
  LLM-generated research ideas are statistically more novel than those from 100+ expert researchers, but the mechanisms behind this advantage and its practical implications remain unclear. Understanding this paradox could reshape how we use AI in creative knowledge work.
  parallel structure: another cognitive intuition empirically reversed; intuition says LLMs lack creativity, data says they exceed humans in novelty
- Do reasoning traces actually cause correct answers?
  Explores whether the intermediate 'thinking' tokens in R1-style models genuinely drive reasoning or merely mimic its appearance. Matters because false confidence in invalid traces could mask errors.
  names the source of the myth: importing the human intuition that reflective thinking improves answers conflates surface mimicry with genuine reasoning; the myth is sustained by trace anthropomorphism even after empirical falsification
- Is reflection in reasoning models actually fixing mistakes?
  Do the thinking steps that appear after a model's first answer represent genuine self-correction, or are they mostly confirming what the model already concluded? Understanding this matters for how we train and deploy reasoning systems.
  sharpens the falsification: not only does extended thinking fail past a threshold, the reflection stage of reasoning models is mostly confirmatory theater that does not change the first answer; the myth is doubly wrong: extended thinking does not help and the reflection that would justify it is performative
- What makes reflection actually work in reasoning models?
  Does reflection in language models involve genuine self-correction, or just confident-sounding traces? This question probes whether models can truly backtrack and revise versus merely mimicking reflective language.
  replaces the bad metric: the myth treats chain length as a proxy for reasoning quality; reflection capability counting (assumption, backtracking, self-refinement) is the proper unit, and most chains contain few of these despite their length
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Mining Hidden Thoughts from Texts: Evaluating Continual Pretraining with Synthetic Data for LLM Reasoning
- Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs
- Towards Large Reasoning Models: A Survey on Scaling LLM Reasoning Capabilities
- Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
- Reasoning Models Can Be Effective Without Thinking
- Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think
- LLMs can implicitly learn from mistakes in-context
Original note title
the more thinking is always better assumption is llms most testable falsifiable myth