SYNTHESIS NOTE
Reasoning, Retrieval, and Evaluation · Psychology, Society, and Alignment

Does more thinking time actually improve LLM reasoning?

The intuition that extended thinking helps LLMs reason better seems obvious, but what does the empirical data actually show when we test it directly?

Synthesis note · 2026-02-20 · sourced from Test Time Compute
How should we allocate compute budget at inference time?

The "more thinking = better reasoning" assumption drives major product and research decisions: model releases tout extended thinking modes, inference infrastructure is built around longer traces, and researchers benchmark scaling behavior assuming monotonic improvement. But the assumption is directly falsifiable with a controlled experiment, and the data falsifies it.

As thinking tokens increase from ~1,100 to ~16,000, accuracy drops from 87.3% to 70.3%. The relationship is non-monotonic: beyond a threshold, more tokens actively hurt.
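To put the two reported endpoints in proportion, a minimal arithmetic sketch follows. Only the four numbers passed in come from this note (the token counts are the approximate values quoted above); the function names `summarize_endpoints` and `best_budget` are hypothetical helpers introduced for illustration.

```python
def summarize_endpoints(tokens_lo, acc_lo, tokens_hi, acc_hi):
    """Compare two measured points on an accuracy-vs-thinking-tokens curve."""
    drop = acc_lo - acc_hi
    return {
        "token_multiplier": tokens_hi / tokens_lo,   # how much more thinking
        "absolute_drop_pp": drop,                    # percentage points lost
        "relative_drop_pct": 100 * drop / acc_lo,    # loss relative to the start
    }


def best_budget(points):
    """Return the (tokens, accuracy) pair with the highest measured accuracy."""
    return max(points, key=lambda p: p[1])


# The two endpoints reported in the note (accuracy in percent).
summary = summarize_endpoints(1_100, 87.3, 16_000, 70.3)
for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

Two endpoints establish only the decline. Showing the non-monotonic shape requires the intermediate measurements as well, which a helper like `best_budget` could scan to locate the peak.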

What makes this a myth rather than just an approximation: it's not that the assumption is wrong at the edges. It's that the assumption was never justified by evidence — it was inferred from partial data (the improving phase of the curve, before the critical point) and then treated as a general truth. The full curve was hidden in plain sight.

The myth persists partly because it maps onto how we think about human reasoning: more reflection should produce better answers. But LLM reasoning traces aren't human reflection. They're stochastic sequences where entropy (variance) and quality (correctness) are different dimensions. Conflating them is a category error. The related note "Why do LLMs generate more novel research ideas than experts?" shows the same error running in the opposite direction: the intuition that LLMs fall short on creative originality also gets empirically reversed, since LLMs generate more novel research ideas than human experts but lack the evaluative capacity to select good ones. Same structure: cognition-imported intuition meets data, intuition loses.

Post-worthy angle: the overthinking finding is a case study in how intuitions about human cognition, imported uncritically into AI evaluation, generate systematic errors in how we build and measure these systems.

The NoThinking finding adds a sharper falsification at the model level: even within reasoning models, bypassing the explicit thinking process entirely (NoThinking, which forces the thinking box to be empty) outperforms standard thinking across 7 diverse reasoning datasets when token count is controlled. The performance advantage of reasoning models may come partly from the token budget itself rather than from the structured thinking process. If NoThinking matches or beats Thinking at equal tokens, the thinking box is not doing uniquely valuable work. It may be providing a space to generate tokens that helps the model reach answers, rather than implementing a genuine reasoning process.
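The phrase "when token count is controlled" carries the argument, so a sketch of what such a comparison involves may help. This is an illustrative assumption, not the evaluation code behind the NoThinking result: `Run`, `accuracy_at_budget`, and the rule that an attempt exceeding the budget counts as a failure are all hypothetical, and the records are made-up placeholders that say nothing about real models.

```python
from dataclasses import dataclass


@dataclass
class Run:
    method: str    # e.g. "thinking" or "nothinking"
    tokens: int    # total tokens generated for this attempt
    correct: bool  # whether the final answer was right


def accuracy_at_budget(runs, method, budget):
    """Accuracy of `method` when any attempt over `budget` tokens counts as a failure."""
    attempts = [r for r in runs if r.method == method]
    if not attempts:
        return None  # no data for this method
    hits = sum(r.correct and r.tokens <= budget for r in attempts)
    return hits / len(attempts)


# Placeholder records purely to exercise the function; these are NOT results.
runs = [
    Run("thinking", 900, True),
    Run("thinking", 2400, True),
    Run("thinking", 3100, False),
    Run("nothinking", 300, True),
    Run("nothinking", 450, False),
    Run("nothinking", 500, True),
]

for method in ("thinking", "nothinking"):
    print(method, accuracy_at_budget(runs, method, budget=1000))
```

The design point is that both methods are scored against the same token ceiling, so a method cannot look better simply by spending more tokens.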

AbstentionBench adds a third dimension to this falsification: reasoning fine-tuning doesn't just produce diminishing token-level returns; it actively degrades calibration, reducing abstention rates by 24%. The "more thinking" myth operates at two timescales: inference-time (more tokens hurt past a threshold) and training-time (reasoning fine-tuning hurts epistemic calibration). The cost of optimizing for reasoning performance is paid not just in overthinking but in lost capacity to recognize the limits of that reasoning. The related note "Does reasoning fine-tuning make models worse at declining to answer?" documents this training-time dimension.


Original note title

the more thinking is always better assumption is llms most testable falsifiable myth