Why do more capable models prefer shorter chains of thought?
This explores why stronger models tend to reason in shorter chains — and whether that brevity is a sign of skill, a quirk of training rewards, or a hint that the visible reasoning was never doing the work we assumed.
The corpus turns this into a more interesting puzzle than a simple "smarter = more efficient" story. The most direct answer is that there is a sweet spot: accuracy follows an inverted U as reasoning gets longer, peaking at some intermediate length and then declining. That optimal length stretches out for harder tasks but shrinks as the model gets more capable, so a stronger model needs fewer steps to reach its accuracy peak (see "Why does chain of thought accuracy eventually decline with length?"). Crucially, nobody trains the model to be terse: reinforcement learning drifts toward shorter chains on its own as the model improves, so brevity emerges from the reward signal rather than being explicitly taught.
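To make the inverted-U shape concrete, here is a toy model in Python. It is only a sketch under stated assumptions, not the model used in the cited note: accuracy is taken to be the product of a hypothetical "coverage" term that grows with chain length and a hypothetical "no error so far" term that shrinks with it. The names `accuracy`, `optimal_length`, `difficulty`, `capability`, and `step_error` are all invented for illustration.

```python
import math


def accuracy(n_steps, difficulty, capability, step_error=0.02):
    """Toy accuracy for a chain of n_steps reasoning steps (illustrative only)."""
    # Assumed coverage term: the chance the chain has worked through the
    # problem. It grows with length, faster for capable models and slower
    # for difficult tasks.
    coverage = 1.0 - math.exp(-n_steps * capability / difficulty)
    # Assumed error term: the chance that no step introduced a mistake.
    # It shrinks with every extra step.
    no_error = (1.0 - step_error) ** n_steps
    return coverage * no_error


def optimal_length(difficulty, capability, max_steps=200):
    """Return the chain length (in steps) that maximizes the toy accuracy."""
    return max(
        range(1, max_steps + 1),
        key=lambda n: accuracy(n, difficulty, capability),
    )


if __name__ == "__main__":
    for difficulty, capability in [(4, 1), (8, 1), (4, 2)]:
        best = optimal_length(difficulty, capability)
        print(f"difficulty={difficulty} capability={capability} -> best length {best}")
```

In this toy setup the best length rises when `difficulty` rises and falls when `capability` rises, which is the qualitative pattern described above. The specific numbers carry no empirical meaning.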
What's striking is how much of a long chain turns out to be non-computational. One study strips reasoning down to minimal drafts and matches full chain-of-thought accuracy using just 7.6% of the tokens; the other 92.4% served style and documentation, not the actual computation (see "Can minimal reasoning chains match full explanations?"). And verbosity itself seems to be a single steerable direction in the model's activation space: a vector extracted from 50 paired examples cuts chain length by about two-thirds (67%) without hurting accuracy (see "Can we steer reasoning toward brevity without retraining?"). If conciseness lives on one dial, a capable model converging toward it looks less like a discovery and more like settling into a region it can already represent.
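The "single steerable direction" idea can be sketched with a difference-of-means steering vector. This is a common way to build such vectors and is an assumption here, not necessarily how the cited method extracts its vector. The activations below are synthetic random lists rather than real model states, and every function name (`steering_vector`, `steer`, `projection`, `make_synthetic_pairs`) is hypothetical.

```python
import random


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(rows[0])
    return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]


def steering_vector(concise_acts, verbose_acts):
    """Assumed recipe: mean concise activation minus mean verbose activation."""
    mean_concise = mean_vector(concise_acts)
    mean_verbose = mean_vector(verbose_acts)
    return [c - v for c, v in zip(mean_concise, mean_verbose)]


def steer(hidden, vector, alpha=1.0):
    """Shift a hidden state along the steering vector by strength alpha."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


def projection(hidden, vector):
    """Scalar projection of a hidden state onto the steering direction."""
    norm = sum(v * v for v in vector) ** 0.5
    return sum(h * v for h, v in zip(hidden, vector)) / norm


def make_synthetic_pairs(n_pairs=50, dim=8, seed=0):
    """Fake paired activations: 'concise' differs by a shift on coordinate 0."""
    rng = random.Random(seed)
    concise, verbose = [], []
    for _ in range(n_pairs):
        base = [rng.gauss(0, 1) for _ in range(dim)]
        verbose.append([b + rng.gauss(0, 0.1) for b in base])
        concise.append(
            [b + rng.gauss(0, 0.1) + (1.0 if i == 0 else 0.0)
             for i, b in enumerate(base)]
        )
    return concise, verbose


if __name__ == "__main__":
    concise, verbose = make_synthetic_pairs()
    vector = steering_vector(concise, verbose)
    state = verbose[0]
    print("projection before:", projection(state, vector))
    print("projection after: ", projection(steer(state, vector), vector))
```

In a real setting the vectors would come from a model's hidden states at a chosen layer, and the strength `alpha` would need tuning against accuracy. Neither detail is specified in the text above.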
Overthinking is not just wasteful; it is actively harmful. Push thinking tokens from ~1,100 to ~16K and benchmark accuracy can collapse from 87.3% to 70.3%: models overthink easy problems, and the extra deliberation corrodes answers that were already correct (see "Does more thinking time always improve reasoning accuracy?"). There's a mechanistic flavor to this: vanilla models (before RL training) use extended thinking counterproductively, talking themselves into self-doubt, and RL training reverses that, turning the same machinery from second-guessing into useful gap analysis (see "Does extended thinking help or hurt model reasoning?"). So a capable model's short chain may reflect that it no longer needs to argue itself out of a corner.
Here's the part you didn't know you wanted to know: trace length may not be measuring difficulty at all. Controlled A* maze experiments show chain length tracks problem difficulty only when problems resemble training data; out of distribution, the correlation vanishes entirely. Length mostly reflects how well the model is recalling a familiar schema, not how much it's adaptively computing (see "Does longer reasoning actually mean harder problems?"). A capable model produces short chains partly because more of the world looks familiar to it. This reframes the whole question, and two further findings sharpen it. Fine-tuning makes reasoning steps less causally connected to the final answer, so the chain becomes performative rather than functional (see "Does fine-tuning disconnect reasoning steps from final answers?"). And models can scale reasoning entirely in latent space without verbalizing anything, suggesting the written-out chain is a training artifact rather than a requirement (see "Can models reason without generating visible thinking tokens?").
For the adjacent territory: rather than always reasoning short or long, models can learn to route, picking extended thinking or a direct answer per problem without difficulty labels (see "Can models learn when to think versus respond quickly?"). And brevity isn't free everywhere. Longer chains do measurably dampen sensitivity to noisy inputs, though never to zero (see "Can longer reasoning chains eliminate model sensitivity to input noise?"). In multimodal perception tasks, by contrast, verbose reasoning actively hurts because it optimizes the wrong bottleneck: the limit there is visual attention allocation, not verbalization (see "Does verbose chain-of-thought actually help multimodal perception tasks?"). Taken together, the corpus suggests "more capable models prefer shorter chains" is real but the cause is layered: part genuine efficiency, part reward-driven convergence, and part a clue that chain length was telling us about familiarity and presentation more than about thinking itself.
Sources (11 notes)
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.
Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.
Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.
Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.
Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.
Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.
Lipschitz continuity analysis proves that while additional reasoning steps reduce perturbation propagation, a non-zero robustness floor exists structurally. Sensitivity decreases with stronger embedding and hidden state norms but never reaches zero.
Long rationales and text-token RL help reasoning but hurt fine-grained perception tasks because the actual bottleneck is visual attention allocation, not verbalization. Standard CoT optimization trains the wrong policy target.