INQUIRING LINE

Why do longer reasoning chains correlate with lower accuracy in o1-like models?

This explores why o1-style reasoning models often get *less* accurate as their chains of thought grow longer — and whether length itself is the cause or just a symptom of something deeper.


The corpus points to a counterintuitive answer: longer is rarely better, and the length is usually a *signal* of trouble rather than the trouble itself. Accuracy as a function of chain length traces an inverted U: it climbs to an intermediate sweet spot, then falls. That optimal length shrinks as models get more capable, so the strongest models actually prefer shorter chains (see "Why does chain of thought accuracy eventually decline with length?"). Strikingly, you can match verbose reasoning at a fraction of the cost: minimal "draft" chains hit equivalent accuracy on arithmetic, symbolic, and commonsense tasks using just 7.6% of the tokens, because most of the removed words were doing stylistic and documentation work, not computation (see "Can minimal reasoning chains match full explanations?").
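One way to see how an inverted U can arise is a toy model, sketched here in Python. It is an illustration under stated assumptions, not the model used in the cited notes: the function names, the sigmoid form, and the parameters `capability`, `difficulty`, and `step_noise` are all hypothetical. The task's difficulty is split evenly across the steps, so more steps make each step easier, while every step also carries a small fixed chance of a slip, so more steps mean more chances to fail.

```python
import math


def chain_accuracy(n_steps: int, difficulty: float, capability: float,
                   step_noise: float = 0.02) -> float:
    """Toy probability that an n-step chain ends correct (illustrative only)."""
    # Assumption: difficulty is split evenly, so each step is easier in a longer chain.
    per_step_load = difficulty / n_steps
    # Assumption: a step is solved with probability sigmoid(capability - load).
    p_solve = 1.0 / (1.0 + math.exp(-(capability - per_step_load)))
    # Assumption: each step also independently survives a fixed slip probability.
    p_step = (1.0 - step_noise) * p_solve
    # Assumption: a single failed step ruins the whole chain.
    return p_step ** n_steps


def best_length(difficulty: float, capability: float, max_steps: int = 60) -> int:
    """Chain length in 1..max_steps that maximizes the toy accuracy."""
    return max(range(1, max_steps + 1),
               key=lambda n: chain_accuracy(n, difficulty, capability))


if __name__ == "__main__":
    for capability in (2.0, 4.0, 8.0):
        print(f"capability={capability}: best length = {best_length(20.0, capability)}")
    for difficulty in (20.0, 40.0):
        print(f"difficulty={difficulty}: best length = {best_length(difficulty, 4.0)}")
```

In this toy, raising `capability` moves the best length down and raising `difficulty` moves it up, which matches the qualitative pattern described in the paragraph. The specific numbers it prints carry no empirical weight.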

So what fills the extra tokens when chains run long and wrong? A lot of it is wasted motion. Reasoning models tend to *underthink*: they abandon promising solution paths mid-exploration and switch to new ones prematurely, burning tokens on half-finished approaches. A simple decoding-time penalty on thought-transition tokens curbs the switching and lifts accuracy with no retraining at all (see "Do reasoning models switch between ideas too frequently?"). The same picture shows up as two reinforcing failures, "wandering" (invalid exploration) and underthinking, which reflect structural disorganization rather than a shortage of compute; the right answer was often reachable but got dropped (see "Why do reasoning models abandon promising solution paths?"). Length, in other words, frequently measures thrashing.
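A decoding-time penalty of this kind can be sketched in a few lines of Python. This is a minimal illustration, not the cited method's implementation: the function name, the `penalty` and `window` parameters, their default values, and the idea of applying the penalty only while the current thought is still young are assumptions made for the example. The set of token ids that mark a thought transition would have to come from the model's own tokenizer.

```python
from typing import Sequence


def penalize_thought_switches(
    logits: Sequence[float],
    switch_token_ids: set[int],
    tokens_since_thought_start: int,
    penalty: float = 3.0,   # hypothetical strength, chosen for illustration
    window: int = 600,      # hypothetical duration, chosen for illustration
) -> list[float]:
    """Return a copy of `logits` with thought-transition tokens down-weighted.

    The penalty is applied only while the current thought is shorter than
    `window` tokens, so the model can still change course later on.
    """
    adjusted = list(logits)
    if tokens_since_thought_start < window:
        for token_id in switch_token_ids:
            # Skip ids that fall outside the vocabulary of this logits vector.
            if 0 <= token_id < len(adjusted):
                adjusted[token_id] -= penalty
    return adjusted


def greedy_pick(logits: Sequence[float]) -> int:
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


if __name__ == "__main__":
    raw = [1.0, 2.5, 2.0]   # toy vocabulary; id 1 stands in for "Alternatively"
    switches = {1}
    early = penalize_thought_switches(raw, switches, tokens_since_thought_start=10)
    late = penalize_thought_switches(raw, switches, tokens_since_thought_start=900)
    print(greedy_pick(early), greedy_pick(late))
```

The adjustment touches only the logits at each decoding step, which is why no retraining is involved: the same weights produce different text because switching is made temporarily more expensive.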

There's also a more mechanical reason long chains decay: every additional step is another place for an error to enter and propagate. Under manipulative multi-turn prompts, reasoning models drop 25–29% in accuracy precisely because extended chains create more corruption points where a single wrong step snowballs into a confident wrong conclusion (see "Are reasoning models actually more vulnerable to manipulation?"). More reasoning does dampen sensitivity to noisy inputs, but a structural robustness floor remains: extra steps reduce perturbation but can never zero it out (see "Can longer reasoning chains eliminate model sensitivity to input noise?"). And token-level analysis finds that local memorization (predicting from the just-preceding tokens rather than reasoning) accounts for up to 67% of errors, an effect that gets worse as complexity and distributional shift grow (see "Where do memorization errors arise in chain-of-thought reasoning?").
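The compounding effect can be made concrete with a back-of-the-envelope calculation. This is a simplification, not a result from the cited notes: it assumes every step fails independently with the same probability \varepsilon and that one failed step spoils the final answer. The value \varepsilon = 0.02 is chosen purely for illustration.

```latex
P(\text{chain of } n \text{ steps is correct}) = (1 - \varepsilon)^{n}
\qquad
\varepsilon = 0.02:\quad
0.98^{10} \approx 0.82, \qquad 0.98^{50} \approx 0.36
```

Under these assumptions a per-step slip rate that looks negligible leaves a 50-step chain correct only about a third of the time.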

Here's the part you might not expect: the length itself often isn't tracking problem difficulty at all. In controlled maze experiments, trace length correlates with difficulty only *inside* the training distribution and decouples completely outside it; long traces mostly reflect recalled training schemas, not adaptive computation on a hard problem (see "Does longer reasoning actually mean harder problems?"). That reframes the whole correlation: models don't fail at some complexity threshold, they fail at instance *novelty* boundaries, fitting memorized instance patterns rather than general algorithms (see "Do language models fail at reasoning due to complexity or novelty?"). When you push them off-distribution, chain-of-thought degrades predictably, producing fluent but logically inconsistent reasoning: the *form* of thinking without the validity (see "Does chain-of-thought reasoning actually generalize beyond training data?").

Two more notes that widen the picture. Even raw input length hurts long before the context window is anywhere near full: accuracy falls from 92% to 68% with just 3,000 tokens of padding, and chain-of-thought prompting doesn't rescue it (see "Does reasoning ability actually degrade with longer inputs?"). And when problems genuinely demand sustained long-chain reflection and backtracking, frontier models like o1-preview and DeepSeek-R1 hit a ceiling of roughly 20–23.6% exact match on constraint-satisfaction tasks; fluent reflection doesn't convert into real problem-solving on unfamiliar structures (see "Can reasoning models actually sustain long-chain reflection?"). The takeaway: long chains correlate with low accuracy because length is usually a symptom (of off-distribution recall, premature path-switching, and accumulating error), not a dial you can turn up to think harder.


Sources (12 notes)

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Can minimal reasoning chains match full explanations?

Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.

Do reasoning models switch between ideas too frequently?

o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.

Why do reasoning models abandon promising solution paths?

Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.

Are reasoning models actually more vulnerable to manipulation?

GaslightingBench-R shows that multi-turn manipulative prompts reduce reasoning model accuracy significantly more than standard models. Extended chains create more corruption points, allowing single wrong steps to propagate into confident incorrect conclusions.

Can longer reasoning chains eliminate model sensitivity to input noise?

Lipschitz continuity analysis proves that while additional reasoning steps reduce perturbation propagation, a non-zero robustness floor exists structurally. Sensitivity decreases with stronger embedding and hidden state norms but never reaches zero.

Where do memorization errors arise in chain-of-thought reasoning?

STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization—based on preceding tokens—accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.

Does longer reasoning actually mean harder problems?

Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.

Do language models fail at reasoning due to complexity or novelty?

LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.

Can reasoning models actually sustain long-chain reflection?

DeepSeek-R1 and o1-preview achieve only 20-23.6% exact match on 850 constraint satisfaction problems requiring genuine backtracking. This ceiling reveals that reflective reasoning fluency does not translate to actual problem-solving competence on unfamiliar instance structures.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-systems researcher. The question remains open: why do longer reasoning chains correlate with lower accuracy in o1-like models — and is length itself causal, or a symptom of deeper failures?

What a curated library found — and when (dated claims, not current truth):
Findings span Feb 2024–Feb 2026. A library of recent work identifies these constraints:
• Optimal chain-of-thought length follows an inverted-U; more capable models prefer *shorter* chains, and minimal "draft" reasoning hits equivalent accuracy at 7.6% of verbose token cost (2025–2026).
• Models underthink: they abandon promising paths prematurely, and penalizing thought-transition tokens at decode time lifts accuracy without retraining (2025-01).
• Every reasoning step adds corruption points; under adversarial multi-turn prompts, accuracy drops 25–29% because errors propagate confidently down longer chains (2025-06).
• Chain length decouples from problem difficulty outside training distribution; long traces mostly reflect memorized schemas, not adaptive computation; off-distribution, CoT degrades predictably (2025-08, 2025-09).
• Raw input padding (3,000 tokens) crashes accuracy from 92% to 68% before context limits are stressed; constraint-satisfaction tasks hit ~20–23% ceilings even for o1-preview (2025-02).

Anchor papers (verify; mind their dates):
• arXiv:2501.18585 (Underthinking of o1-Like LLMs, Jan 2025)
• arXiv:2508.01191 (Chain-of-Thought as Mirage via Data Distribution, Aug 2025)
• arXiv:2506.09677 (Reasoning Models Gaslighting, Jun 2025)
• arXiv:2502.07266 (When More is Less, Feb 2025)

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above — inverted-U optimality, underthinking penalties, error propagation, distribution-bound collapse, constraint-satisfaction ceilings — judge whether *newer models, decoding strategies (speculative, tree-search, rollout), training methods (process reward models, synthetic preference data), or evaluation harnesses* since Feb 2026 have relaxed or overturned any claim. Separate the durable question (likely still open: why does length signal trouble?) from perishable limitations (possibly solved: can decoding-time penalties or multi-agent rollout rescue long reasoning?). Cite what resolved it.
(2) Surface the strongest *contradicting or superseding* work from the last ~6 months — any paper showing long chains *do* help under specific conditions, or that newer o1-variants (o2, DeepSeek-R2, others) have inverted the correlation.
(3) Propose two research questions that *assume the regime may have moved*: (a) If decoding-time or training-time intervention has made length less of a trap, what structural property of models now determines depth-vs.-breadth tradeoffs? (b) If the core issue is off-distribution collapse, can curriculum or adversarial fine-tuning recover length-robustness?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
