Why do reasoning models fail when input length increases even below context limits?

This explores why models that are supposed to be good at reasoning lose accuracy as the input gets longer — even when the input is nowhere near filling the context window, so it isn't a simple 'ran out of room' problem.

The starting point is the measurement itself. On FLenQA, accuracy falls from 92% to 68% with just 3,000 tokens of padding, far below context-window capacity. The drop is task-agnostic, uncorrelated with how well the model predicts text, and survives chain-of-thought prompting (see "Does reasoning ability actually degrade with longer inputs?"). So the model isn't 'forgetting' the relevant facts, which are right there in the prompt, yet performance still slides. That's the puzzle worth sitting with.
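The measurement above can be sketched as a small harness that holds the task fixed and varies only the amount of irrelevant padding. This is a minimal illustration, not the benchmark's actual code: `build_padded_prompt`, `accuracy_by_length`, the whitespace token count, and the `model_fn` callable are all assumptions made for the example.

```python
import random


def count_tokens(text):
    # Assumption: whitespace splitting stands in for a real tokenizer.
    return len(text.split())


def build_padded_prompt(key_facts, question, filler, target_tokens, seed=0):
    """Scatter key_facts (in order) among filler sentences, then append the question.

    The prompt is grown until adding another filler sentence would exceed
    target_tokens, so the task content stays constant while length varies.
    """
    rng = random.Random(seed)
    filler = [s for s in filler if s.split()]  # drop empty filler to avoid looping forever
    budget = target_tokens - count_tokens(question) - sum(count_tokens(f) for f in key_facts)

    padding = []
    i = 0
    while filler and budget >= count_tokens(filler[i % len(filler)]):
        sentence = filler[i % len(filler)]
        padding.append(sentence)
        budget -= count_tokens(sentence)
        i += 1

    # Sorted insertion slots keep the facts in their original order.
    slots = sorted(rng.randint(0, len(padding)) for _ in key_facts)
    pending = list(zip(slots, key_facts))

    pieces = []
    for pos in range(len(padding) + 1):
        while pending and pending[0][0] == pos:
            pieces.append(pending.pop(0)[1])
        if pos < len(padding):
            pieces.append(padding[pos])
    pieces.append(question)
    return " ".join(pieces)


def accuracy_by_length(model_fn, examples, filler, lengths):
    """Return {target_length: accuracy} for a callable model_fn(prompt) -> answer string."""
    results = {}
    for n in lengths:
        correct = 0
        for ex in examples:
            prompt = build_padded_prompt(ex["facts"], ex["question"], filler, n)
            correct += int(model_fn(prompt) == ex["answer"])
        results[n] = correct / len(examples)
    return results
```

Plotting the returned dictionary against length would show whether accuracy slides while the facts and question stay identical, which is the pattern the note describes.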

Several notes in the corpus converge on a striking reframe: the problem may not be length at all, but what length is standing in for. One line of work argues that failures track *instance-level unfamiliarity* rather than complexity: models fit patterns from training instances rather than running a general algorithm, so a longer or differently shaped problem fails not because it's harder but because it's less familiar (see "Do language models fail at reasoning due to complexity or novelty?"). A related finding recasts apparent 'reasoning collapses' as *execution* failures: text-only models can know the right algorithm but lack the bandwidth to carry it out step by step at scale, and giving them tools lets them solve problems past the supposed cliff (see "Are reasoning model collapses really failures of reasoning?"). Longer inputs tax exactly that procedural-execution budget.

There's also a structural, almost mathematical floor here. A Lipschitz-continuity analysis shows that more reasoning steps *dampen* the propagation of input noise but can never drive sensitivity to zero: there's a non-zero robustness floor baked into the architecture (see "Can longer reasoning chains eliminate model sensitivity to input noise?"). More padding means more surface for that residual sensitivity to act on. And the models reason through semantic association rather than formal symbol manipulation, so when extra content shifts or dilutes the semantic signal, the 'reasoning' has nothing stable to hold onto (see "Do large language models reason symbolically or semantically?").

The failure also shows up in how models *use* their own reasoning chains. Optimal chain-of-thought length follows an inverted-U: too short or too long both hurt, and the sweet spot shifts with task and model (see "Why does chain of thought accuracy eventually decline with length?"). Longer inputs push models toward longer chains, where two reinforcing pathologies appear: 'wandering' into invalid branches and 'underthinking' by abandoning promising paths too early. This looks like disorganization rather than lack of compute, since decoding-level nudges recover accuracy without fine-tuning (see "Why do reasoning models abandon promising solution paths?"). On problems that genuinely require sustained backtracking, DeepSeek-R1 and o1-preview stall at 20-23.6% exact match, revealing that fluent-looking reflection doesn't convert into competence on unfamiliar structures (see "Can reasoning models actually sustain long-chain reflection?").
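The 'decoding-level nudge' mentioned above can be illustrated with a thought-switching penalty: while the current line of reasoning is still young, tokens that typically open a new approach are made less likely. The following is a minimal sketch under stated assumptions, not the cited work's implementation. The function names, the `penalty` and `window` defaults, and the idea that switch markers map to known token ids are all hypothetical.

```python
def apply_switch_penalty(logits, switch_token_ids, steps_since_switch,
                         penalty=3.0, window=600):
    """Return a copy of logits with switch tokens down-weighted.

    logits: list of floats, one per vocabulary id.
    switch_token_ids: ids of tokens that tend to start a new thought
        (hypothetical; e.g. the id for a word like "Alternatively").
    steps_since_switch: tokens generated since the current thought began.
    penalty, window: illustrative values, not taken from any paper.
    """
    adjusted = list(logits)
    if steps_since_switch < window:
        # The thought is still young: discourage abandoning it.
        for token_id in switch_token_ids:
            adjusted[token_id] -= penalty
    return adjusted


def greedy_pick(logits):
    # Index of the largest logit (greedy decoding for the sketch).
    return max(range(len(logits)), key=logits.__getitem__)
```

In a real decoder this adjustment would run once per generated token, with `steps_since_switch` reset to zero whenever a switch token is actually emitted.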

The quietly surprising takeaway: 'long input' may be the wrong variable to blame. At least one camp in the corpus holds that the real bottleneck is *compute to consolidate context into internal state*, not memory or length per se: performance improves with more consolidation passes, a test-time-scaling pattern (see "Is long-context bottleneck really about memory or compute?"). If that's right, the fix isn't bigger context windows but teaching models when to engage heavy reasoning versus respond directly (see "Can models learn when to think versus respond quickly?"), and when to disengage entirely, since today's reasoning models keep reasoning past the point of usefulness on ill-posed questions instead of flagging them as unanswerable (see "Why do reasoning models overthink ill-posed questions?").

Sources (11 notes)

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.

Do language models fail at reasoning due to complexity or novelty?

LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.

Are reasoning model collapses really failures of reasoning?

Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.

Can longer reasoning chains eliminate model sensitivity to input noise?

Lipschitz continuity analysis proves that while additional reasoning steps reduce perturbation propagation, a non-zero robustness floor exists structurally. Sensitivity decreases with stronger embedding and hidden state norms but never reaches zero.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Why do reasoning models abandon promising solution paths?

Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.

Can reasoning models actually sustain long-chain reflection?

DeepSeek-R1 and o1-preview achieve only 20-23.6% exact match on 850 constraint satisfaction problems requiring genuine backtracking. This ceiling reveals that reflective reasoning fluency does not translate to actual problem-solving competence on unfamiliar instance structures.

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Why do reasoning models overthink ill-posed questions?

Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-capability analyst. The durable question: why do reasoning models degrade as inputs grow longer, even well below context window limits — and has this constraint shifted?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026. Key constraints reported:
• Accuracy falls 92%→68% with just 3,000 tokens of padding, far below capacity (2024-02).
• Instance-level unfamiliarity, not task complexity, drives reasoning breakdown; models fit training patterns rather than execute general algorithms (2025-02).
• Execution failures, not reasoning collapse: text-only models lack procedural bandwidth to scale step-by-step; tool use bypasses the cliff (2025-02).
• Optimal chain-of-thought follows an inverted-U; longer inputs push toward longer chains, triggering 'wandering' and 'underthinking' (2025-05).
• The real bottleneck may be *compute to consolidate context into internal state*, not memory or raw length (2025-06, implied).
• Models cannot reject ill-posed questions; they overthink past usefulness (2025-05).

Anchor papers (verify; mind their dates):
• arXiv:2402.14848 (2024-02): Same Task, More Tokens — direct measurement of length penalty.
• arXiv:2502.07266 (2025-02): When More is Less — inverted-U in chain-of-thought length.
• arXiv:2505.20296 (2025-05): Reasoning LLMs are Wandering Solution Explorers — mechanism of degradation.
• arXiv:2509.21284 (2025-09): Bounds of Chain-of-Thought Robustness — theoretical floor on sensitivity.

Your task:
(1) RE-TEST EACH CONSTRAINT. For each claim above, judge whether newer models (o3, o4, or equivalents), in-context learning methods, retrieval augmentation, or multi-step inference tools have since RELAXED or OVERTURNED it. Separate the durable question (likely still open) from the perishable limitation (possibly solved by architectural or training shifts). Cite what resolved it, and flag where a constraint still holds.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last 6 months — work that either refutes the 'wandering/underthinking' narrative or shows that longer inputs no longer degrade under certain conditions.
(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., *Is the length penalty now a knowledge-base problem rather than a reasoning problem?* or *Do models trained with synthetic long-reasoning data or multi-turn scaffolding escape the inverted-U?*

Cite arXiv IDs; flag anything you cannot ground in a real paper.
