Can models trained on longer contexts develop better fundamental reasoning abilities?
This reads the question as: does training a model to handle more text (longer inputs, longer chains of thought) actually make it reason better, or just reason more? The corpus answers with a fairly clean "no, length is the wrong lever". It pushes back hard on the intuition that more length means more thinking power, and in doing so it reframes where reasoning actually comes from.
Start with raw input length. Reasoning accuracy doesn't hold up as inputs grow: it drops from 92% to 68% with just 3,000 tokens of padding, far below the context window's nominal limit, and the degradation is task-agnostic and survives chain-of-thought prompting (Does reasoning ability actually degrade with longer inputs?). So a model that's been trained to *swallow* more context isn't thereby trained to *reason over* it. The same story repeats on the output side: increasing thinking tokens from ~1,100 to ~16K dragged accuracy from 87.3% down to 70.3% (Does more thinking time always improve reasoning accuracy?). Optimal chain-of-thought length also follows an inverted U: it rises with task difficulty but falls with model capability, so more capable models actually prefer *shorter* chains, with brevity emerging from reward signals as the model improves (Why does chain of thought accuracy eventually decline with length?). Length, in other words, is something good reasoning sheds, not something it accumulates.
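To put the two accuracy drops on a common footing, the arithmetic can be sketched as below. The figures are the ones given in the source notes (92% to 68% under input padding; 87.3% to 70.3% under thinking-token scaling). The helper name `drop` is an illustrative assumption and does not come from any of the cited work.

```python
def drop(before: float, after: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = before - after
    relative = 100.0 * absolute / before
    return round(absolute, 1), round(relative, 1)


# Input side: accuracy with ~3,000 tokens of padding (figures from the notes).
padding = drop(92.0, 68.0)

# Output side: thinking tokens scaled from ~1,100 to ~16K (figures from the notes).
thinking = drop(87.3, 70.3)

print(padding)   # (24.0, 26.1)
print(thinking)  # (17.0, 19.5)
```

In relative terms both results cost roughly a fifth to a quarter of baseline accuracy. The two figures come from different experiments, so the comparison is only indicative.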
If not length, then what? Two notes relocate the source of reasoning to training and pretraining rather than context size. Base models already contain the latent reasoning capability: five independent methods all *elicit* what's there rather than installing anything new, so the bottleneck is elicitation, not acquisition (Do base models already contain hidden reasoning ability?). And what feeds that latent capability isn't long exposure but the *kind* of data: broad, transferable procedural knowledge spread across diverse pretraining documents drives reasoning generalization, in contrast to factual recall, which leans on narrow memorization (Does procedural knowledge drive reasoning more than factual retrieval?). Reasoning grows from the texture of what a model learned, not the size of the window it learned to read.
The most direct rebuttal to the question comes from the compute angle: non-reasoning models can't be made to match reasoning models simply by spending more inference budget, because training instills a *protocol* that makes the extra tokens productive in the first place (Can non-reasoning models catch up with more compute?). The same mechanism, extended thinking, is counterproductive self-doubt in a vanilla model and beneficial gap-analysis after RL training; training mediates the *quality* of reasoning, not just the *quantity* (Does extended thinking help or hurt model reasoning?). More room to think only helps if the model has been taught how to use it.
The twist worth taking away: longer chains and longer contexts can actively manufacture *fluent but invalid* reasoning. Chain-of-thought degrades predictably outside its training distribution, producing logically inconsistent output that imitates the *form* of reasoning without the logic (Does chain-of-thought reasoning actually generalize beyond training data?), and reasoning models will happily generate long, redundant answers to ill-posed questions that a non-reasoning model correctly rejects as unanswerable (Why do reasoning models overthink ill-posed questions?). The frontier the corpus points toward isn't longer reasoning but *knowing when to stop*: models that route between thinking and answering directly (Can models learn when to think versus respond quickly?), or that isolate reasoning operations into modular tool calls to elicit capability without any extra training at all (Can modular cognitive tools unlock reasoning without training?). Better reasoning looks less like a bigger context and more like better-calibrated restraint.
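As a rough illustration of what "knowing when to stop" could look like at the interface level, here is a minimal, purely hypothetical router that chooses between declining, answering directly, and thinking with a bounded budget. Everything in it is an assumption made for illustration: the `Route` type, the `route` function, the `difficulty` and `well_posed` inputs, and the 0.3 threshold. The systems described in the notes learn this decision through training (Thinkless via DeGRPO) rather than through a hand-written rule like this one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    mode: str    # "decline", "direct", or "think"
    budget: int  # maximum reasoning tokens allowed for this query


def route(difficulty: float, well_posed: bool, max_budget: int = 4096) -> Route:
    """Toy routing rule (hypothetical): not a learned policy.

    difficulty: assumed estimate in [0, 1] of how hard the query is.
    well_posed: assumed flag saying whether the question has the premises
                needed to be answerable at all.
    """
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must be in [0, 1]")
    if not well_posed:
        # Ill-posed question: stop instead of generating a long, redundant answer.
        return Route("decline", 0)
    if difficulty < 0.3:  # hypothetical threshold for easy queries
        return Route("direct", 0)
    # Harder queries get a thinking budget that scales with estimated difficulty.
    return Route("think", int(max_budget * difficulty))


print(route(0.1, True))   # Route(mode='direct', budget=0)
print(route(0.5, True))   # Route(mode='think', budget=2048)
print(route(0.9, False))  # Route(mode='decline', budget=0)
```

The point of the sketch is only that the thinking budget is an output of a decision, not a fixed allowance. How that decision is made well is exactly what the cited training methods address.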
Sources (11 notes)
FLenQA shows reasoning accuracy drops from 92% to 68% at just 3,000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.
Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.
Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.
Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.
Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.