Why does scaling reasoning tokens fail to improve performance on unfamiliar tasks?
This explores why piling on more reasoning tokens — longer chains of thought, more 'thinking' time — stops helping the moment a task drifts outside what the model has seen, rather than continuing to scale.
The corpus points to a blunt answer: reasoning models aren't running a general algorithm that more steps can extend; they're pattern-matching to instances they've already seen. The sharpest version of this is the finding that reasoning breakdowns track instance-level unfamiliarity, not task complexity (Do language models fail at reasoning due to complexity or novelty?). A model will nail a long, hard-looking chain if it was trained on similar instances, and stumble on a short, easy one that happens to be novel. Length was never the bottleneck; novelty is. So adding tokens to an unfamiliar problem just produces more of the wrong thing.
That reframes what a chain of thought even *is*. Several notes converge on the unsettling idea that the visible reasoning is closer to scaffolding than to logic. Models trained on deliberately corrupted, irrelevant traces keep their accuracy, and sometimes generalize *better*, which suggests the trace functions as computational structure rather than as meaningful steps (Do reasoning traces need to be semantically correct?). The DataAlchemy experiments make the failure mode explicit: chain-of-thought degrades predictably under shifts in task, length, or format, producing fluent reasoning that imitates the *form* without the underlying validity (Does chain-of-thought reasoning actually generalize beyond training data?). If the surface form is what's being reproduced, then on an unfamiliar task you get confident-sounding nonsense, and more tokens scale the nonsense.
There's also a ceiling even on familiar ground. Accuracy isn't monotonic in thinking length: pushing tokens from ~1,100 to ~16K dropped benchmark accuracy from 87.3% to 70.3%, with models overthinking easy problems and underthinking hard ones (Does more thinking time always improve reasoning accuracy?). And the learning signal is carried by a small minority of tokens: only ~20% are high-entropy 'forking points' where reasoning actually branches (Do high-entropy tokens drive reasoning model improvements?), with specific reflection tokens like 'Wait' and 'Therefore' spiking in mutual information with the right answer (Do reflection tokens carry more information about correct answers?). Most added tokens aren't doing decisional work, so volume is the wrong lever.
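To make the 'forking point' idea concrete, here is a minimal sketch of how high-entropy positions could be picked out of a generated sequence. It assumes you already have the model's next-token probability distribution at each position; the function names, the 20% cutoff as a parameter, and the toy numbers are illustrative assumptions, not the procedure used in the cited note.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def forking_positions(distributions, fraction=0.2):
    """Return the indices of the top `fraction` of positions ranked by entropy."""
    entropies = [token_entropy(d) for d in distributions]
    k = max(1, round(fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])

# Toy distributions over a 4-token vocabulary (made-up numbers, not model output).
dists = [
    [0.97, 0.01, 0.01, 0.01],  # near-deterministic continuation
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain: a candidate fork
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.01, 0.00, 0.00],
    [0.85, 0.05, 0.05, 0.05],
]
print(forking_positions(dists))  # [1]
```

In this toy case only the uniform distribution at position 1 is selected. The low-entropy positions are ones where the continuation is nearly determined, which is why extending the sequence mostly adds tokens of that kind.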
A useful counter-thread: not every collapse is a *reasoning* limit. Some are execution limits. Text-only models that demonstrably *know* an algorithm still fail to run it across many steps, and the same models clear the supposed 'reasoning cliff' once given tools (Are reasoning model collapses really failures of reasoning?). That's a different unfamiliarity, procedural bandwidth rather than conceptual novelty, and it separates failures that a tool can fix (give it a calculator instead of more tokens) from failures that extra tokens cannot fix (the pattern simply isn't there).
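A small sketch can illustrate what 'procedural bandwidth' means here. Tower of Hanoi is used purely as an illustrative task chosen for this example (the note above does not name a specific puzzle): the algorithm fits in a few lines, yet the move sequence it produces grows as 2^n - 1, so writing every move out as text becomes expensive long before the algorithm itself becomes hard to state. The function names are hypothetical.

```python
def hanoi_moves(n, src="A", dst="C", via="B"):
    """Yield the moves that transfer n disks from src to dst."""
    if n == 0:
        return
    yield from hanoi_moves(n - 1, src, via, dst)  # clear the way
    yield (src, dst)                              # move the largest disk
    yield from hanoi_moves(n - 1, via, dst, src)  # restack on top of it

def trace_length(n):
    """Count how many moves a full written-out solution contains."""
    return sum(1 for _ in hanoi_moves(n))

for n in (3, 10, 15):
    print(n, trace_length(n))
# 3 7
# 10 1023
# 15 32767
```

A tool-enabled model only has to emit something like the short function above and let an interpreter run it, whereas a text-only model has to reproduce every move without a slip. That is an execution burden, not evidence that the pattern is missing.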
So what *does* move an unfamiliar task? Not a larger token budget, but a kind of signal that extra tokens cannot supply. Numerical-reward training plateaus because a scalar reward can't say *why* an attempt failed; natural-language critiques break exactly those plateaus, letting stuck models produce correct solutions (Can natural language feedback overcome numerical reward plateaus?). The thing you didn't know you wanted to know: scaling reasoning tokens is scaling *retrieval of a learned pattern*, and you can't retrieve a pattern that was never stored. That is why the frontier work has shifted from 'think longer' to changing the *kind* of signal the model gets, or offloading execution entirely.
Sources (8 notes)
Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain succeeds when the model was trained on similar instances, regardless of length.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Specific tokens like "Wait" and "Therefore" show sharp spikes in mutual information with correct answers. Suppressing them harms reasoning while suppressing an equal number of random tokens does not, and representation recycling improves accuracy by 20%.
Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.
Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.