How much does test-time compute improve reasoning without more tokens?
This explores whether models can get better at reasoning by spending more compute *internally* — iterating in hidden states or rearranging where the work happens — rather than by generating more visible thinking tokens.
The corpus suggests the link between 'more tokens' and 'better reasoning' is weaker than it looks. The cleanest version of the idea is latent reasoning: architectures like depth-recurrent models, Heima, and Coconut scale test-time compute by iterating on hidden states, producing no extra visible tokens at all (Can models reason without generating visible thinking tokens?). That this works hints that verbalization is a training artifact, not a requirement for reasoning. A related finding shows that models trained with hidden chain-of-thought tokens do something similar without being asked to: logit-lens analysis finds the correct answer already computed in layers 1–3, then suppressed in the final layers so the model can emit format-compliant filler (Do transformers hide reasoning before producing filler tokens?). The reasoning isn't in the tokens; the tokens are downstream of it.
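To make "iterating on hidden states" concrete, here is a minimal toy sketch, not the architecture of any of the models named above. It assumes a single recurrent block applied `num_iters` times before one decode step; `make_block`, `latent_reason`, and `decode` are hypothetical names invented for illustration. The point it shows is that the iteration count changes how much compute is spent while the number of emitted tokens stays fixed at one.

```python
import math
import random


def make_block(dim, seed=0, scale=0.9):
    """Build a fixed random weight matrix for a toy recurrent block (hypothetical).

    Every entry is at most scale/dim in magnitude, so the Frobenius norm is at
    most `scale` (< 1) and each iteration below is a contraction.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-scale, scale) / dim for _ in range(dim)]
            for _ in range(dim)]


def latent_reason(x, weights, num_iters):
    """Apply h <- tanh(W h + x) num_iters times; no tokens are produced here.

    Returns the final hidden state and the size of the last update.
    """
    h = [0.0] * len(x)
    last_step = 0.0
    for _ in range(num_iters):
        new_h = [math.tanh(sum(w * hj for w, hj in zip(row, h)) + xi)
                 for row, xi in zip(weights, x)]
        last_step = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_h, h)))
        h = new_h
    return h, last_step


def decode(h):
    """Emit exactly one 'token': the index of the largest hidden coordinate."""
    return max(range(len(h)), key=lambda i: h[i])


if __name__ == "__main__":
    weights = make_block(8)
    x = [0.5, -1.0, 0.3, 0.8, -0.2, 0.1, -0.6, 0.4]  # made-up input embedding
    for iters in (1, 4, 32):
        h, step = latent_reason(x, weights, iters)
        tokens = [decode(h)]  # output length is 1 regardless of iters
        print(iters, len(tokens), step)
```

In this toy the update size shrinks as iterations accumulate, which is the hidden state settling; a real latent-reasoning model learns its block rather than using random weights.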
The flip side is that piling on more tokens often *doesn't* help and can actively hurt. Pushing thinking tokens from ~1,100 to ~16K dropped accuracy from 87.3% to 70.3%, a non-monotonic curve where models overthink easy problems and underthink hard ones (Does more thinking time always improve reasoning accuracy?). And when longer thinking *does* help, the mechanism may be unflattering: extended traces appear to work by inflating output variance so a wider sampling net covers the right answer more often, not by reasoning better, and past a threshold the distribution gets too diffuse and accuracy falls (Does extended thinking actually improve reasoning or just increase variance?). So 'more tokens' is partly buying lottery tickets, not insight.
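The variance story can be illustrated with a deliberately simple toy model; it is an assumption made for illustration, not the analysis from the note. Suppose a model's answers are normally distributed around a mean that sits `offset` away from the correct answer, and an answer counts as correct if it lands within `tolerance` of the target. Widening the spread `sigma` first raises the chance that some sample hits, then lowers it once the distribution is too diffuse. All function and parameter names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def hit_probability(offset, tolerance, sigma):
    """Chance that one sample lands within `tolerance` of the target.

    Assumes outputs are Normal with a mean `offset` away from the target.
    """
    lo = (offset - tolerance) / sigma
    hi = (offset + tolerance) / sigma
    return normal_cdf(hi) - normal_cdf(lo)


def coverage(offset, tolerance, sigma, k):
    """Chance that at least one of k independent samples is a hit."""
    p = hit_probability(offset, tolerance, sigma)
    return 1.0 - (1.0 - p) ** k


if __name__ == "__main__":
    # Sweep the spread while the bias (offset) and sample count stay fixed.
    for sigma in (0.1, 0.5, 1.0, 3.0, 10.0):
        print(sigma, coverage(offset=1.0, tolerance=0.2, sigma=sigma, k=8))
```

Nothing in this sketch reasons better as `sigma` grows; the mean never moves toward the target. Coverage rises and then falls purely because of where the probability mass is spread, which is the "lottery tickets" reading of longer traces.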
If tokens aren't where the value lives, where is it? Several notes point to *which* compute, not *how much*. Only ~20% of tokens are high-entropy 'forking points,' and training on just those matches or exceeds full-gradient updates (Do high-entropy tokens drive reasoning model improvements?). Models internally rank tokens by function, preserving symbolic computation and discarding grammar and meta-discourse first (Which tokens in reasoning chains actually matter most?). Most striking: deliberately corrupted, semantically irrelevant traces teach about as well as correct ones, so the trace acts as computational scaffolding rather than meaningful steps (Do reasoning traces need to be semantically correct?). Together these suggest the productive part of a long chain is a small, structural core, and the rest is padding you could compress away.
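As a rough sketch of what "training on just the forking tokens" could look like mechanically: compute the entropy of the next-token distribution at each position, keep the top 20% of positions, and average the loss over only those. This is an assumed reading of the idea, not the procedure from the cited note; `forking_mask`, `masked_mean`, and the 0.2 default are illustrative choices.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def forking_mask(distributions, keep_fraction=0.2):
    """Mark the highest-entropy positions as True, keeping at least one."""
    entropies = [token_entropy(d) for d in distributions]
    k = max(1, round(keep_fraction * len(entropies)))
    ranked = sorted(range(len(entropies)),
                    key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(entropies))]


def masked_mean(losses, mask):
    """Average per-token losses over the selected positions only."""
    selected = [loss for loss, m in zip(losses, mask) if m]
    return sum(selected) / len(selected)


if __name__ == "__main__":
    peaked = [0.97, 0.01, 0.01, 0.01]   # confident continuation
    flat = [0.25, 0.25, 0.25, 0.25]     # genuine decision point
    dists = [peaked] * 10
    dists[3] = flat
    dists[7] = flat
    mask = forking_mask(dists)
    print([i for i, m in enumerate(mask) if m])
```

In a real training loop the distributions would come from the model's softmax outputs and the mask would gate the gradient; here the made-up `peaked` and `flat` distributions only show that the selection picks out the uncertain positions.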
There's also a ceiling worth knowing about. Inference compute can't substitute for the right training: non-reasoning models never catch up to reasoning models no matter how large the inference budget, because training installs a protocol that makes extra tokens productive in the first place (Can non-reasoning models catch up with more compute?). And the gains don't travel well: chain-of-thought degrades predictably under shifts in task, length, or format, producing fluent but logically broken output (Does chain-of-thought reasoning actually generalize beyond training data?). When framework comparisons are controlled for total compute, the choice of search algorithm (BoN vs. MCTS) washes out; what matters is total budget and reward-function quality, not the wrapper (Does the choice of reasoning framework actually matter for test-time performance?).
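The dependence on reward-function quality can be seen in a small simulation of best-of-N selection. This is a toy under stated assumptions, not a reproduction of the information-theoretic analysis: each candidate is correct with a fixed probability, the reward model sees the true quality plus Gaussian noise, and the sample count is derived from a fixed token budget. `samples_for_budget`, `best_of_n_accuracy`, and every numeric value are hypothetical.

```python
import random


def samples_for_budget(total_tokens, tokens_per_sample):
    """How many full samples a fixed token budget pays for."""
    return total_tokens // tokens_per_sample


def best_of_n_accuracy(n, p_correct, reward_noise, trials=2000, seed=0):
    """Estimate how often best-of-N picks a correct candidate.

    Each candidate has quality 1.0 (correct) or 0.0 (wrong); the reward model
    scores it as quality plus Gaussian noise with std `reward_noise`.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        qualities = [1.0 if rng.random() < p_correct else 0.0
                     for _ in range(n)]
        scores = [q + rng.gauss(0.0, reward_noise) for q in qualities]
        best = max(range(n), key=lambda i: scores[i])
        hits += int(qualities[best] == 1.0)
    return hits / trials


if __name__ == "__main__":
    n = samples_for_budget(total_tokens=8000, tokens_per_sample=1000)
    for noise in (0.0, 0.5, 2.0, 5.0):
        print(n, noise, best_of_n_accuracy(n, p_correct=0.3, reward_noise=noise))
```

With the budget held fixed, the only thing varied is the reliability of the scorer, and selection accuracy degrades as the noise grows. The same accounting applies to any search wrapper that spends the budget on candidates and relies on a reward signal to choose among them.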
The more interesting takeaway is that 'test-time compute' is becoming several distinct axes you can trade against each other. You can spend it in latent space, in search rather than reasoning (agentic deep research shows search budget follows the same scaling-then-diminishing curve as reasoning tokens Does search budget scale like reasoning tokens for answer quality?), or on cheap parallel verification (asynchronous verifiers police a single trace at near-zero latency on correct runs Can verifiers monitor reasoning without slowing generation down?). So the honest answer to 'how much without more tokens' is: meaningfully, because the token count was never the real lever — placement, training, and the small set of decision points carry the signal, and visible verbosity is often just the exhaust.
Sources (12 notes)
Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
Longer thinking traces improve accuracy through variance expansion—broader output distributions cover correct answers more often—not through better reasoning. Beyond a critical threshold, the distribution becomes too diffuse and accuracy drops, revealing the mechanism is sampling coverage, not genuine reasoning improvement.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
Information-theoretic analysis shows BoN and MCTS converge in reasoning accuracy when controlling for total compute. Snowball errors accumulate per step regardless of framework; mitigation depends on search scope and reward function reliability, not the specific algorithm.
Agentic deep research shows monotonic-to-diminishing-returns curves for search iterations, matching reasoning token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.
Decoupling verification from generation lets verifiers run alongside a single trace, forking to extract verifiable state and intervening only on violations. On correct runs the latency penalty is near-zero; interwhen matches or beats CoT across benchmarks at similar token budgets.