How much does test-time compute improve reasoning without more tokens?

This explores whether models can get better at reasoning by spending more compute *internally* — iterating in hidden states or rearranging where the work happens — rather than by generating more visible thinking tokens.

The corpus suggests the link between 'more tokens' and 'better reasoning' is weaker than it looks. The cleanest version of the idea is latent reasoning: architectures like depth-recurrent models, Heima, and Coconut scale test-time compute by iterating on hidden states, producing no extra visible tokens at all (Can models reason without generating visible thinking tokens?). That this works hints that verbalization is a training artifact, not a requirement for reasoning. A related finding shows that models trained with hidden CoT tokens do something similar: logit lens analysis finds they compute the correct answer in layers 1–3, then suppress it in the final layers to emit format-compliant filler (Do transformers hide reasoning before producing filler tokens?). The reasoning isn't in the tokens; the tokens are downstream of it.
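To make "iterating on hidden states" concrete, here is a minimal toy sketch of depth recurrence: one weight-tied block applied repeatedly to a hidden state. It is not the architecture of Coconut, Heima, or any specific depth-recurrent model; `make_block`, `latent_iterate`, the tanh update, and the weight scaling are all assumptions chosen so the iteration provably settles.

```python
import math
import random

def make_block(dim, seed=0):
    """Build a hypothetical weight-tied block: h -> tanh(W h + U x)."""
    rng = random.Random(seed)
    scale = 0.5 / dim  # each row of W sums to at most 0.5 in absolute value
    W = [[rng.uniform(-scale, scale) for _ in range(dim)] for _ in range(dim)]
    U = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(dim)]
    return W, U

def matvec(M, v):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def latent_iterate(x, W, U, steps):
    """Apply the same block `steps` times; no tokens are emitted in between.

    Returns the final hidden state and the size of each update, so the
    caller can see the state settling as more compute is spent.
    """
    h = [0.0] * len(x)
    Ux = matvec(U, x)  # the input is injected at every step
    deltas = []
    for _ in range(steps):
        new_h = [math.tanh(a + b) for a, b in zip(matvec(W, h), Ux)]
        deltas.append(max(abs(a - b) for a, b in zip(new_h, h)))
        h = new_h
    return h, deltas

W, U = make_block(4)
x = [1.0, -0.5, 0.25, 2.0]
state, deltas = latent_iterate(x, W, U, steps=16)
```

The only knob here is `steps`: raising it spends more compute on the same input while the output length stays fixed, which is the property the latent-reasoning notes describe.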

The flip side is that piling on more tokens often *doesn't* help and can actively hurt. Pushing thinking tokens from ~1,100 to ~16K dropped benchmark accuracy from 87.3% to 70.3%, a non-monotonic curve where models overthink easy problems and underthink hard ones (Does more thinking time always improve reasoning accuracy?). And when longer thinking *does* help, the mechanism may be unflattering: extended traces appear to work by inflating output variance so a wider sampling net covers the right answer more often, not by reasoning better, and past a threshold the distribution gets too diffuse and accuracy falls (Does extended thinking actually improve reasoning or just increase variance?). So 'more tokens' is partly buying lottery tickets, not insight.
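The variance argument can be illustrated with a toy calculation. The sketch below assumes, purely for illustration, that sampled answers follow a normal distribution centred on a wrong mode, with the correct answer `bias` away and counted as hit within `tol`; none of these names or numbers come from the cited notes.

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def hit_probability(sigma, bias=2.0, tol=0.5):
    """P(one sample lands within `tol` of the correct answer).

    Assumption: samples are N(0, sigma^2) and the correct answer sits
    `bias` away from the mode of the distribution.
    """
    upper = (bias + tol) / sigma
    lower = (bias - tol) / sigma
    return normal_cdf(upper) - normal_cdf(lower)

def coverage(sigma, k, bias=2.0, tol=0.5):
    """P(at least one of k independent samples hits the correct answer)."""
    p = hit_probability(sigma, bias, tol)
    return 1.0 - (1.0 - p) ** k

# Narrow, moderate, and very diffuse output distributions.
for sigma in (0.5, 2.0, 10.0):
    print(sigma, round(hit_probability(sigma), 4), round(coverage(sigma, 8), 4))
```

In this toy setting the hit rate rises as the spread grows from narrow to moderate, then falls again once the distribution is too diffuse, which mirrors the rise-then-fall shape described above without any change in the quality of the underlying "reasoning".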

If tokens aren't where the value lives, where is it? Several notes point to *which* compute, not *how much*. Only ~20% of tokens are high-entropy 'forking points,' and training on just those matches or exceeds full-gradient updates (Do high-entropy tokens drive reasoning model improvements?). Under greedy likelihood-preserving pruning, symbolic-computation tokens are preferentially preserved while grammar and meta-discourse are discarded first (Which tokens in reasoning chains actually matter most?). Most striking: models trained on systematically irrelevant traces keep their solution accuracy and sometimes generalize better out of distribution, which suggests the trace acts as computational scaffolding rather than meaningful steps (Do reasoning traces need to be semantically correct?). Together these suggest the productive part of a long chain is a small, structural core, and the rest is padding you could compress away.
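The forking-token idea can be sketched as a selection step: compute the entropy of each next-token distribution, keep the top fraction, and average the loss over those positions only. This is a hedged illustration, not the training recipe from the cited work; `forking_mask`, `masked_loss`, the 0.2 default, and the toy distributions are assumptions.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def forking_mask(distributions, fraction=0.2):
    """Flag the top `fraction` of positions by entropy as forking tokens."""
    ents = [entropy(d) for d in distributions]
    k = max(1, round(fraction * len(ents)))
    top = sorted(range(len(ents)), key=lambda i: ents[i], reverse=True)[:k]
    keep = set(top)
    return [i in keep for i in range(len(ents))], ents

def masked_loss(token_losses, mask):
    """Average per-token loss over the flagged positions only."""
    kept = [loss for loss, m in zip(token_losses, mask) if m]
    return sum(kept) / len(kept)

# Ten positions: eight near-deterministic, two genuinely uncertain.
confident = [0.98, 0.01, 0.01]
uncertain = [1 / 3, 1 / 3, 1 / 3]
dists = [confident] * 10
dists[3] = uncertain
dists[7] = uncertain
mask, ents = forking_mask(dists)
losses = [0.1] * 10
losses[3], losses[7] = 1.0, 2.0
print(mask, masked_loss(losses, mask))
```

Ranking by percentile rather than a fixed entropy threshold is a design choice made here so the kept fraction stays at roughly 20% regardless of how peaked the model's distributions are overall.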

There's also a ceiling worth knowing about. Inference compute can't substitute for the right training: non-reasoning models never catch up to reasoning models no matter how large the inference budget, because training installs a protocol that makes extra tokens productive in the first place (Can non-reasoning models catch up with more compute?). And the gains don't travel well: chain-of-thought degrades predictably under shifts in task, length, or format, producing fluent but logically broken output (Does chain-of-thought reasoning actually generalize beyond training data?). When framework comparisons are controlled for total compute, the choice of search algorithm (BoN vs. MCTS) washes out; what matters is total budget and reward-function quality, not the wrapper (Does the choice of reasoning framework actually matter for test-time performance?).
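The dependence on budget and reward quality can be seen in a small best-of-N (BoN) simulation. It covers only the BoN side of the comparison, not MCTS, and everything in it is an illustrative assumption: candidate quality is drawn uniformly from [0, 1], the reward model is the true quality plus Gaussian noise, and the function names are hypothetical.

```python
import random

def best_of_n(n, reward_noise, rng):
    """Sample n candidates and pick the one a noisy reward model prefers.

    Returns (true quality of the chosen candidate, best quality available).
    """
    qualities = [rng.random() for _ in range(n)]
    scores = [q + rng.gauss(0.0, reward_noise) for q in qualities]
    chosen = max(range(n), key=lambda i: scores[i])
    return qualities[chosen], max(qualities)

def mean_selected_quality(n, reward_noise, trials=2000, seed=0):
    """Average true quality of the BoN pick over many simulated problems."""
    rng = random.Random(seed)
    total = sum(best_of_n(n, reward_noise, rng)[0] for _ in range(trials))
    return total / trials

# Vary the sampling budget and the reliability of the reward signal.
for n in (1, 4, 16):
    for noise in (0.0, 0.5, 5.0):
        print(n, noise, round(mean_selected_quality(n, noise), 3))
```

With a noiseless reward the pick is always the best candidate and improves with `n`; with a very noisy reward the extra samples buy almost nothing, which is the sense in which budget and reward reliability, rather than the wrapper, set the outcome in this toy.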

The more interesting takeaway is that 'test-time compute' is becoming several distinct axes you can trade against each other. You can spend it in latent space, in search rather than reasoning, or on cheap parallel verification. Agentic deep research shows search budget follows the same scaling-then-diminishing curve as reasoning tokens (Does search budget scale like reasoning tokens for answer quality?), and asynchronous verifiers police a single trace at near-zero latency on correct runs (Can verifiers monitor reasoning without slowing generation down?). So the honest answer to 'how much without more tokens' is: meaningfully, because the token count was never the real lever. Placement, training, and the small set of decision points carry the signal, and visible verbosity is often just the exhaust.

Sources (12 notes)

Can models reason without generating visible thinking tokens?

Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Does extended thinking actually improve reasoning or just increase variance?

Longer thinking traces improve accuracy through variance expansion—broader output distributions cover correct answers more often—not through better reasoning. Beyond a critical threshold, the distribution becomes too diffuse and accuracy drops, revealing the mechanism is sampling coverage, not genuine reasoning improvement.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Which tokens in reasoning chains actually matter most?

Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Does the choice of reasoning framework actually matter for test-time performance?

Information-theoretic analysis shows BoN and MCTS converge in reasoning accuracy when controlling for total compute. Snowball errors accumulate per step regardless of framework; mitigation depends on search scope and reward function reliability, not the specific algorithm.

Does search budget scale like reasoning tokens for answer quality?

Agentic deep research shows monotonic-to-diminishing-returns curves for search iterations, matching reasoning token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.

Can verifiers monitor reasoning without slowing generation down?

Decoupling verification from generation lets verifiers run alongside a single trace, forking to extract verifiable state and intervening only on violations. On correct runs the latency penalty is near-zero; interwhen matches or beats CoT across benchmarks at similar token budgets.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-model researcher re-testing whether test-time compute gains reasoning without adding output tokens. The question remains live: *which forms of internal compute matter, and do they scale independently of verbalization?*

What a curated library found — and when (dated claims, not current truth):
Findings span Dec 2024–Feb 2026. A library of ~12 papers reports:
• Latent reasoning (depth-recurrent, Heima, Coconut architectures) scales test-time compute on hidden states alone, producing zero extra visible tokens (~2025).
• Models trained with hidden CoT tokens compute correct answers in early layers (1-3), then suppress them in the final layers to emit format-compliant filler; the reasoning precedes the visible output (~2024–2025).
• Pushing thinking tokens from ~1,100 to ~16K drops accuracy 87.3% → 70.3% in a non-monotonic curve; models overthink easy problems and underthink hard ones (~2025–2026).
• Only ~20% of tokens are high-entropy 'forking points'; training on just those matches or exceeds full-gradient updates (~2026).
• Systematically irrelevant traces teach about as well as correct ones: the trace scaffolds computation, not meaning (~2025).
• Non-reasoning models never catch reasoning models regardless of inference budget; training installs a protocol that makes extra tokens productive (~2025).
• Chain-of-thought degrades under task/length/format shifts, producing fluent but logically broken output (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2502.05171 (2025-02) — Latent reasoning & depth scaling
• arXiv:2506.04210 (2025-06) — Non-monotonic test-time scaling curve
• arXiv:2506.01939 (2025-06) — High-entropy token bottlenecks
• arXiv:2602.13517 (2026-02) — Deep-thinking vs. long-thinking metrics

Your task:
(1) RE-TEST EACH CONSTRAINT. For latent reasoning, search budget scaling, and the 87.3%→70.3% degradation curve: has newer hardware, training recipes (e.g., RL on latent actions), or multi-agent orchestration since relaxed these? Does the non-monotonicity still hold across model scales (7B→405B)? Does the 20% token bottleneck hold under different RL objectives? Separate the durable question (which compute *types* matter?) from the perishable claim (the specific token threshold).
(2) Surface the strongest work from the last ~6 months that *contradicts* the "tokens aren't the lever" consensus — e.g., evidence that verbosity *does* unlock new reasoning, or that latent scaling plateaus.
(3) Propose 2 research questions assuming the regime has shifted: (a) Can you *train* models to reason in latent space from the start, rather than retrofitting it? (b) Do reasoning gains from test-time compute + search saturate faster than from training, and at what compute ratio?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
