Where does sleep-time compute fit in the taxonomy of test-time scaling?

This explores where 'sleep-time compute' — doing inference work before a question is asked, during idle time — sits within the broader map of test-time scaling methods.

The short answer is that the corpus treats sleep-time compute less as a category of its own and more as a shift in *when* compute happens rather than *how much*. Most test-time scaling research organizes itself around a single primary split: internal methods (training a model to reason autonomously) versus external methods (search and verification bolted on at inference) [How do internal and external test-time scaling compare?]. Sleep-time compute doesn't slot neatly into either side of that split. It lives on a different dimension entirely, which is why one of the corpus notes flags it alongside 'post-completion' compute as a *novel direction* that moves the timing of computation rather than its volume [How should test-time scaling methods be categorized and designed?].

To see why that's interesting, it helps to notice that the classic scaling axes are all about volume and shape *at the moment of the query*. There's the parallel-versus-sequential trade-off, breadth of coverage against depth of reasoning [How should we balance parallel versus sequential compute at test time?], and the finding that, once you control for total compute, the specific framework (best-of-N, MCTS) matters far less than the budget you spend [Does the choice of reasoning framework actually matter for test-time performance?]. Sleep-time compute sidesteps that whole framing: instead of asking how to spend tokens *now*, it asks what you can precompute *before* the user shows up, so the query-time latency bill is smaller.
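The timing shift can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not an implementation from any of the cited work: the class `SleepTimeAssistant`, its methods, and the "token" cost unit (one unit per word scanned, one per cached lookup) are all hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class SleepTimeAssistant:
    """Toy model. 'Tokens' are an abstract cost unit, not real model tokens."""
    context: str
    sleep_tokens: int = 0   # spent while idle, before any query arrives
    query_tokens: int = 0   # spent while a user is waiting
    notes: dict = field(default_factory=dict)

    def sleep(self) -> None:
        """Idle-time pass: precompute facts about the context and cache them."""
        words = self.context.split()
        self.notes["word_count"] = len(words)
        self.notes["longest_word"] = max(words, key=len)
        self.sleep_tokens += len(words)  # assumed cost: one unit per word scanned

    def answer(self, question: str) -> object:
        """Query-time pass: reuse a cached note if present, else recompute."""
        if question in self.notes:
            self.query_tokens += 1  # assumed cost of a cached lookup
            return self.notes[question]
        words = self.context.split()
        self.query_tokens += len(words)  # full scan while the user waits
        if question == "word_count":
            return len(words)
        if question == "longest_word":
            return max(words, key=len)
        raise KeyError(question)


context = "sleep time compute moves reasoning before the query arrives"
cold = SleepTimeAssistant(context)   # never sleeps
warm = SleepTimeAssistant(context)   # precomputes during idle time
warm.sleep()
for q in ("word_count", "longest_word"):
    assert cold.answer(q) == warm.answer(q)  # same answers either way
print(cold.query_tokens, warm.query_tokens, warm.sleep_tokens)  # prints: 18 2 9
```

In this toy, both assistants give identical answers; what changes is where the cost lands. The warm assistant pays 9 units before anyone asks and 2 units while the user waits, against 18 units at query time for the cold one.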

The deepest connection the corpus offers is actually to pretraining, not inference. 'Thinking-augmented pretraining' generates reasoning traces and folds them into training data, achieving roughly 3x data efficiency, and the authors explicitly describe this as applying test-time-scaling-style reasoning *ahead of time* [Can training data augmentation match test-time compute scaling benefits?]. That's the same conceptual move as sleep-time compute: amortize reasoning into an idle window so it doesn't have to be paid for at the moment of demand. It also rhymes with the finding that inference compute and parameter scaling are not independent resources but trade against each other [Can inference compute replace scaling up model size?]. Sleep-time compute is one more way to reshuffle *when* in the pipeline a fixed compute budget gets spent.
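The amortization logic can be written as a back-of-the-envelope comparison. The symbols here are illustrative assumptions, not quantities defined in the corpus: $S$ is the one-off compute spent during idle time, $q$ is the per-query cost without precomputation, $q'$ is the per-query cost with it (assumed $q' < q$), and $n$ is the number of queries that reuse the precomputed work.

```latex
\begin{align*}
C_{\text{query-only}}(n) &= n\,q \\
C_{\text{sleep}}(n)      &= S + n\,q' \\
C_{\text{sleep}}(n) < C_{\text{query-only}}(n)
  &\iff S < n\,(q - q')
  \iff n > \frac{S}{q - q'}
\end{align*}
```

Under these assumptions, the per-query latency cost drops from $q$ to $q'$ for any $n$, but total compute only comes out ahead once enough queries share the precomputed work. Below that break-even point the budget has been moved in time, not reduced.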

So the honest placement is: sleep-time compute is orthogonal to the dominant internal/external taxonomy. It's a 'when' axis crosscutting the field's usual 'how much' and 'what shape' axes. A curious reader should know the field hasn't fully absorbed it yet: it shows up as a frontier item [How should test-time scaling methods be categorized and designed?] rather than a settled branch, and its closest intellectual cousins are pretraining-time reasoning augmentation and the compute-substitution results, not the search-versus-reasoning debates it superficially resembles.

Sources (6 notes)

How do internal and external test-time scaling compare?

Research shows test-time scaling methods split into internal (training models for autonomous reasoning) and external (inference-time search and verification). They complement rather than compete; internal builds capability while external extracts performance from existing capability.

How should test-time scaling methods be categorized and designed?

Research identifies internal vs external as the primary taxonomic split for test-time scaling, with training-side constraints (policy entropy collapse) and novel directions that shift *when* compute happens (sleep-time, post-completion) rather than just *how much*. Methods like consensus games and recursive LMs sidestep traditional scaling tradeoffs.

How should we balance parallel versus sequential compute at test time?

Parallel methods improve coverage; sequential methods enable depth. The optimal choice depends on task structure: parallel wins for independent short problems, sequential for compositional chains requiring intermediate accumulation.

Does the choice of reasoning framework actually matter for test-time performance?

Information-theoretic analysis shows BoN and MCTS converge in reasoning accuracy when controlling for total compute. Snowball errors accumulate per step regardless of framework; mitigation depends on search scope and reward function reliability, not the specific algorithm.

Can training data augmentation match test-time compute scaling benefits?

Augmenting pretraining data with LLM-generated reasoning traces improves data efficiency 3x and reasoning benchmark performance 10%+ for 3B models. Harder tokens automatically receive longer traces, creating a natural compute-allocation mechanism analogous to test-time scaling.

Can inference compute replace scaling up model size?

Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst tasked with re-testing whether sleep-time compute remains orthogonal to the internal/external test-time scaling taxonomy, or whether recent work has integrated it into a unified framework.

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; treat these as perishable constraints to be verified:
• Sleep-time compute is taxonomically orthogonal to the dominant internal/external split; it is a 'when' axis rather than a 'how much' or 'what shape' axis (~2025).
• Thinking-augmented pretraining achieves ~3× data efficiency by applying test-time-scaling-style reasoning ahead of demand, conceptually mirroring sleep-time compute's amortization logic (~2025).
• Inference compute and parameter scaling trade against each other; sleep-time compute is one more reshuffling of *when* a fixed budget is spent (~2024).
• The field has not yet fully absorbed sleep-time compute; it appears as a frontier item, not a settled branch (~2025).
• Parallel-versus-sequential trade-off and total reasoning budget (not framework) dominate external slow-thinking efficacy (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2408.03314 (2024-08): Scaling LLM Test-Time Compute Optimally
• arXiv:2509.20186 (2025-09): Thinking Augmented Pre-training
• arXiv:2506.04210 (2025-06): Does Thinking More always Help?
• arXiv:2604.02460 (2026-04): Single-Agent LLMs Outperform Multi-Agent Systems

Your task:
(1) RE-TEST THE ORTHOGONALITY CLAIM. Has recent work (last 6 months) proposed a unified taxonomy that merges the timing dimension with internal/external methods? Check whether newer orchestration (multi-agent, memory, caching), training procedures (RL-based reasoning allocation), or evaluation harnesses have collapsed this distinction or revealed hidden interdependencies. Plainly state which constraints still hold and which have shifted.
(2) Surface the strongest RECONCILING work: does any 2026 paper show that sleep-time and query-time compute are *not* independent, or that one subsumes the other under a larger principle?
(3) Propose 2 research questions that assume the taxonomy may be outdated: e.g., "Is the durable frontier the *total* compute budget across sleep, query, and training, not their timing?" or "Does sleep-time compute's advantage vanish when factoring in training-time inference costs?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
