How do internal and external test-time scaling compare?

Explores how test-time scaling approaches differ in where the extra effort is invested: in training the model to reason at length on its own (internal) versus in search and verification infrastructure wrapped around the model at inference (external). Understanding this split clarifies the trade-offs in deployment strategy and reasoning capability.

Synthesis note · 2026-02-20 · sourced from Test Time Compute

This taxonomy sorts test-time scaling (TTS) approaches into two categories:

Internal TTS: Train the model to generate long chain-of-thought (CoT) reasoning autonomously, without external scaffolding. This requires supervised fine-tuning (SFT) on long CoT data, reinforcement learning (RL) to reinforce reasoning, or test-time training (TTT; parameter updates at inference). The model allocates its own compute. Examples: o1, DeepSeek-R1, QwQ.
External TTS: Use inference-time infrastructure (search algorithms, verifiers, reward models) to steer a base model toward better outputs. The model's parameters are unchanged; compute is spent on search and evaluation. Examples: Best-of-N with a process reward model (PRM), MCTS, beam search, majority voting.
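To make the external category concrete, here is a minimal Python sketch of two of the listed methods, Best-of-N and majority voting. Everything in it is illustrative: `make_toy_generator` and `toy_score` are hypothetical stand-ins for a sampled model call and a verifier or reward model, not an API from any real library. The point is only that the model stays fixed and extra compute goes into sampling and scoring.

```python
import itertools
from collections import Counter
from typing import Callable, List


def best_of_n(generate: Callable[[str], str],
              score: Callable[[str, str], float],
              prompt: str,
              n: int) -> str:
    """Sample n candidates and return the one the scorer rates highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))


def majority_vote(generate: Callable[[str], str], prompt: str, n: int) -> str:
    """Sample n answers and return the most frequent one (no scorer needed)."""
    answers = [generate(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]


def make_toy_generator(answers: List[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a sampled model: cycles through fixed answers."""
    stream = itertools.cycle(answers)
    return lambda prompt: next(stream)


def toy_score(prompt: str, candidate: str) -> float:
    """Hypothetical stand-in for a verifier or reward model.

    It simply prefers candidates numerically close to 42; a real scorer
    would not have access to the reference answer.
    """
    return -abs(int(candidate) - 42)


samples = ["41", "42", "40", "42", "7"]
picked_by_score = best_of_n(make_toy_generator(samples), toy_score, "q", n=5)
picked_by_vote = majority_vote(make_toy_generator(samples), "q", n=5)
```

In both functions the only dial is `n`: raising it spends more inference compute per query without touching the model's parameters, which is the defining property of external TTS in this note.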

Internal and external TTS are complementary, not competing: internal TTS makes models better reasoners; external TTS extracts more performance from whatever reasoning capability exists. Combining them (e.g., using Best-of-N to boost a long-CoT model with a PRM) often outperforms either alone.

The practical distinction matters for deployment: internal scaling is mainly a training cost paid once, although long chain-of-thought outputs still consume tokens on every query; external scaling is an inference cost paid per query. At high query volume the economics favor internal scaling, but external scaling remains essential during development, when training is expensive.
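The cost argument above can be sketched as a simple break-even calculation. This is an assumed linear cost model, not something taken from the source: internal scaling costs a one-time `training_cost` plus `internal_cost_per_query`, external scaling costs `external_cost_per_query` with no upfront cost, and all names and numbers are hypothetical.

```python
import math
from typing import Optional


def break_even_queries(training_cost: float,
                       internal_cost_per_query: float,
                       external_cost_per_query: float) -> Optional[int]:
    """Smallest query count at which internal scaling is no more expensive.

    Assumed model (illustrative only):
        internal total = training_cost + internal_cost_per_query * q
        external total = external_cost_per_query * q
    Returns None when external scaling is not more expensive per query,
    in which case the training cost is never recovered.
    """
    saving_per_query = external_cost_per_query - internal_cost_per_query
    if saving_per_query <= 0:
        return None
    return math.ceil(training_cost / saving_per_query)


# Hypothetical numbers in arbitrary cost units.
example = break_even_queries(training_cost=1000.0,
                             internal_cost_per_query=0.5,
                             external_cost_per_query=2.5)
```

With these made-up figures the one-time training cost is recovered after 500 queries. Below that volume, which is the typical situation during development, external scaling is the cheaper option under this model.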

The finding in "Can non-reasoning models catch up with more compute?" illustrates the limits of external TTS alone: you need the internal foundation before external scaling can amplify it.

Related concepts in this collection 7

Can non-reasoning models catch up with more compute? Explores whether inference-time compute budget can close the performance gap between standard models and those trained for reasoning, and what training mechanisms might enable this.
the limit of external TTS without internal foundation
How should we balance parallel versus sequential compute at test time? Test-time compute can prioritize breadth (trying many approaches) or depth (refining one approach). Which strategy works better, and does the answer depend on the problem?
a cross-cutting axis that applies within each category
Can retrieval be extended into multi-step chains like reasoning? Standard RAG retrieves once, but multi-hop tasks need intermediate steps. Can we train models to plan retrieval sequences the way chain-of-thought trains reasoning, and scale retrieval at test time?
CoRAG is a hybrid that escapes the internal/external binary: training teaches chain generation (internal) while compute dials (chain length/count) are applied at inference (external); retrieval-intensive tasks have their own TTS curve that this taxonomy did not originally capture
Can models precompute answers before users ask questions? Most LLM applications maintain persistent state across interactions. Could models use idle time between queries to precompute useful inferences about that context, reducing latency when users actually ask?
sleep-time compute fractures the dichotomy by adding a third temporal position: pre-interaction compute is neither internal (weights trained) nor external (inference-time search) but amortized pre-computation; the binary taxonomy needs a third category
Can models reason without generating visible thinking tokens? Explores whether intermediate reasoning must be verbalized as text tokens, or if models can think in hidden continuous space. Challenges a foundational assumption about how language models scale their reasoning capabilities.
challenges the taxonomy: latent recurrent depth-scaling is internal (architectural recurrence) but applied at inference (external compute dial), occupying a hybrid position the binary did not anticipate; verbalization is orthogonal to the internal/external split
Does RL post-training create reasoning or just deploy it? Investigates whether reasoning capability emerges during RL fine-tuning or already exists in base models. Matters because it reshapes how we build and optimize reasoning systems.
reframes "internal TTS": if RL teaches *when* to activate latent capability rather than how to reason, then "internal TTS" is more accurately deployment-timing optimization than capability instillation; the foundation that external TTS amplifies was already in the base model
Can modular cognitive tools unlock reasoning without training? Can reasoning capabilities be elicited by structuring LLM calls as isolated cognitive operations—understanding, recalling, examining, and backtracking—rather than through reinforcement learning?
third-category instance: cognitive tools elicit reasoning at inference time without weight updates AND without external search infrastructure — neither internal nor external in the original sense; the taxonomy needs to distinguish "trained to reason" from "scaffolded to reason"

Original note title

internal vs external tts is the primary taxonomic split in test-time scaling research
