Does search budget scale like reasoning tokens for answer quality?

Explores whether the test-time scaling law that applies to reasoning tokens also governs search-based retrieval in agentic systems. Understanding this relationship could reshape how we allocate inference compute between thinking and searching.

Synthesis note · 2026-02-21 · sourced from Deep Research

The test-time scaling framework (more inference compute yields better answers, up to a threshold) has been documented for reasoning token budgets in chain-of-thought models. The Agentic Deep Research finding extends this to search: more search steps and more retrieval rounds yield better answers. The relationship follows the same shape, with quality rising as the budget grows and then leveling off.

This matters because it multiplies the design space for inference-time compute. Before, the question was "how many tokens to think?" Now there are two axes: reasoning budget per query and search budget per query. They are not independent — longer chains may require more retrieval to validate intermediate steps, and more retrieval may require more reasoning to synthesize. The optimal allocation problem gets harder.
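A toy sketch of the two-axis allocation problem, in Python. Everything here is hypothetical: `toy_quality`, its constants, and the per-unit costs are illustrative and not fitted to any data. The multiplicative term is just one way to encode the claim that the two budgets interact.

```python
import math
from itertools import product


def toy_quality(reasoning_tokens: int, search_steps: int) -> float:
    """Hypothetical saturating quality curve (an assumption, not a measured law)."""
    think = 1 - math.exp(-reasoning_tokens / 2000)
    search = 1 - math.exp(-search_steps / 4)
    # The product term makes the two axes interact: retrieval helps more
    # when there is enough reasoning budget to synthesize it, and vice versa.
    return 0.3 * think + 0.3 * search + 0.4 * think * search


def best_allocation(budget, token_cost, search_cost, token_grid, search_grid):
    """Grid-search the (tokens, searches) pair with the highest toy quality
    whose total cost fits the budget. Returns (quality, tokens, searches),
    or None if no grid point is affordable."""
    best = None
    for tokens, searches in product(token_grid, search_grid):
        cost = tokens * token_cost + searches * search_cost
        if cost > budget:
            continue
        quality = toy_quality(tokens, searches)
        if best is None or quality > best[0]:
            best = (quality, tokens, searches)
    return best


TOKEN_GRID = [500, 1000, 2000, 4000, 8000]
SEARCH_GRID = [0, 1, 2, 4, 8]
print(best_allocation(10.0, 0.001, 1.0, TOKEN_GRID, SEARCH_GRID))
```

With a real system, `toy_quality` would be replaced by measured accuracy at each grid point; the search itself stays the same.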

The practical implication is that "deep research quality" is not a fixed property of a model — it is a function of the search budget you give it. A mid-sized model with a large search budget can outperform a large model with a restricted one. This shifts cost optimization from training compute to inference architecture, specifically the retrieval loop.

The finding also reframes what "thinking harder" means for agents. For single-turn reasoning models, thinking harder means more tokens per response. For search agents, thinking harder means more search-retrieve-synthesize iterations. The related question "How should we balance parallel versus sequential compute at test time?" applies here too: whether to parallelize retrieval across multiple query variants or to chain queries iteratively is the same structural trade-off, operating at the retrieval level.
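The parallel-versus-sequential distinction can be made concrete with a toy sketch. The corpus, the keyword-overlap retriever, and the rule of reusing a retrieved document as the next query are all assumptions chosen for illustration; a real agent would use a proper retriever and let the model write follow-up queries.

```python
DOCS = [
    "alice manages the orion project",
    "the orion project is based in lisbon",
    "bob manages the vega project",
    "the vega project is based in oslo",
]
STOPWORDS = {"the", "is", "in", "project", "manages", "based", "who", "where"}


def tokens(text):
    """Lowercased content words of a text."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def retrieve(query, exclude=()):
    """Return the unseen document with the largest word overlap, or None."""
    q = tokens(query)
    best, best_score = None, 0
    for doc in DOCS:
        if doc in exclude:
            continue
        score = len(q & tokens(doc))
        if score > best_score:
            best, best_score = doc, score
    return best


def parallel_search(variants):
    """Breadth: issue every query variant independently and pool the hits."""
    found = []
    for variant in variants:
        doc = retrieve(variant)
        if doc is not None and doc not in found:
            found.append(doc)
    return found


def sequential_search(query, steps):
    """Depth: each retrieved document becomes the next query (toy follow-up)."""
    found = []
    for _ in range(steps):
        doc = retrieve(query, exclude=found)
        if doc is None:
            break
        found.append(doc)
        query = doc
    return found


print(parallel_search(["alice", "who is alice", "alice project"]))
print(sequential_search("alice", steps=2))
```

In this two-hop toy, three parallel variants keep returning the same first-hop document, while two chained steps reach the second hop. That outcome is built into the example; for ambiguous single-hop questions, breadth may be the better use of the same budget.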

CoRAG (Chain-of-Retrieval Augmented Generation) extends this from agentic search behavior to explicitly trained retrieval models. Training via rejection sampling generates intermediate retrieval chains; test-time compute is controlled via decoding strategies (greedy / best-of-N / tree search). The same monotonic scaling relationship holds: more retrieval budget yields better answers on multi-hop QA. This suggests the test-time scaling law is not specific to reasoning tokens or agentic search, but may be a general property of iterative processes with quality-sensitive intermediate steps. See "Can retrieval be extended into multi-step chains like reasoning?".
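A minimal sketch of best-of-N selection over sampled retrieval chains, one of the decoding strategies named above. This is not CoRAG's implementation: `sample_chain` is a hypothetical stand-in for a model that samples sub-queries, and the random score is a placeholder for whatever chain-scoring signal a real system would use.

```python
import random


def sample_chain(rng, max_steps):
    """Hypothetical stand-in for sampling one retrieval chain.

    Returns (steps, score). The score is a random placeholder for a
    chain-quality signal; a real system would compute it from the model.
    """
    steps = [f"sub-query {i}" for i in range(rng.randint(1, max_steps))]
    score = rng.random()
    return steps, score


def best_of_n(n, max_steps=4, seed=0):
    """Sample n chains and keep the highest-scoring one."""
    rng = random.Random(seed)
    chains = [sample_chain(rng, max_steps) for _ in range(n)]
    return max(chains, key=lambda chain: chain[1])


# Greedy decoding corresponds to n = 1; a larger n spends more retrieval budget.
for n in (1, 2, 4, 8):
    steps, score = best_of_n(n)
    print(n, len(steps), round(score, 3))
```

With a fixed seed, the chains sampled for a smaller N are a prefix of those sampled for a larger N, so the selected score never decreases as N grows. That shows only that the selection step is monotone under a proxy score; whether answer accuracy follows depends on how well the score tracks correctness.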

Search-R1 and R1-Searcher demonstrate RL-based approaches that teach LLMs to autonomously invoke search during reasoning. Search-R1 (2025) uses retrieved token masking for stable RL training and a simple outcome-based reward, achieving a 24% improvement (Qwen2.5-7B) over RAG baselines. The model learns multi-turn search with <search>/<information> token pairs. R1-Searcher (2025) introduces a two-stage approach: first, a retrieve-reward incentivizes the model to conduct retrieval operations correctly; then, an answer-reward encourages effective use of the retrieved knowledge. Both demonstrate that RL training enables test-time scaling of tool calls: models learn to invoke search more frequently and more effectively as task difficulty increases, which is consistent with the search-budget scaling law.
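Retrieved token masking can be sketched as a loss mask over a rollout: tokens the model generated contribute to the training loss, while tokens inserted by the retriever do not. The sketch below is an illustration under stated assumptions, not Search-R1's code: `loss_mask` is a hypothetical helper, whitespace splitting stands in for real tokenization, the closing tags are assumed, and masking the `<information>` tags themselves is a choice made here.

```python
def loss_mask(tokens):
    """Return 1 for model-generated tokens and 0 for retrieved ones.

    Everything from <information> through </information>, tags included,
    is treated as retriever output and excluded from the loss (assumption).
    """
    mask = []
    inside_retrieved = False
    for tok in tokens:
        if tok == "<information>":
            inside_retrieved = True
            mask.append(0)
        elif tok == "</information>":
            mask.append(0)
            inside_retrieved = False
        else:
            mask.append(0 if inside_retrieved else 1)
    return mask


rollout = (
    "<search> capital of france </search> "
    "<information> paris is the capital </information> "
    "<answer> paris </answer>"
).split()

mask = loss_mask(rollout)
for tok, keep in zip(rollout, mask):
    print(keep, tok)
```

The point of the mask is that the policy is only credited or penalized for what it wrote (queries, reasoning, answer), not for text it merely received from the search tool.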

Related concepts in this collection 3

Can we allocate inference compute based on prompt difficulty? Does adjusting how much compute each prompt receives—rather than using a fixed budget—improve model performance? Could smarter allocation let smaller models compete with larger ones?
extends: search budget is now a second compute axis alongside reasoning tokens; adaptive allocation must account for both
How should we balance parallel versus sequential compute at test time? Test-time compute can prioritize breadth (trying many approaches) or depth (refining one approach). Which strategy works better, and does the answer depend on the problem?
applies: parallel retrieval (multiple query variants) vs sequential retrieval (chained iterations) is the same structural trade-off
How do internal and external test-time scaling compare? Explores whether test-time scaling approaches fundamentally differ in where compute is spent: during training (internal) versus at inference (external). Understanding this split clarifies the trade-offs in deployment strategy and reasoning capability.
extends: search-based DR is the clearest case of external TTS; this finding quantifies its scaling behavior

Original note title

agentic deep research exhibits a test-time scaling law where search budget determines answer quality creating a new inference-compute axis
