Does search budget scale like reasoning tokens for answer quality?
Explores whether the test-time scaling law that applies to reasoning tokens also governs search-based retrieval in agentic systems. Understanding this relationship could reshape how we allocate inference compute between thinking and searching.
The test-time scaling framework, in which more inference compute yields better answers up to a threshold, has been documented for reasoning token budgets in chain-of-thought models. The Agentic Deep Research finding extends this to search: more search steps and more retrieval rounds yield better answers. The relationship follows the same shape: quality rises with budget and then levels off past a threshold.
This matters because it multiplies the design space for inference-time compute. Before, the question was "how many tokens to think?" Now there are two axes: reasoning budget per query and search budget per query. They are not independent — longer chains may require more retrieval to validate intermediate steps, and more retrieval may require more reasoning to synthesize. The optimal allocation problem gets harder.
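To make the two-axis allocation problem concrete, here is a minimal sketch of a grid search over reasoning tokens and search steps under a fixed cost budget. The quality function, its coupling term, and the per-unit costs are all hypothetical and chosen only for illustration; none of these numbers come from the finding itself.

```python
import math
from itertools import product

def toy_quality(reason_tokens: int, search_steps: int) -> float:
    """Hypothetical quality score in [0, 1): saturating on each axis, with coupling."""
    think = 1 - math.exp(-reason_tokens / 2000)   # diminishing returns on reasoning
    search = 1 - math.exp(-search_steps / 4)      # diminishing returns on search
    # Coupling term: retrieval pays off only if there is reasoning to synthesize it.
    return 0.4 * think + 0.2 * search + 0.4 * think * search

def best_allocation(budget: float, token_cost: float = 0.001, search_cost: float = 0.5):
    """Return (tokens, steps, quality) maximizing toy_quality within the cost budget."""
    best = (0, 0, 0.0)
    for tokens, steps in product(range(0, 8001, 500), range(0, 17)):
        cost = tokens * token_cost + steps * search_cost
        if cost <= budget:
            q = toy_quality(tokens, steps)
            if q > best[2]:
                best = (tokens, steps, q)
    return best

if __name__ == "__main__":
    for budget in (2.0, 4.0, 8.0):
        tokens, steps, q = best_allocation(budget)
        print(f"budget={budget}: tokens={tokens}, search_steps={steps}, quality={q:.3f}")
```

Under these assumed curves the best allocation spends on both axes rather than exhausting the budget on one, which is the interdependence the paragraph above describes. With a different coupling term the optimum could look quite different.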
The practical implication is that "deep research quality" is not a fixed property of a model — it is a function of the search budget you give it. A mid-sized model with a large search budget can outperform a large model with a restricted one. This shifts cost optimization from training compute to inference architecture, specifically the retrieval loop.
The finding also reframes what "thinking harder" means for agents. For single-turn reasoning models, thinking harder means more tokens per response. For search agents, it means more search-retrieve-synthesize iterations. The question raised in "How should we balance parallel versus sequential compute at test time?" applies here too: whether to parallelize retrieval across multiple query variants (parallel) or chain queries iteratively (sequential) is the same structural trade-off, operating at the retrieval level.
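The parallel and sequential options can be sketched side by side. This is a toy illustration, not any system's actual retrieval loop: the index, `toy_retrieve`, and the query-rewrite rule `toy_next_query` are hypothetical stand-ins for a real search backend and a model-driven query rewriter.

```python
from typing import Callable, List

Retriever = Callable[[str], List[str]]

def parallel_retrieve(query_variants: List[str], retrieve: Retriever) -> List[str]:
    """Breadth: issue every query variant independently, then merge and de-duplicate."""
    merged: List[str] = []
    for query in query_variants:
        for doc in retrieve(query):
            if doc not in merged:
                merged.append(doc)
    return merged

def sequential_retrieve(query: str, retrieve: Retriever,
                        next_query: Callable[[str, List[str]], str],
                        steps: int) -> List[str]:
    """Depth: each round's query is derived from what the previous round returned."""
    collected: List[str] = []
    for _ in range(steps):
        docs = retrieve(query)
        for doc in docs:
            if doc not in collected:
                collected.append(doc)
        query = next_query(query, docs)
    return collected

# Hypothetical two-document index for a two-hop question.
TOY_INDEX = {
    "capital of france": ["Paris is the capital of France."],
    "paris river": ["The Seine flows through Paris."],
}

def toy_retrieve(query: str) -> List[str]:
    return TOY_INDEX.get(query.lower(), [])

def toy_next_query(query: str, docs: List[str]) -> str:
    # Hypothetical rewrite rule: treat the first word of the first hit as the new entity.
    return f"{docs[0].split()[0]} river" if docs else query

if __name__ == "__main__":
    print(parallel_retrieve(["capital of France", "France capital city"], toy_retrieve))
    print(sequential_retrieve("capital of France", toy_retrieve, toy_next_query, steps=2))
```

Both calls spend two retrievals. In this toy setup only the sequential chain reaches the second-hop document, because the second query depends on the first result. On single-hop questions with ambiguous wording, the parallel variants would be the better use of the same budget.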
CoRAG (Chain-of-Retrieval Augmented Generation) extends this from agentic search behavior to explicitly trained retrieval models. Rejection sampling generates intermediate retrieval chains for training; at test time, compute is controlled through the decoding strategy (greedy, best-of-N, or tree search). The same monotonic scaling relationship holds: more retrieval budget yields better answers on multi-hop QA. This suggests the test-time scaling (TTS) law is not specific to reasoning tokens or agentic search, but may be a general property of iterative processes with quality-sensitive intermediate steps. See also the note "Can retrieval be extended into multi-step chains like reasoning?"
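As a rough illustration of how a decoding strategy turns into a retrieval budget, the sketch below implements generic best-of-N selection over sampled retrieval chains. It is not CoRAG's implementation: the chain representation, `toy_sample_chain`, and `toy_score` are hypothetical, and a real system would sample chains from a trained model and score them with a learned or likelihood-based criterion.

```python
import random
from typing import Callable, List, Tuple

Chain = List[str]  # a retrieval chain as a list of sub-queries (assumed representation)

def best_of_n(sample_chain: Callable[[random.Random], Chain],
              score: Callable[[Chain], float],
              n: int, seed: int = 0) -> Tuple[Chain, float]:
    """Sample n candidate retrieval chains and keep the one the scorer prefers."""
    rng = random.Random(seed)
    candidates = [sample_chain(rng) for _ in range(n)]
    best = max(candidates, key=score)
    return best, score(best)

# Hypothetical two-hop question: both sub-queries below are needed to answer it.
NEEDED = {"who directed the film", "where was the director born"}
POOL = sorted(NEEDED) + ["film release year", "film box office"]

def toy_sample_chain(rng: random.Random) -> Chain:
    # Stand-in for sampling a chain from a model: one to three sub-queries from the pool.
    return rng.sample(POOL, k=rng.randint(1, 3))

def toy_score(chain: Chain) -> float:
    # Stand-in scorer: fraction of the needed hops that the chain covers.
    return len(NEEDED & set(chain)) / len(NEEDED)

if __name__ == "__main__":
    for n in (1, 4, 16):
        chain, s = best_of_n(toy_sample_chain, toy_score, n)
        print(f"N={n}: score={s:.2f}, chain={chain}")
```

With a fixed seed, the candidates for a larger N include those for a smaller N, so the selected score never decreases as N grows. Here N=1 is a single sample rather than true greedy decoding, and tree search would expand partial chains instead of sampling whole ones.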
Search-R1 and R1-Searcher demonstrate RL-based approaches that teach LLMs to invoke search autonomously during reasoning. Search-R1 (2025) uses retrieved-token masking for stable RL training and a simple outcome-based reward, achieving a 24% improvement (Qwen2.5-7B) over RAG baselines. The model learns multi-turn search with <search>/<information> token pairs. R1-Searcher (2025) introduces a two-stage approach: first a retrieve-reward incentivizes the model to conduct retrieval operations correctly, then an answer-reward encourages effective use of the retrieved knowledge. Both show that RL training enables test-time scaling of tool calls: models learn to invoke search more frequently and more effectively as task difficulty increases, which is consistent with the search-budget scaling law.
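The interleaved search loop and the idea of masking retrieved tokens can be sketched as follows. This is a simplified illustration under stated assumptions, not Search-R1's code: `toy_generate` is a scripted stand-in for the policy model, `toy_retrieve` stands in for a search engine, the `<answer>` tag is assumed for the example, and whitespace splitting stands in for a real tokenizer.

```python
import re
from typing import Callable, List, Tuple

SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)
Segment = Tuple[str, bool]  # (text, trainable): retrieved text is marked not trainable

def rollout(generate: Callable[[str], str], retrieve: Callable[[str], str],
            prompt: str, max_searches: int) -> Tuple[str, List[Segment]]:
    """Alternate model generation and retrieval until no search is requested or the budget ends."""
    context, segments, searches = prompt, [], 0
    while True:
        out = generate(context)
        segments.append((out, True))            # model-written text contributes to the loss
        context += out
        match = SEARCH_RE.search(out)
        if match is None or searches >= max_searches:
            break
        searches += 1
        info = f"<information>{retrieve(match.group(1).strip())}</information>"
        segments.append((info, False))          # retrieved text is excluded from the loss
        context += info
    return context, segments

def loss_mask(segments: List[Segment],
              tokenize: Callable[[str], List[str]] = str.split) -> List[int]:
    """1 for tokens the model generated, 0 for tokens that came from retrieval."""
    mask: List[int] = []
    for text, trainable in segments:
        mask.extend([1 if trainable else 0] * len(tokenize(text)))
    return mask

def toy_generate(context: str) -> str:
    # Scripted policy: search once, then answer once information is in context.
    if "<information>" not in context:
        return "<search>capital of France</search>"
    return "<answer>Paris</answer>"

def toy_retrieve(query: str) -> str:
    return "Paris is the capital of France."

if __name__ == "__main__":
    text, segments = rollout(toy_generate, toy_retrieve,
                             "Q: What is the capital of France? ", max_searches=2)
    print(text)
    print(loss_mask(segments))
```

The `max_searches` parameter is the search budget in this sketch: raising it lets the policy take more retrieval turns before it must answer. The mask keeps the optimizer from being credited or penalized for text the model did not write.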
Inquiring lines that use this note as a source (59)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Can models learn when to invoke search during reasoning tasks?
- How should we allocate compute between reasoning and retrieval iterations?
- Does parallel retrieval outperform sequential search chains at test time?
- Why does retrieval chain training unlock scaling laws in QA?
- How can per-step decisions about knowledge retrieval improve reasoning over uniform policies?
- Can the scaling law for discovery extend beyond architectures to agentic systems?
- What makes some tokens carry disproportionate information about answers?
- How does hierarchical query planning versus flat prompting affect multi-source retrieval?
- How does search budget affect answer quality at test time?
- How does the three-component definition apply to test-time scaling laws?
- What scaling behavior do partial systems show without iterative query refinement?
- How do real search queries reveal what counts as a deep research question?
- Can test-time scaling prioritize genuine reasoning over pattern matching?
- Does inference-time compute scaling require explicit reasoning traces or verifiable rewards?
- What determines the optimal thinking token threshold for a given task?
- Why do parallel and sequential test-time search methods produce equivalent results under fixed budgets?
- How does test-time search budget efficiency benefit from hierarchical architectures?
- What mechanisms drive test-time compute allocation in reasoning tasks?
- Can step-level rewards improve training of agentic retrieval systems?
- Does test-time compute scaling work for agentic deep research tasks?
- When does simulated search outperform real search for agent training?
- How much does inference budget improve self-generated search performance?
- Could real-time search systems avoid era sensitivity in legal reasoning?
- What role does search capacity play in making debate more accurate?
- How does test-time scaling relate to token budget in agentic deep research?
- How should inference-time token budgets vary across models of different capability levels?
- How should token budgets be allocated when prompt-inference coupling matters?
- How does speed of AI search prevent real-time supervision and evaluation?
- What limits exist on retrieval budget during inference?
- How much does test-time compute improve reasoning without more tokens?
- How should inference compute budget be allocated across different prompt difficulties?
- What makes search budget matter for research task performance?
- Do search agents face their own overthinking threshold like reasoning models do?
- What is the optimal balance between search rounds and reasoning depth per round?
- Does parallel token spending always beat sequential spending at the same budget?
- How does proactive information-gathering capability differ from passive knowledge retrieval?
- What computational cost does trajectory-bursty inference impose on per-query context requirements?
- Why do per-turn thinking budgets matter alongside iterative retrieval depth?
- How do knowledge graphs scale as training data for open-ended search tasks?
- Can test-time compute budgets be allocated differently per query difficulty?
- Why do deep research agents outperform retrieval augmented generation systems?
- How do tool invocations drive agentic cost beyond token consumption?
- What makes inference budgets allocate adaptively per prompt difficulty?
- Can test-time scaling work through retrieval rather than reasoning?
- Can test-time scaling compound through memory consolidation into a new scaling law?
- How do timing and search internalization interact during reasoning post-training?
- Should artifact-level benchmarks replace token counts for agent evaluation?
- What inference-time scaling benefits emerge from reasoning before each prediction?
- Can high benchmark scores mislead deployment decisions for search agents?
- How do reward models guide inference-time compute allocation decisions?
- How should retrieval systems handle multi-hop reasoning and iterative information needs?
- Can inference budgets be allocated adaptively based on prompt difficulty?
- Should prompt design and inference scaling be optimized together or separately?
- Does policy entropy collapse prevent inference-time search from finding solutions?
- How does machine feedback enable discovery at test time?
- Should agents use parallel or sequential scaling during test time?
- How should experiment budgets be allocated across parallel hypothesis-testing teams?
- What other agent behaviors besides citations reveal reasoning quality?
- Do gains from harness-based agents transfer across different search benchmarks?
Related concepts in this collection (3)
- Can we allocate inference compute based on prompt difficulty?
  Does adjusting how much compute each prompt receives, rather than using a fixed budget, improve model performance? Could smarter allocation let smaller models compete with larger ones?
  extends: search budget is now a second compute axis alongside reasoning tokens; adaptive allocation must account for both
- How should we balance parallel versus sequential compute at test time?
  Test-time compute can prioritize breadth (trying many approaches) or depth (refining one approach). Which strategy works better, and does the answer depend on the problem?
  applies: parallel retrieval (multiple query variants) vs sequential retrieval (chained iterations) is the same structural trade-off
- How do internal and external test-time scaling compare?
  Explores whether test-time scaling approaches fundamentally differ in where compute is spent: during training (internal) versus at inference (external). Understanding this split clarifies the trade-offs in deployment strategy and reasoning capability.
  extends: search-based DR is the clearest case of external TTS; this finding quantifies its scaling behavior
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents
- Learning to Think: Information-Theoretic Reinforcement Fine-Tuning for LLMs
- Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets
- ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
- Reasoning Models Can Be Effective Without Thinking
- Does Thinking More always Help? Understanding Test-Time Scaling in Reasoning Models
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
Original note title
agentic deep research exhibits a test-time scaling law where search budget determines answer quality creating a new inference-compute axis