How does search budget affect answer quality at test time?

This explores whether giving a model more room to search at inference time — more retrieval rounds, more iterations — actually buys better answers, and where that trade-off stops paying off. The short version from the corpus: search behaves like a compute knob you can turn, with the same shape of payoff curve as the more familiar 'let the model think longer' knob — and like any knob, it has a sweet spot.

The cleanest result is that search budget follows its own test-time scaling law. As you let an agent do more search iterations, answer quality climbs and then flattens into diminishing returns, the same monotonic-then-plateau curve we already see with reasoning tokens (see "Does search budget scale like reasoning tokens for answer quality?"). The interesting consequence is that reasoning and searching become two interchangeable ways to spend the same inference budget: a model can trade thinking for looking things up, and you can tune the mix. Why search is worth spending on at all is its own finding: agents that retrieve from the live web beat models that rely on memorized knowledge, not because they reason better but because real-time search dodges the temporal staleness and lossy compression baked into training data (see "Why do search agents beat memorized retrieval on hard questions?").
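To make the shape of that curve concrete, here is a minimal sketch. The saturating-exponential form and every parameter value (q_min, q_max, rate, min_gain) are illustrative assumptions, not numbers from the notes; the only thing taken from the source is the qualitative shape, a monotonic rise followed by a plateau.

```python
import math


def quality(iterations: int, q_min: float = 0.40, q_max: float = 0.80,
            rate: float = 0.5) -> float:
    """Hypothetical answer quality after a number of search iterations.

    Assumed form: a saturating exponential that rises monotonically
    from q_min and flattens out just below q_max.
    """
    return q_max - (q_max - q_min) * math.exp(-rate * iterations)


def plateau_point(min_gain: float = 0.01, max_iterations: int = 50) -> int:
    """Smallest iteration count at which one MORE search round would add
    less than min_gain quality, i.e. where extra budget stops paying off."""
    for k in range(max_iterations):
        if quality(k + 1) - quality(k) < min_gain:
            return k
    return max_iterations


if __name__ == "__main__":
    for k in (0, 1, 2, 4, 8, 16):
        print(f"{k:>2} iterations -> quality {quality(k):.3f}")
    print("marginal gain drops below threshold at", plateau_point(), "iterations")
```

In practice you would fit the curve to measured accuracy at several budgets rather than assume its parameters; the point of the sketch is only that the sweet spot is defined by marginal gain, not by absolute quality.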

But 'more budget' is the wrong frame; 'budget allocated to the right place' is the right one. Uniform spending is wasteful: easy prompts get overserved while hard ones starve, and reallocating the *same* total compute adaptively by prompt difficulty can outperform a larger model run under a uniform budget (see "How should we allocate compute budget at inference time?" and "Can we allocate inference compute based on prompt difficulty?"). There's also a counterintuitive twist for multi-turn search: piling reasoning into a single turn can actively *hurt*, because it burns the context window the agent needs to absorb evidence from later retrieval rounds. Capping reasoning per turn, not just capping total time, preserves quality across iterations (see "Does limiting reasoning per turn improve multi-turn search quality?"). So budget isn't just a quantity, it's a schedule.
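The two scheduling ideas above can be sketched in a few lines. Both functions are hypothetical illustrations: the notes do not say how difficulty is estimated or how a per-turn cap should be chosen, so the proportional split and the "reserve room for expected evidence" rule are assumptions made for the example.

```python
def allocate_budget(difficulties: list[float], total_budget: int,
                    floor: int = 1) -> list[int]:
    """Split a FIXED total of search iterations across prompts.

    Assumption: each prompt has a non-negative difficulty score. Every
    prompt gets `floor` iterations; the rest is shared in proportion
    to difficulty, so hard prompts get more and easy ones less.
    """
    n = len(difficulties)
    if total_budget < floor * n:
        raise ValueError("total_budget too small to give every prompt the floor")
    spare = total_budget - floor * n
    weight_sum = sum(difficulties)
    if weight_sum == 0:
        shares = [spare / n] * n
    else:
        shares = [spare * d / weight_sum for d in difficulties]
    budgets = [floor + int(s) for s in shares]
    # Hand out iterations lost to rounding, largest fractional part first.
    leftover = total_budget - sum(budgets)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


def per_turn_reasoning_cap(context_window: int, turns: int,
                           evidence_per_turn: int) -> int:
    """Largest per-turn reasoning allowance (in tokens) that still leaves
    room for the evidence each planned retrieval round is expected to add."""
    reserved = turns * evidence_per_turn
    if reserved >= context_window:
        raise ValueError("planned evidence alone exceeds the context window")
    return (context_window - reserved) // turns


if __name__ == "__main__":
    print(allocate_budget([1, 3, 6], total_budget=13))
    print(per_turn_reasoning_cap(context_window=32_000, turns=8, evidence_per_turn=2_000))
```

Note that allocate_budget never changes the total: it only moves iterations from easy prompts to hard ones, which is the comparison the notes describe.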

There's a deeper point about what's converging and what isn't. When you control for total compute, the specific search algorithm barely matters: best-of-N and tree search land at the same accuracy, and what actually governs whether errors snowball is the search scope and the reliability of your reward/value function, not the framework name on the box (see "Does the choice of reasoning framework actually matter for test-time performance?"). In other words, more search amplifies whatever signal is steering it; a bad reward function just lets you spend more compute going wrong faster.
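A toy simulation makes the "amplifies whatever signal is steering it" point tangible. This is a sketch under stated assumptions, not the analysis from the notes: it models only best-of-N (no tree search), draws candidate quality from a standard normal, and represents reward reliability as a single hypothetical parameter, reward_corr, the correlation between the reward score and true quality.

```python
import math
import random


def best_of_n_quality(n: int, reward_corr: float, trials: int = 2000,
                      seed: int = 0) -> float:
    """Average TRUE quality of the candidate a reward function picks from n.

    Assumptions: true quality ~ N(0, 1); reward is a noisy score whose
    correlation with true quality is reward_corr (in [-1, 1]).
    """
    rng = random.Random(seed)
    noise_scale = math.sqrt(1.0 - reward_corr ** 2)
    total = 0.0
    for _ in range(trials):
        best_reward, best_true = -math.inf, 0.0
        for _ in range(n):
            true_q = rng.gauss(0.0, 1.0)
            reward = reward_corr * true_q + noise_scale * rng.gauss(0.0, 1.0)
            if reward > best_reward:
                best_reward, best_true = reward, true_q
        total += best_true
    return total / trials


if __name__ == "__main__":
    for corr in (0.9, 0.0, -0.5):
        row = ", ".join(f"N={n}: {best_of_n_quality(n, corr):+.2f}" for n in (1, 4, 16))
        print(f"reward_corr={corr:+.1f} -> {row}")
```

With a reliable reward, selected quality rises as N grows; with an uninformative reward it stays flat; with a misleading one, larger N picks worse answers, which is the "going wrong faster" case.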

The part you didn't know you wanted to know: even when the budget genuinely improves the answer, the *measurement* of quality is shaky. Benchmarks reward over-specified single-turn queries that look nothing like real search, so high scores don't predict satisfied users (see "Why do search agents fail users despite strong benchmark scores?"). And users themselves are easily fooled: they trust answers with more citations almost as much whether those citations are relevant or not, so citation count works as a decoupled trust heuristic (see "Do users trust citations more when there are simply more of them?"). Which means a system can spend its search budget to manufacture the *appearance* of quality. The honest takeaway: search budget reliably improves answers up to a plateau, but only as far as your reward signal is trustworthy and your evaluation measures something real.

Sources (8 notes)

Does search budget scale like reasoning tokens for answer quality?

Agentic deep research shows monotonic-to-diminishing-returns curves for search iterations, matching reasoning token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.

Why do search agents beat memorized retrieval on hard questions?

DeepResearcher agents trained on live web search beat static knowledge models on knowledge-intensive tasks. The mechanism is not better reasoning but retrieval: real-time search avoids temporal bounds and probabilistic compression that plague training-data memorization.

How should we allocate compute budget at inference time?

Research shows that dynamically adjusting inference compute per prompt—rather than using fixed budgets—improves performance and efficiency. Uniform spending wastes resources on easy problems while underserving hard ones.

Can we allocate inference compute based on prompt difficulty?

Research shows inference effectiveness varies dramatically by prompt difficulty. Reallocating the same total compute adaptively—giving easy prompts less and hard ones more—substantially outperforms larger models under uniform budgets.

Does limiting reasoning per turn improve multi-turn search quality?

Unrestricted reasoning within single search turns consumes context needed for subsequent retrieval rounds, degrading the agent's ability to incorporate new evidence. Setting per-turn reasoning budgets, not just overall time limits, prevents this context erosion and maintains search quality across iterations.

Does the choice of reasoning framework actually matter for test-time performance?

Information-theoretic analysis shows BoN and MCTS converge in reasoning accuracy when controlling for total compute. Snowball errors accumulate per step regardless of framework; mitigation depends on search scope and reward function reliability, not the specific algorithm.

Why do search agents fail users despite strong benchmark scores?

Search benchmarks use over-specified queries, single-turn interactions, and fixed schemas—none of which match real search. These design choices make benchmarks measure retrieval, not collaborative intent refinement, explaining why high scores don't predict user satisfaction.

Do users trust citations more when there are simply more of them?

Analysis of 24,000 Search Arena interactions shows irrelevant citations boost user preference (β=0.273) nearly as much as relevant citations (β=0.285), indicating citation count functions as a decoupled trust heuristic.
