SYNTHESIS NOTE

Does token spending drive multi-agent research performance?

Multi-agent systems outperform single agents substantially, but what actually accounts for that improvement? Is it intelligent coordination or simply spending more tokens on the same task?

Synthesis note · 2026-02-23 · sourced from Agents Multi Architecture

Anthropic's internal evaluation of their multi-agent research system reveals a surprising decomposition: on the BrowseComp evaluation, token usage by itself explains 80% of the performance variance, with the number of tool calls and model choice as the remaining two explanatory factors. Together, these three factors explain 95% of variance.

The implication is uncomfortable: multi-agent systems work primarily because they spend enough tokens, not because they coordinate intelligently. Subagents facilitate compression by operating in parallel with their own context windows, exploring different aspects simultaneously before condensing the most important tokens for the lead agent. Each subagent provides separation of concerns — distinct tools, prompts, and exploration trajectories — which reduces path dependency.

However, the economics are revealing. Multi-agent with Claude Opus as lead and Claude Sonnet subagents outperforms single-agent Opus by 90.2% on breadth-first research. Agents use roughly 4× more tokens than chat interactions, and multi-agent systems use approximately 15× more tokens than chats. Upgrading to a newer Claude Sonnet is a larger performance gain than doubling the token budget on the older model — meaning model capability multiplies token efficiency.

The practical design principle: multi-agent architectures effectively scale token usage for tasks that exceed the limits of single agents. But they excel specifically at tasks involving heavy parallelization, information that exceeds single context windows, and interfacing with numerous complex tools. Tasks requiring shared context or with many inter-agent dependencies are not a good fit.

Since Does search budget scale like reasoning tokens for answer quality?, the Anthropic finding extends the TTS law from search steps to token budget directly — and confirms that the scaling mechanism is fundamentally about compute quantity, not coordination quality.

Inquiring lines that use this note as a source 18

This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.

Related concepts in this collection 4

This note in its neighbourhood — explore the map, then jump to a related concept in the list below.

Concept map

14 direct connections · 104 in 2-hop network ·medium cluster Open in graph ↗

Does token spending drive multi-agent research p… Does search budget scale like reasoning tokens for… Why does parallel reasoning outperform single chai… What makes deep research fundamentally different f… Do hierarchical retrieval architectures outperform…

Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph

your link semantically near linked from elsewhere

Does search budget scale like reasoning tokens for answer quality? Explores whether the test-time scaling law that applies to reasoning tokens also governs search-based retrieval in agentic systems. Understanding this relationship could reshape how we allocate inference compute between thinking and searching.
TTS law generalizes from search steps to token budget; both show monotonic-to-diminishing-returns curves
Why does parallel reasoning outperform single chain thinking? Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth.
multi-agent research is the system-level analog of parallel thinking; breadth-first search via parallel agents
What makes deep research fundamentally different from RAG? Explores whether current systems using the label 'deep research' actually meet a rigorous three-component definition involving multi-step gathering, cross-source synthesis, and iterative refinement, or if they're performing something narrower.
multi-agent research meets all three components; subagents handle multi-step gathering, lead handles synthesis
Do hierarchical retrieval architectures outperform flat ones on complex queries? Explores whether separating query planning from answer synthesis into distinct architectural components improves performance on multi-hop retrieval tasks compared to unified single-pass approaches.
Anthropic's lead/subagent hierarchy is an instantiation of this pattern

Does token spending drive multi-agent research performance?

Related concepts in this collection 4

Related papers in this collection 8

Search by related questions 4