Does token spending drive multi-agent research performance?
Multi-agent systems outperform single agents substantially, but what actually accounts for that improvement? Is it intelligent coordination or simply spending more tokens on the same task?
Anthropic's internal evaluation of its multi-agent research system reveals a surprising decomposition: on the BrowseComp evaluation, token usage by itself explains 80% of the performance variance, with the number of tool calls and model choice as the other two explanatory factors. Together, the three factors explain 95% of the variance.
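To make "explains 80% of the variance" concrete, here is a minimal sketch of how such a decomposition can be computed: fit an ordinary least squares model on token usage alone, then on all three factors, and compare the R² values. The data below is synthetic and the coefficients are invented for illustration. It is not Anthropic's data or method, and the helper names (`solve`, `r_squared`) are hypothetical.

```python
import random


def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def r_squared(features, y):
    """Fit least squares with an intercept; return the fraction of variance in y explained."""
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    preds = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - p) ** 2 for yi, p in zip(y, preds))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Synthetic runs: every number here is an assumption chosen for illustration.
rng = random.Random(0)
runs = []
for _ in range(200):
    log_tokens = rng.uniform(3.0, 6.0)    # log10 of tokens spent
    tool_calls = rng.randint(1, 30)       # number of tool invocations
    stronger_model = rng.choice([0, 1])   # 1 if the stronger model was used
    score = (0.2 * log_tokens + 0.005 * tool_calls
             + 0.05 * stronger_model + rng.gauss(0, 0.05))
    runs.append((log_tokens, tool_calls, stronger_model, score))

scores = [r[3] for r in runs]
r2_tokens = r_squared([(r[0],) for r in runs], scores)
r2_all = r_squared([(r[0], r[1], r[2]) for r in runs], scores)
print(f"tokens only: R^2 = {r2_tokens:.2f}; all three factors: R^2 = {r2_all:.2f}")
```

Adding predictors to a least squares fit can never lower the in-sample R², so the informative quantity is how much the second number exceeds the first. In the figures reported above, that gap is 15 percentage points.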
The implication is uncomfortable: multi-agent systems work primarily because they spend enough tokens, not because they coordinate intelligently. The architecture matters mainly as a way to make that spending possible. Subagents facilitate compression by operating in parallel with their own context windows, exploring different aspects simultaneously before condensing the most important tokens for the lead agent. Each subagent provides separation of concerns (distinct tools, prompts, and exploration trajectories), which reduces path dependency.
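The parallel-then-condense pattern can be sketched in a few lines. This is an illustrative skeleton only: `run_subagent` and `lead_agent` are hypothetical names, and the subagent body is a stub standing in for a real model call with its own context window and tools.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subagent(subtask: str) -> str:
    """Stub subagent: explores one subtask in isolation and returns a condensed result."""
    # In a real system this would be a model call with its own prompt, tools,
    # and context window. Here we fabricate five placeholder findings.
    findings = [f"{subtask}: finding {i}" for i in range(1, 6)]
    # Condense: only the most important item is passed up to the lead agent,
    # so the lead's context holds 1 line per subagent rather than 5.
    return findings[0]


def lead_agent(question: str, subtasks: list[str]) -> str:
    """Stub lead agent: fans subtasks out in parallel, then synthesizes the summaries."""
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        # pool.map returns results in the same order as the input subtasks.
        summaries = list(pool.map(run_subagent, subtasks))
    return f"{question} -> " + "; ".join(summaries)


report = lead_agent("Q", ["pricing", "benchmarks"])
print(report)
```

Each subagent spends tokens the lead never sees, which is how total token usage grows without overflowing any single context window.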
The economics are revealing. A multi-agent system with Claude Opus as the lead agent and Claude Sonnet subagents outperforms single-agent Opus by 90.2% on breadth-first research. That gain is expensive: agents use roughly 4× more tokens than chat interactions, and multi-agent systems use approximately 15× more tokens than chats. Upgrading to a newer Claude Sonnet yields a larger performance gain than doubling the token budget on the older model, meaning model capability multiplies token efficiency.
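The token multipliers translate directly into budget arithmetic. The sketch below takes the approximate 4× and 15× figures at face value. The function names and the 2,000-token chat baseline are assumptions for illustration.

```python
# Approximate multipliers relative to a chat interaction, as quoted above.
TOKEN_MULTIPLIER = {"chat": 1.0, "single_agent": 4.0, "multi_agent": 15.0}


def estimated_tokens(chat_tokens: float, mode: str) -> float:
    """Rough token estimate for a task that would cost `chat_tokens` as a plain chat."""
    return chat_tokens * TOKEN_MULTIPLIER[mode]


def relative_overhead(mode_a: str, mode_b: str) -> float:
    """How many times more tokens mode_a uses than mode_b."""
    return TOKEN_MULTIPLIER[mode_a] / TOKEN_MULTIPLIER[mode_b]


baseline = 2000  # hypothetical chat-sized task
print(estimated_tokens(baseline, "single_agent"))          # 8000.0
print(estimated_tokens(baseline, "multi_agent"))           # 30000.0
print(relative_overhead("multi_agent", "single_agent"))    # 3.75
```

Under these figures, moving from a single agent to a multi-agent system costs about 3.75× more tokens, so the task has to be valuable enough to justify that step on top of the agent overhead already paid.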
The practical design principle: multi-agent architectures effectively scale token usage for tasks that exceed the limits of single agents. But they excel specifically at tasks involving heavy parallelization, information that exceeds single context windows, and interfacing with numerous complex tools. Tasks requiring shared context or with many inter-agent dependencies are not a good fit.
Building on the related note "Does search budget scale like reasoning tokens for answer quality?", the Anthropic finding extends the test-time scaling (TTS) law from search steps to token budget directly, and supports the claim that the scaling mechanism is fundamentally about compute quantity, not coordination quality.
Inquiring lines that use this note as a source (18)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- What happens to token value when populations surrender cognitively at different rates?
- How do multi-agent systems improve on single frontier models?
- When does multi-agent voting help versus hurt performance on tasks?
- What token budget tradeoff exists between parallel chains and aggregation?
- Why does literature review benefit most from multi-agent orchestration approaches?
- Do multi-agent systems justify their token costs with genuine quality gains?
- Why do multi-agent systems use 15 times more tokens than chat interactions?
- How does test-time scaling relate to token budget in agentic deep research?
- Does upgrading model capability improve token efficiency in agentic systems?
- Which research tasks are better suited for multi-agent versus single-agent approaches?
- Can token probability distributions extend swarm composition across different model architectures?
- Does parallel token spending always beat sequential spending at the same budget?
- Can latent communication reduce the token cost of multi-agent systems?
- When is 15x token overhead actually worth the compute cost?
- What metrics replace throughput per token for agent deployment?
- How do tool invocations drive agentic cost beyond token consumption?
- Can two agents with identical token counts produce vastly different outputs?
- Why do frontier models remain cost-effective despite higher token prices in production?
Related concepts in this collection (4)
- Does search budget scale like reasoning tokens for answer quality?
  Explores whether the test-time scaling law that applies to reasoning tokens also governs search-based retrieval in agentic systems. Understanding this relationship could reshape how we allocate inference compute between thinking and searching.
  TTS law generalizes from search steps to token budget; both show monotonic-to-diminishing-returns curves
- Why does parallel reasoning outperform single chain thinking?
  Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth.
  multi-agent research is the system-level analog of parallel thinking; breadth-first search via parallel agents
- What makes deep research fundamentally different from RAG?
  Explores whether current systems using the label 'deep research' actually meet a rigorous three-component definition involving multi-step gathering, cross-source synthesis, and iterative refinement, or if they're performing something narrower.
  multi-agent research meets all three components; subagents handle multi-step gathering, lead handles synthesis
- Do hierarchical retrieval architectures outperform flat ones on complex queries?
  Explores whether separating query planning from answer synthesis into distinct architectural components improves performance on multi-hop retrieval tasks compared to unified single-pass approaches.
  Anthropic's lead/subagent hierarchy is an instantiation of this pattern
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- How we built our multi-agent research system
- Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets
- Deep Research: A Systematic Survey
- LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries
- Towards a Science of Scaling Agent Systems
- Persistent AI Agents in Academic Research: A Single-Investigator Implementation Case Study
- Artifacts as Memory Beyond the Agent Boundary
- AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors in Agents
Original note title
multi-agent research performance is primarily a token spending function — token usage explains 80 percent of variance while model choice and tool calls explain the remainder