When does adding more agents actually help systems?
Multi-agent systems often fail in practice, but the reasons remain unclear. This research investigates whether coordination overhead, task properties, or system architecture determine when agents improve or degrade performance.
The question of when multi-agent systems help and when they hurt has been answered with heuristics. This paper replaces heuristics with measurement. Across 180 configurations (5 architectures × 3 LLM families × 4 benchmarks), three dominant effects emerge:
1. Tool-coordination trade-off (β=−0.330, p<0.001): tool-heavy tasks suffer disproportionately from multi-agent overhead. The mechanism is token budget fragmentation — multi-agent systems split per-agent capacity, leaving insufficient tokens for complex tool orchestration. A 16-tool software engineering task under multi-agent coordination loses more than a 2-tool financial reasoning task.
2. Capability saturation (β=−0.408, p<0.001): once single-agent baselines exceed approximately 45% accuracy, coordination yields diminishing or negative returns. Coordination costs exceed improvement potential. This is a measurable threshold, not a vague guideline.
3. Topology-dependent error amplification: independent agents amplify errors 17.2× through unchecked propagation, while centralized coordination contains this to 4.4× via validation bottlenecks that catch errors before aggregation. The architecture is the error control mechanism.
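The token budget fragmentation behind the first effect can be illustrated with simple arithmetic. The Python sketch below is a toy model, not the paper's measurement: it assumes an even split of a fixed token budget across agents, a flat per-agent coordination cost, and that each agent must reason over the full tool set. The function name, the 32,000-token budget, the 1,000-token coordination cost, and the 500-token per-tool floor are all hypothetical.

```python
def tokens_per_tool(total_budget: int, n_agents: int, n_tools: int,
                    coordination_tokens_per_agent: int = 0) -> float:
    """Tokens one agent can spend per tool after the budget is split.

    Toy assumptions (not from the source): the budget is divided evenly,
    each agent pays a flat coordination cost, and every agent has to
    handle all n_tools.
    """
    if n_agents < 1 or n_tools < 1:
        raise ValueError("n_agents and n_tools must be >= 1")
    per_agent = total_budget / n_agents - coordination_tokens_per_agent
    return max(per_agent, 0.0) / n_tools


# Hypothetical numbers chosen only to make the effect visible.
BUDGET = 32_000
MIN_TOKENS_PER_TOOL = 500  # assumed floor for describing and calling one tool

for label, n_tools in [("16-tool task", 16), ("2-tool task", 2)]:
    single = tokens_per_tool(BUDGET, n_agents=1, n_tools=n_tools)
    multi = tokens_per_tool(BUDGET, n_agents=4, n_tools=n_tools,
                            coordination_tokens_per_agent=1_000)
    enough = multi >= MIN_TOKENS_PER_TOOL
    print(f"{label}: single={single:.1f}, four agents={multi:.1f}, "
          f"enough per tool={enough}")
```

Under these assumed numbers the split shrinks both tasks by the same ratio, but only the 16-tool task drops below the per-tool floor, which is the sense in which tool-heavy tasks suffer disproportionately.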
The practical consequences are sharp. Centralized coordination improves performance by 80.9% on parallelizable tasks (financial reasoning). Decentralized coordination excels on dynamic web navigation (+9.2% vs +0.2%). But for sequential reasoning tasks, every multi-agent variant degrades performance by 39-70%. Architecture-task alignment, not agent count, determines success.
The predictive model (R²=0.513, 87% accuracy on held-out configurations) uses measurable task properties — not post-hoc analysis. This means architecture selection can be principled rather than intuitive. The underlying mechanisms are interpretable: fragmentation, overhead exceeding marginal gains, and error propagation without validation.
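As an illustration of what a principled selection rule could look like, the Python sketch below encodes the findings reported in this note as a hand-written decision rule. It is not the paper's predictive model, whose features and coefficients are not reproduced here; the names TaskType, Architecture, and select_architecture, and the order in which the rules are checked, are assumptions made for this example.

```python
from enum import Enum


class TaskType(Enum):
    SEQUENTIAL = "sequential"          # e.g. step-by-step reasoning
    PARALLELIZABLE = "parallelizable"  # e.g. financial reasoning
    DYNAMIC = "dynamic"                # e.g. web navigation


class Architecture(Enum):
    SINGLE_AGENT = "single_agent"
    CENTRALIZED = "centralized"
    DECENTRALIZED = "decentralized"


# Approximate saturation point reported in the note (about 45% accuracy).
SATURATION_THRESHOLD = 0.45


def select_architecture(task_type: TaskType,
                        single_agent_accuracy: float) -> Architecture:
    """Pick an architecture from task type and single-agent baseline accuracy.

    Hypothetical rule ordering: sequential tasks and saturated baselines
    stay single-agent; otherwise the task type decides the topology.
    """
    if not 0.0 <= single_agent_accuracy <= 1.0:
        raise ValueError("single_agent_accuracy must be in [0, 1]")
    if task_type is TaskType.SEQUENTIAL:
        # Every multi-agent variant degraded sequential reasoning.
        return Architecture.SINGLE_AGENT
    if single_agent_accuracy > SATURATION_THRESHOLD:
        # Past saturation, coordination costs exceed improvement potential.
        return Architecture.SINGLE_AGENT
    if task_type is TaskType.PARALLELIZABLE:
        return Architecture.CENTRALIZED
    return Architecture.DECENTRALIZED


print(select_architecture(TaskType.PARALLELIZABLE, 0.30).value)
print(select_architecture(TaskType.DYNAMIC, 0.30).value)
print(select_architecture(TaskType.PARALLELIZABLE, 0.60).value)
```

A real selector would replace these hard-coded branches with the fitted model and its measured task properties; the sketch only shows that the inputs are observable before the system is built.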
In relation to "How should we balance parallel versus sequential compute at test time?", this finding provides the multi-agent instantiation: parallel multi-agent coordination helps on parallelizable tasks and hurts on sequential ones. The 45% saturation threshold adds a quantitative decision boundary that the test-time scaling (TTS) literature lacks.
MasRouter's per-query topology routing (from Arxiv/Routers): MasRouter directly addresses the topology-dependent error amplification finding. Rather than choosing a fixed topology and accepting its scaling limitations, MasRouter routes each query to the optimal collaboration mode (Chain/Tree/Graph) via a variational latent variable model. This transforms topology from a fixed architectural choice into a per-query routing decision — the system can use centralized coordination for tasks where error propagation matters (financial reasoning) and decentralized coordination for dynamic tasks (web navigation). The 87% prediction accuracy of the scaling laws framework suggests routing decisions could be validated: does MasRouter's topology selection correlate with what the scaling laws predict would work best? See What decisions must multi-agent routing systems optimize simultaneously?.
The endogeneity paradox: autonomy degree is itself a scaling variable. The largest coordination experiment to date (25,000 tasks, 8 models, 4-256 agents, Drop the Hierarchy and Roles) reveals that the optimal coordination topology is not fixed but depends on model capability. A hybrid protocol with fixed ordering but autonomous role selection outperforms both centralized (+14%) and fully autonomous (+44%) coordination. Below a capability threshold, the relationship reverses — weak models need rigid structure. This adds a fourth scaling law: the degree of endogenous coordination is capability-contingent. The topology-dependent error amplification finding from this note interacts with autonomy level: self-organizing agents with strong models develop voluntary self-abstention (agents withdraw when they lack competence) and dynamic role invention (5,006 unique roles from 8 agents), producing emergent structures that fixed topologies cannot match. See Do self-organizing agent teams outperform rigid hierarchies?.
SAS vs MAS capabilities converge as frontier models improve. "Single-agent or Multi-agent? Why Not Both?" (2025) finds that the benefits of multi-agent systems (MAS) over single-agent systems (SAS) diminish as LLMs gain long-context reasoning, memory retention, and tool use, which mitigate the limitations that originally motivated MAS designs. The paper formalizes three defect types as dependency-graph problems: node-level (a bottleneck agent caps performance), edge-level (downstream agents are overwhelmed by upstream inputs, analogous to overthinking from external information), and path-level (indecisive errors propagate as crucial context is lost during inter-agent summarization). A hybrid SAS/MAS cascading approach using confidence-guided routing improves accuracy by 1.1-12% while reducing costs by up to 88%. The exception is AIME (the hardest math tasks), where MAS consistently outperforms, confirming the value of MAS at extreme difficulty.
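The confidence-guided cascade idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the cited paper's implementation: how confidence is estimated and where the threshold sits are not specified in this note, so the cascade function, the 0.8 threshold, and the toy agents below are all hypothetical.

```python
from typing import Callable, Tuple

# A single-agent result: (answer text, self-reported confidence in [0, 1]).
ScoredAnswer = Tuple[str, float]


def cascade(query: str,
            single_agent: Callable[[str], ScoredAnswer],
            multi_agent: Callable[[str], str],
            confidence_threshold: float = 0.8) -> Tuple[str, str]:
    """Try the cheap single agent first; escalate only when it is unsure.

    Returns (answer, route), where route is "SAS" or "MAS".
    The threshold value is an assumption made for illustration.
    """
    answer, confidence = single_agent(query)
    if confidence >= confidence_threshold:
        return answer, "SAS"
    # Low confidence: pay the coordination cost of the multi-agent system.
    return multi_agent(query), "MAS"


# Toy stand-ins for real systems (hypothetical behavior).
def toy_single_agent(query: str) -> ScoredAnswer:
    confidence = 0.9 if len(query) < 20 else 0.4
    return f"SAS answer to: {query}", confidence


def toy_multi_agent(query: str) -> str:
    return f"MAS answer to: {query}"


print(cascade("2 + 2?", toy_single_agent, toy_multi_agent))
print(cascade("Prove the claim for every prime p > 3.",
              toy_single_agent, toy_multi_agent))
```

The cost saving in such a design comes from the multi-agent path running only on the queries where the single agent reports low confidence.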
Homogeneous-isolation scaling confirms diminishing returns and locates the cause (SIMAS, https://arxiv.org/abs/2606.00655). Where the three laws above mix heterogeneous architectures and models, "Scaling Behavior of Single LLM-Driven Multi-Agent Systems" isolates the collaboration variable alone: a homogeneous MAS (one base model, sequential inter-agent communication, the minimalist SIMAS framework) scaled purely by agent count. The result is clean: performance does not scale monotonically with agent count but follows diminishing returns governed by a trade-off between collaborative synergy and coordination overhead. This is the same synergy-vs-overhead mechanism behind the 45% saturation threshold, now shown in the absence of model or knowledge heterogeneity. Two refinements: effective collaboration first requires a sufficiently capable base LLM (below a capability floor, adding agents cannot help), and task type critically modulates the optimal agent count. The framing worth keeping is that collective intelligence is an emergent property contingent on deliberate interaction design, not a guaranteed outcome of agent plurality; without architectural support for synthesis and refinement, multi-agent dialogue just adds overhead. This is the homogeneous, cause-isolating complement to the predictive scaling laws.
Inquiring lines that use this note as a source (22)
This note is a source for these synthesized inquiries.
- How does the agentic layer amplify individual agent failure modes?
- What distinguishes task failure from communication breakdown in multi-agent systems?
- When does multi-agent voting help versus hurt performance on tasks?
- Which research tasks are better suited for multi-agent versus single-agent approaches?
- Does parallel task structure determine optimal multi-agent architecture?
- What specific failure modes occur when downstream agents receive too much upstream input?
- At what task difficulty does multi-agent decomposition become worth the coordination cost?
- How does collaboration topology choice affect error amplification in multi-agent systems?
- Which failure mode most limits current multi-agent performance?
- How does distributed coordination fail as agent networks scale?
- What coordination failures emerge when multiple agents work together?
- What tasks do AI agents still fail at most often?
- What capability threshold do agents need to self-organize effectively?
- Does horizontal coordination improve with stronger individual agents?
- At what capability threshold does multi-agent coordination stop helping?
- Which ecosystem conditions matter most for agent deployment success?
- Which layer of agent systems creates the largest capability gains in practice?
- How do evaluation methods differ for single versus multi-agent systems?
- Where does agent reliability come from if not better tools?
- What four decisions matter most in multi-agent system routing?
- How do externalizing cognitive work and coordination infrastructure relate to agent reliability?
- Can multi-agent teams solve problems better than single models thinking longer?
Related concepts in this collection (5)
- How should we balance parallel versus sequential compute at test time?
  Test-time compute can prioritize breadth (trying many approaches) or depth (refining one approach). Which strategy works better, and does the answer depend on the problem?
  the same parallel/sequential dichotomy at the agent level rather than the token level
- Why does parallel reasoning outperform single chain thinking?
  Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth.
  single-agent token-level parallel scaling; multi-agent is the system-level analog with different economics
- Why do multi-agent LLM systems converge without genuine deliberation?
  Multi-agent reasoning systems are designed to improve answers through debate, but often agents simply agree with early confident claims rather than genuinely disagreeing. What drives this pattern and how common is it?
  error amplification connects: independent agents propagate errors; silent agreement is one mechanism
- Can extreme task decomposition enable reliable execution at million-step scale?
  Can breaking tasks into maximally atomic subtasks with voting-based error correction solve the fundamental reliability problem in long-horizon tasks? This challenges whether better models or better decomposition is the path to high-reliability AI systems.
  MAKER's extreme decomposition as one architecture choice; this paper quantifies when decomposition helps vs hurts
- What decisions must multi-agent routing systems optimize simultaneously?
  Standard LLM routing only picks which model to use. But multi-agent systems involve four interdependent choices: topology, agent count, role assignment, and per-agent model selection. Does optimizing all four together actually improve performance?
  MasRouter: per-query topology routing as response to topology-dependent error amplification
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Towards a Science of Scaling Agent Systems
- Artifacts as Memory Beyond the Agent Boundary
- Scaling Behavior of Single LLM-Driven Multi-Agent Systems
- Drop the Hierarchy and Roles: How Self-Organizing LLM Agents Outperform Designed Structures
- Why Do Multi-agent LLM Systems Fail?
- Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets
- How we built our multi-agent research system
- A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems
Original note title
multi-agent scaling follows three quantitative laws — tool-coordination trade-off capability saturation at 45 percent and topology-dependent error amplification