SYNTHESIS NOTE

When does adding more agents actually help systems?

Multi-agent systems often fail in practice, but the reasons remain unclear. This research investigates whether coordination overhead, task properties, or system architecture determine when agents improve or degrade performance.

Synthesis note · 2026-02-23 · sourced from Agents Multi Architecture

The question of when multi-agent systems help and when they hurt has so far been answered with heuristics. This paper replaces heuristics with measurement. Across 180 configurations spanning 5 architectures, 3 LLM families, and 4 benchmarks, three dominant effects emerge:

1. Tool-coordination trade-off (β=−0.330, p<0.001): tool-heavy tasks suffer disproportionately from multi-agent overhead. The mechanism is token budget fragmentation — multi-agent systems split per-agent capacity, leaving insufficient tokens for complex tool orchestration. A 16-tool software engineering task under multi-agent coordination loses more than a 2-tool financial reasoning task.

2. Capability saturation (β=−0.408, p<0.001): once single-agent baselines exceed approximately 45% accuracy, coordination yields diminishing or negative returns. Coordination costs exceed improvement potential. This is a measurable threshold, not a vague guideline.

3. Topology-dependent error amplification: independent agents amplify errors 17.2× through unchecked propagation, while centralized coordination contains this to 4.4× via validation bottlenecks that catch errors before aggregation. The architecture is the error control mechanism.
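
The fragmentation mechanism behind the first effect can be illustrated with simple arithmetic. The sketch below is a toy model, not the paper's measurement: it assumes a fixed total token budget (the hypothetical `BUDGET`), an even split across agents, and an even split of each agent's share across the tools it orchestrates. The function name `tokens_per_tool` is illustrative only.

```python
def tokens_per_tool(total_budget: int, n_agents: int, n_tools: int) -> int:
    """Tokens one agent can spend per tool under an even split (toy assumption)."""
    if n_agents < 1 or n_tools < 1:
        raise ValueError("n_agents and n_tools must be >= 1")
    per_agent = total_budget // n_agents  # budget is fragmented across agents
    return per_agent // n_tools           # each agent spreads its share over the tools


BUDGET = 32_000  # hypothetical total token budget, not a figure from the paper

for label, tools in [("16-tool software engineering", 16),
                     ("2-tool financial reasoning", 2)]:
    single = tokens_per_tool(BUDGET, 1, tools)
    multi = tokens_per_tool(BUDGET, 4, tools)
    print(f"{label}: {single} -> {multi} tokens per tool with 4 agents")
```

Under these assumptions both tasks lose the same factor of four, but the 16-tool task ends at 500 tokens per tool against 4,000 for the 2-tool task, which is one way to read why tool-heavy tasks run out of room first.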

The practical consequences are sharp. Centralized coordination improves performance by 80.9% on parallelizable tasks (financial reasoning). Decentralized coordination excels on dynamic web navigation (+9.2% vs +0.2%). But for sequential reasoning tasks, every multi-agent variant degrades performance by 39-70%. Architecture-task alignment, not agent count, determines success.

The predictive model (R²=0.513, 87% accuracy on held-out configurations) uses measurable task properties — not post-hoc analysis. This means architecture selection can be principled rather than intuitive. The underlying mechanisms are interpretable: fragmentation, overhead exceeding marginal gains, and error propagation without validation.
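
As a rough illustration of what principled selection could look like, the sketch below encodes the findings stated in this note as a rule of thumb. It is not the paper's regression model: the task labels, the `Task` type, the `suggest_architecture` function, and the order of the checks are all assumptions made for this example. Only the approximate 45% threshold and the task-to-architecture pairings come from the note.

```python
from dataclasses import dataclass

# Approximate single-agent accuracy above which coordination stops paying off.
SATURATION_THRESHOLD = 0.45


@dataclass(frozen=True)
class Task:
    structure: str                # "sequential", "parallelizable", or "dynamic" (hypothetical labels)
    single_agent_accuracy: float  # measured single-agent baseline in [0, 1]


def suggest_architecture(task: Task) -> str:
    """Rule-of-thumb selector based on the findings summarized in this note."""
    if not 0.0 <= task.single_agent_accuracy <= 1.0:
        raise ValueError("single_agent_accuracy must be in [0, 1]")
    if task.structure == "sequential":
        return "single-agent"   # every multi-agent variant degraded sequential reasoning
    if task.single_agent_accuracy > SATURATION_THRESHOLD:
        return "single-agent"   # capability saturation: diminishing or negative returns
    if task.structure == "parallelizable":
        return "centralized"    # e.g. financial reasoning
    if task.structure == "dynamic":
        return "decentralized"  # e.g. dynamic web navigation
    raise ValueError(f"unknown task structure: {task.structure!r}")


print(suggest_architecture(Task("parallelizable", 0.30)))
print(suggest_architecture(Task("dynamic", 0.60)))
```

Treating saturation as a hard cutoff is a simplification. The note describes diminishing or negative returns above the threshold, so a real selector would weigh predicted gain against coordination cost rather than switch abruptly.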

Relative to How should we balance parallel versus sequential compute at test time?, this finding provides the multi-agent instantiation: parallel multi-agent coordination helps on parallelizable tasks and hurts on sequential ones. The 45% saturation threshold adds a quantitative decision boundary that the test-time scaling (TTS) literature lacks.

MasRouter's per-query topology routing (from Arxiv/Routers): MasRouter directly addresses the topology-dependent error amplification finding. Rather than choosing a fixed topology and accepting its scaling limitations, MasRouter routes each query to the optimal collaboration mode (Chain/Tree/Graph) via a variational latent variable model. This transforms topology from a fixed architectural choice into a per-query routing decision — the system can use centralized coordination for tasks where error propagation matters (financial reasoning) and decentralized coordination for dynamic tasks (web navigation). The 87% prediction accuracy of the scaling laws framework suggests routing decisions could be validated: does MasRouter's topology selection correlate with what the scaling laws predict would work best? See What decisions must multi-agent routing systems optimize simultaneously?.

The endogeneity paradox: autonomy degree is itself a scaling variable. The largest coordination experiment to date (25,000 tasks, 8 models, 4-256 agents, Drop the Hierarchy and Roles) reveals that the optimal coordination topology is not fixed but depends on model capability. A hybrid protocol with fixed ordering but autonomous role selection outperforms both centralized (+14%) and fully autonomous (+44%) coordination. Below a capability threshold, the relationship reverses — weak models need rigid structure. This adds a fourth scaling law: the degree of endogenous coordination is capability-contingent. The topology-dependent error amplification finding from this note interacts with autonomy level: self-organizing agents with strong models develop voluntary self-abstention (agents withdraw when they lack competence) and dynamic role invention (5,006 unique roles from 8 agents), producing emergent structures that fixed topologies cannot match. See Do self-organizing agent teams outperform rigid hierarchies?.

SAS vs MAS capabilities converge as frontier models improve. "Single-agent or Multi-agent? Why Not Both?" (2025) finds that MAS benefits diminish as LLMs gain long-context reasoning, memory retention, and tool use, which mitigates the limitations that originally motivated MAS designs. The paper formalizes three defect types as dependency graph problems: node-level (a bottleneck agent caps performance), edge-level (downstream agents are overwhelmed by upstream inputs, analogous to overthinking from external information), and path-level (indecisive errors propagate as crucial context is lost during inter-agent summarization). A hybrid SAS/MAS cascading approach using confidence-guided routing improves accuracy by 1.1-12% while reducing costs by up to 88%. The exception is AIME (hardest math), where MAS consistently outperforms, confirming the value of MAS at extreme difficulty.
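
The cascading idea can be sketched as follows: try the single agent first and escalate to the multi-agent system only when confidence is low. This is a minimal sketch under stated assumptions, not the paper's implementation. The `cascade` function, the 0.8 threshold, and both stub agents are hypothetical, and how confidence is actually estimated is not described in this note.

```python
from typing import Callable, Tuple


def cascade(query: str,
            single_agent: Callable[[str], Tuple[str, float]],
            multi_agent: Callable[[str], str],
            threshold: float = 0.8) -> Tuple[str, str]:
    """Return (answer, route). The 0.8 threshold is a hypothetical default."""
    answer, confidence = single_agent(query)
    if confidence >= threshold:
        return answer, "SAS"          # cheap path: keep the single-agent answer
    return multi_agent(query), "MAS"  # escalate only when confidence is low


# Stub agents for demonstration only; real systems would call an LLM here.
def stub_single_agent(query: str) -> Tuple[str, float]:
    confidence = 0.95 if len(query) < 20 else 0.30  # fake confidence signal
    return f"SAS answer to: {query}", confidence


def stub_multi_agent(query: str) -> str:
    return f"MAS answer to: {query}"


print(cascade("2 + 2 = ?", stub_single_agent, stub_multi_agent))
print(cascade("Prove this olympiad-level inequality", stub_single_agent, stub_multi_agent))
```

The cost saving in such a design comes from how often the cheap path is taken, so the threshold trades accuracy on hard queries against spend on easy ones.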

Homogeneous-isolation scaling confirms diminishing returns and locates the cause (SIMAS, https://arxiv.org/abs/2606.00655). Where the three laws above mix heterogeneous architectures and models, "Scaling Behavior of Single LLM-Driven Multi-Agent Systems" isolates the collaboration variable alone: a homogeneous MAS (one base model, sequential inter-agent communication, the minimalist SIMAS framework) scaled purely by agent count. The result is clean: performance does not scale monotonically with agent count but follows diminishing returns governed by a trade-off between collaborative synergy and coordination overhead. This is the same synergy-vs-overhead mechanism behind the 45% saturation threshold, now shown in the absence of model or knowledge heterogeneity. Two refinements: effective collaboration first requires a sufficiently capable base LLM (below a capability floor, adding agents cannot help), and task type critically modulates the optimal agent count. The framing worth keeping is that collective intelligence is an emergent property contingent on deliberate interaction design, not a guaranteed outcome of agent plurality. Without architectural support for synthesis and refinement, multi-agent dialogue just adds overhead. This is the homogeneous, cause-isolating complement to the predictive scaling laws.
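
The synergy-versus-overhead trade-off can be made concrete with a toy model. The functional form below (logarithmic synergy, linear overhead) and the coefficients are assumptions chosen for illustration. They are not taken from the SIMAS paper, which this note describes only qualitatively.

```python
import math


def modeled_gain(n_agents: int, synergy: float, overhead: float) -> float:
    """Toy net gain over one agent: concave synergy minus linear coordination cost."""
    return synergy * math.log(n_agents) - overhead * (n_agents - 1)


def best_agent_count(synergy: float, overhead: float, max_agents: int = 32) -> int:
    """Agent count in 1..max_agents that maximizes the toy net gain."""
    return max(range(1, max_agents + 1),
               key=lambda n: modeled_gain(n, synergy, overhead))


# Hypothetical coefficients: strong synergy supports a small team...
print(best_agent_count(synergy=0.10, overhead=0.02))  # 5
# ...while weak synergy (for example, a base model below the capability floor) favors one agent.
print(best_agent_count(synergy=0.01, overhead=0.02))  # 1
```

In this toy form the optimum sits near synergy divided by overhead, so anything that raises coordination cost or lowers collaborative benefit pushes the best agent count toward one.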

Related concepts in this collection 5

How should we balance parallel versus sequential compute at test time? Test-time compute can prioritize breadth (trying many approaches) or depth (refining one approach). Which strategy works better, and does the answer depend on the problem?
the same parallel/sequential dichotomy at the agent level rather than the token level

Why does parallel reasoning outperform single chain thinking? Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth.
single-agent token-level parallel scaling; multi-agent is the system-level analog with different economics

Why do multi-agent LLM systems converge without genuine deliberation? Multi-agent reasoning systems are designed to improve answers through debate, but often agents simply agree with early confident claims rather than genuinely disagreeing. What drives this pattern and how common is it?
error amplification connects: independent agents propagate errors; silent agreement is one mechanism

Can extreme task decomposition enable reliable execution at million-step scale? Can breaking tasks into maximally atomic subtasks with voting-based error correction solve the fundamental reliability problem in long-horizon tasks? This challenges whether better models or better decomposition is the path to high-reliability AI systems.
MAKER's extreme decomposition as one architecture choice; this paper quantifies when decomposition helps vs hurts

What decisions must multi-agent routing systems optimize simultaneously? Standard LLM routing only picks which model to use. But multi-agent systems involve four interdependent choices: topology, agent count, role assignment, and per-agent model selection. Does optimizing all four together actually improve performance?
MasRouter: per-query topology routing as response to topology-dependent error amplification
