SYNTHESIS NOTE
Agentic Systems and Tool Use

When does adding more agents actually help systems?

Multi-agent systems often fail in practice, but the reasons remain unclear. This research investigates whether coordination overhead, task properties, or system architecture determine when agents improve or degrade performance.

Synthesis note · 2026-02-23 · sourced from Agents Multi Architecture

The question of when multi-agent systems help and when they hurt has been answered with heuristics. This paper replaces heuristics with measurement. Across 180 configurations (5 architectures × 3 LLM families × 4 benchmarks), three dominant effects emerge:

1. Tool-coordination trade-off (β=−0.330, p<0.001): tool-heavy tasks suffer disproportionately from multi-agent overhead. The mechanism is token budget fragmentation — multi-agent systems split per-agent capacity, leaving insufficient tokens for complex tool orchestration. A 16-tool software engineering task under multi-agent coordination loses more than a 2-tool financial reasoning task.

2. Capability saturation (β=−0.408, p<0.001): once single-agent baselines exceed approximately 45% accuracy, coordination yields diminishing or negative returns. Coordination costs exceed improvement potential. This is a measurable threshold, not a vague guideline.

3. Topology-dependent error amplification: independent agents amplify errors 17.2× through unchecked propagation, while centralized coordination contains this to 4.4× via validation bottlenecks that catch errors before aggregation. The architecture is the error control mechanism.

The practical consequences are sharp. Centralized coordination improves performance by 80.9% on parallelizable tasks (financial reasoning). Decentralized coordination excels on dynamic web navigation (+9.2% vs +0.2%). But for sequential reasoning tasks, every multi-agent variant degrades performance by 39-70%. Architecture-task alignment, not agent count, determines success.
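As a rough illustration, the findings above can be folded into a rule of thumb for choosing an architecture. This is a hedged sketch, not the paper's predictive model: the `TaskKind` categories, the function name, and the order in which the checks are applied are assumptions made for the example. Only the 45% threshold and the task-to-architecture pairings come from the note.

```python
from enum import Enum


class TaskKind(Enum):
    PARALLELIZABLE = "parallelizable"  # e.g. financial reasoning
    DYNAMIC = "dynamic"                # e.g. web navigation
    SEQUENTIAL = "sequential"          # step-by-step reasoning


# Reported saturation point: above roughly 45% single-agent accuracy,
# coordination yields diminishing or negative returns.
SATURATION_THRESHOLD = 0.45


def suggest_architecture(kind: TaskKind, single_agent_accuracy: float) -> str:
    """Rule-of-thumb architecture choice (illustrative, not the fitted model)."""
    if kind is TaskKind.SEQUENTIAL:
        # Every multi-agent variant degraded sequential reasoning.
        return "single-agent"
    if single_agent_accuracy > SATURATION_THRESHOLD:
        # Coordination costs likely exceed the remaining improvement potential.
        return "single-agent"
    if kind is TaskKind.PARALLELIZABLE:
        return "centralized"
    return "decentralized"


print(suggest_architecture(TaskKind.PARALLELIZABLE, 0.30))  # centralized
print(suggest_architecture(TaskKind.DYNAMIC, 0.30))         # decentralized
print(suggest_architecture(TaskKind.SEQUENTIAL, 0.30))      # single-agent
```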

The predictive model (R²=0.513, 87% accuracy on held-out configurations) uses measurable task properties — not post-hoc analysis. This means architecture selection can be principled rather than intuitive. The underlying mechanisms are interpretable: fragmentation, overhead exceeding marginal gains, and error propagation without validation.

This finding provides the multi-agent instantiation of the question raised in "How should we balance parallel versus sequential compute at test time?": parallel multi-agent coordination helps on parallelizable tasks and hurts on sequential ones. The 45% saturation threshold adds a quantitative decision boundary that the test-time scaling (TTS) literature lacks.

MasRouter's per-query topology routing (from Arxiv/Routers): MasRouter directly addresses the topology-dependent error amplification finding. Rather than choosing a fixed topology and accepting its scaling limitations, MasRouter routes each query to the optimal collaboration mode (Chain/Tree/Graph) via a variational latent variable model. This transforms topology from a fixed architectural choice into a per-query routing decision, so the system can use centralized coordination for tasks where error propagation matters (financial reasoning) and decentralized coordination for dynamic tasks (web navigation). The 87% prediction accuracy of the scaling laws framework suggests routing decisions could be validated: does MasRouter's topology selection correlate with what the scaling laws predict would work best? See "What decisions must multi-agent routing systems optimize simultaneously?"

The endogeneity paradox: autonomy degree is itself a scaling variable. The largest coordination experiment to date ("Drop the Hierarchy and Roles": 25,000 tasks, 8 models, 4-256 agents) reveals that the optimal coordination topology is not fixed but depends on model capability. A hybrid protocol with fixed ordering but autonomous role selection outperforms both centralized (+14%) and fully autonomous (+44%) coordination. Below a capability threshold the relationship reverses: weak models need rigid structure. This adds a fourth scaling law: the degree of endogenous coordination is capability-contingent. The topology-dependent error amplification finding from this note interacts with autonomy level. Self-organizing agents with strong models develop voluntary self-abstention (agents withdraw when they lack competence) and dynamic role invention (5,006 unique roles from 8 agents), producing emergent structures that fixed topologies cannot match. See "Do self-organizing agent teams outperform rigid hierarchies?"

SAS vs MAS capabilities converge as frontier models improve. "Single-agent or Multi-agent? Why Not Both?" (2025) finds that the benefits of multi-agent systems (MAS) over single-agent systems (SAS) diminish as LLMs gain long-context reasoning, memory retention, and tool use, which mitigates the limitations that originally motivated MAS designs. The paper formalizes three defect types as dependency graph problems: node-level (a bottleneck agent caps performance), edge-level (downstream agents are overwhelmed by upstream inputs, analogous to overthinking from external information), and path-level (indecisive errors propagate as crucial context is lost during inter-agent summarization). A hybrid SAS/MAS cascading approach using confidence-guided routing improves accuracy by 1.1-12% while reducing costs by up to 88%. The exception is AIME (hardest math), where MAS consistently outperforms, confirming the value of MAS for extreme difficulty.
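The confidence-guided cascade can be sketched in a few lines: try the cheap single-agent path first and escalate to the multi-agent system only when confidence is low. This is a minimal sketch of the general idea, not the cited paper's implementation. The `cascade` function, the callable signatures, the way confidence is obtained, and the 0.8 threshold are all assumptions for illustration.

```python
from typing import Callable, Tuple

# (answer text, self-reported confidence in [0, 1]); how confidence is
# estimated is left open here and is an assumption of this sketch.
ScoredAnswer = Tuple[str, float]


def cascade(query: str,
            single_agent: Callable[[str], ScoredAnswer],
            multi_agent: Callable[[str], str],
            threshold: float = 0.8) -> Tuple[str, str]:
    """Return (answer, route). Escalate to MAS only below the threshold."""
    answer, confidence = single_agent(query)
    if confidence >= threshold:
        # Confident enough: skip the more expensive multi-agent run.
        return answer, "single-agent"
    return multi_agent(query), "multi-agent"


# Usage with stand-in callables (hypothetical; real agents would call an LLM).
def easy_sas(q: str) -> ScoredAnswer:
    return "42", 0.95


def unsure_sas(q: str) -> ScoredAnswer:
    return "maybe 7", 0.40


def mas(q: str) -> str:
    return "7 (agreed by the team)"


print(cascade("easy question", easy_sas, mas))    # ('42', 'single-agent')
print(cascade("hard question", unsure_sas, mas))  # ('7 (agreed by the team)', 'multi-agent')
```

The cost saving in such a design comes from how often the first branch is taken, so the threshold trades accuracy against the share of queries that reach the multi-agent system.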

Homogeneous-isolation scaling confirms diminishing returns and locates the cause (SIMAS, https://arxiv.org/abs/2606.00655). Where the three laws above mix heterogeneous architectures and models, "Scaling Behavior of Single LLM-Driven Multi-Agent Systems" isolates the collaboration variable alone: a homogeneous MAS (one base model, sequential inter-agent communication, the minimalist SIMAS framework) scaled purely by agent count. The result is clean. Performance does not scale monotonically with agent count but follows diminishing returns governed by a trade-off between collaborative synergy and coordination overhead. This is the same synergy-vs-overhead mechanism behind the 45% saturation threshold, now shown in the absence of model or knowledge heterogeneity. Two refinements: effective collaboration first requires a sufficiently capable base LLM (below a capability floor, adding agents cannot help), and task type critically modulates the optimal agent count. The framing worth keeping is that collective intelligence is an emergent property contingent on deliberate interaction design, not a guaranteed outcome of agent plurality; without architectural support for synthesis and refinement, multi-agent dialogue just adds overhead. This is the homogeneous, cause-isolating complement to the predictive scaling laws.
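A toy model makes the synergy-versus-overhead trade-off concrete. The functional form below is an assumption chosen for illustration, not something taken from the SIMAS paper: synergy grows logarithmically with agent count, overhead grows linearly, and the `synergy` coefficient stands in for base-model capability. All names and numbers are hypothetical.

```python
import math


def net_gain(n_agents: int, synergy: float, overhead_per_agent: float) -> float:
    """Toy net benefit over a single agent: log-shaped synergy minus linear overhead."""
    return synergy * math.log(n_agents) - overhead_per_agent * (n_agents - 1)


def best_agent_count(synergy: float, overhead_per_agent: float,
                     max_agents: int = 32) -> int:
    """Agent count in 1..max_agents that maximises the toy net gain."""
    return max(range(1, max_agents + 1),
               key=lambda n: net_gain(n, synergy, overhead_per_agent))


# A capable base model (larger synergy coefficient) supports a small team.
print(best_agent_count(synergy=0.10, overhead_per_agent=0.02))  # 5
# A weak base model never recovers the coordination cost.
print(best_agent_count(synergy=0.01, overhead_per_agent=0.02))  # 1
```

In this toy setting the optimum sits where the marginal synergy of one more agent equals its coordination cost, and a low enough synergy coefficient pushes the optimum back to a single agent, which mirrors the capability-floor refinement described above.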

Original note title

multi-agent scaling follows three quantitative laws — tool-coordination trade-off capability saturation at 45 percent and topology-dependent error amplification