When do multi-agent systems actually outperform single agents?
As individual LLMs grow more capable, does the advantage of splitting work across multiple agents still hold? This note explores when coordination overhead makes multi-agent systems counterproductive.
"Single-agent or Multi-agent Systems? Why Not Both?" (2025) provides an empirical and theoretical analysis of when multi-agent systems (MAS) help versus hurt, with a finding that challenges the default toward multi-agent architectures.
The diminishing advantage. Prior studies reported that MAS beat single-agent systems (SAS) on accuracy across diverse domains. However, as frontier LLMs rapidly advance in long-context reasoning, memory retention, and tool usage, single-agent capability improvements are mitigating many of the limitations that originally motivated MAS designs. The empirical study finds that, across various agentic applications, the performance gap between MAS and SAS narrows with stronger models, and that SAS outperforms MAS in a substantial portion of cases.
Three MAS defect types formalized as dependency graph problems:
Node-level defect: MAS and SAS performance are both bottlenecked by the critical agent responsible for the most difficult subtask. MAS cannot escape the ceiling set by its weakest critical component, so adding more agents does not help if the hardest subtask remains unsolved.
Edge-level defect: Downstream agents become overwhelmed by inputs from upstream agents. In multi-way conversations or prolonged iterative refinements, high in-degree nodes (summarizers, synthesizers) receive more information than they can process effectively, leading to overthinking on edge cases. This is "analogous to the overthinking of the reasoning model, but rather than being lost in thinking, the agent becomes overwhelmed by inputs from upstream agents." MAS aggravates the problem because agents process much more data.
Path-level defect: Indecisive errors propagate through chains of agent interactions. Crucial context is lost or diluted when intermediate outputs are summarized or filtered. Even small information loss causes irreversible errors downstream via snowball effects. The specific failure mode: correct solutions proposed in earlier rounds get lost during summarization before reaching the next agent — "this loss is unrecoverable, as downstream agents no longer have access to the full previous results."
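The three defect types can be illustrated on a toy dependency graph. The sketch below is not the paper's formalization: the agent names, the per-agent success rates, the in-degree budget of 2, and the 0.9 per-hop retention factor are all hypothetical values chosen for illustration. It only shows where each defect would sit in a graph: the weakest node, the node with the most incoming edges, and the compounding loss along a path.

```python
from collections import defaultdict

# Hypothetical MAS dependency graph: an edge (u, v) means agent v consumes agent u's output.
EDGES = [
    ("planner", "researcher_a"),
    ("planner", "researcher_b"),
    ("planner", "researcher_c"),
    ("researcher_a", "synthesizer"),
    ("researcher_b", "synthesizer"),
    ("researcher_c", "synthesizer"),
    ("synthesizer", "writer"),
]

# Hypothetical success rate of each agent on its own subtask (illustrative numbers only).
SUCCESS = {
    "planner": 0.95,
    "researcher_a": 0.90,
    "researcher_b": 0.60,
    "researcher_c": 0.90,
    "synthesizer": 0.85,
    "writer": 0.95,
}


def in_degrees(edges):
    """Count how many upstream agents feed each agent."""
    deg = defaultdict(int)
    for _, dst in edges:
        deg[dst] += 1
    return dict(deg)


def node_bottleneck(success):
    """Node-level view: the agent with the lowest success rate is the likely ceiling."""
    return min(success, key=success.get)


def overloaded_nodes(edges, max_in_degree=2):
    """Edge-level view: flag agents whose in-degree exceeds an assumed input budget."""
    return sorted(n for n, d in in_degrees(edges).items() if d > max_in_degree)


def path_retention(hops, keep_per_hop=0.9):
    """Path-level view: fraction of context left after `hops` lossy summarization steps."""
    return keep_per_hop ** hops


print(node_bottleneck(SUCCESS))      # weakest agent
print(overloaded_nodes(EDGES))       # high in-degree agents
print(round(path_retention(3), 3))   # context surviving a 3-hop chain
```

Under these assumed numbers, the weakest agent is `researcher_b`, the only overloaded node is `synthesizer`, and a three-hop chain that keeps 90% of the context per hop retains about 73% of it. The multiplicative model in `path_retention` is a simplification of the snowball effect described above, not a measured quantity.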
The hybrid solution. Confidence-guided routing between SAS and MAS (request cascading) selectively offloads requests based on difficulty. The approach improves accuracy by 1.1-12% while reducing costs by up to 88%. AIME (the hardest math tasks) is the exception, where MAS consistently outperforms SAS, illustrating the value of MAS for extremely difficult tasks.
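A minimal sketch of confidence-guided cascading follows. This note does not specify how the paper derives its confidence signal, so the `Answer` type, the `cascade` function, the 0.8 threshold, and the stub solvers `fake_sas` and `fake_mas` are all hypothetical. The sketch shows only the control flow: answer with the single agent first, and escalate to the multi-agent system when confidence is low.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Answer:
    text: str
    confidence: float  # assumed to lie in [0, 1]
    cost: float        # arbitrary cost units, e.g. tokens spent


def cascade(
    request: str,
    run_sas: Callable[[str], Answer],
    run_mas: Callable[[str], Answer],
    threshold: float = 0.8,  # hypothetical cutoff, would need tuning
) -> tuple[Answer, str]:
    """Try the single agent first; escalate to MAS only on low confidence."""
    first = run_sas(request)
    if first.confidence >= threshold:
        return first, "sas"
    second = run_mas(request)
    # The SAS attempt was still paid for, so its cost is added to the total.
    total = Answer(second.text, second.confidence, first.cost + second.cost)
    return total, "sas->mas"


# Stub solvers standing in for real systems (hypothetical behavior).
def fake_sas(request: str) -> Answer:
    hard = "hard" in request
    return Answer("sas answer", 0.4 if hard else 0.95, cost=1.0)


def fake_mas(request: str) -> Answer:
    return Answer("mas answer", 0.9, cost=8.0)


for req in ["easy question", "hard question"]:
    answer, route = cascade(req, fake_sas, fake_mas)
    print(req, route, answer.cost)
```

In this toy setup the easy request stays on the single agent at cost 1.0, while the hard request escalates and pays for both attempts (9.0). Raising the threshold sends more traffic to MAS, trading cost for potential accuracy on difficult requests.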
This extends "When does adding more agents actually help systems?": the scaling laws quantify MAS overhead, while this paper shows the overhead becoming less worthwhile as single-agent capability increases. As "Why do multi-agent LLM systems converge without genuine deliberation?" shows, MAS suffers from both coordination overhead and pseudo-agreement, which strengthens the case for SAS with selective MAS escalation.
Inquiring lines that use this note as a source 24
This note is a source for these synthesized inquiries.
- How do multi-agent LLM systems fail at coordination and role consistency?
- Why does diversity without expertise produce worse results than a single capable agent?
- What specific network sizes trigger coordination degradation in LLM systems?
- How do multi-agent systems improve on single frontier models?
- How do static team decomposition and dynamic agent selection compare in efficiency?
- How can humans oversee multiple partial-progress agents simultaneously?
- Do parallel LLM workers coordinate emergently without predefined collaboration rules?
- Which research tasks are better suited for multi-agent versus single-agent approaches?
- Does parallel task structure determine optimal multi-agent architecture?
- At what task difficulty does multi-agent decomposition become worth the coordination cost?
- How does distributed coordination fail as agent networks scale?
- Does internal task decomposition eliminate overhead from multi-agent coordination?
- What capability threshold do agents need to self-organize effectively?
- Does horizontal coordination improve with stronger individual agents?
- Can multiple small models outperform a single large model with good routing?
- At what capability threshold does multi-agent coordination stop helping?
- Does group size have predictable effects on LLM agent agreement rates?
- Which layer of agent systems creates the largest capability gains in practice?
- How do evaluation methods differ for single versus multi-agent systems?
- What four decisions matter most in multi-agent system routing?
- Does model capability still matter once coordination infrastructure is optimized?
- Can a single dominant mechanism replace the combined effect of all five?
- Can multi-agent teams solve problems better than single models thinking longer?
- When does multi-agent scaling actually outperform static ensembles?
Related concepts in this collection 4
- When does adding more agents actually help systems?
  Multi-agent systems often fail in practice, but the reasons remain unclear. This research investigates whether coordination overhead, task properties, or system architecture determine when agents improve or degrade performance.
  quantifies the overhead; this paper shows it becoming less worthwhile
- Why do multi-agent LLM systems converge without genuine deliberation?
  Multi-agent reasoning systems are designed to improve answers through debate, but often agents simply agree with early confident claims rather than genuinely disagreeing. What drives this pattern and how common is it?
  MAS suffers coordination overhead AND pseudo-agreement
- Does token spending drive multi-agent research performance?
  Multi-agent systems outperform single agents substantially, but what actually accounts for that improvement? Is it intelligent coordination or simply spending more tokens on the same task?
  if tokens drive performance, a single capable model may be more efficient than many smaller ones
- Does more thinking time always improve reasoning accuracy?
  Explores whether extending a model's thinking tokens linearly improves performance, or if there's a point beyond which additional reasoning becomes counterproductive.
  edge-level defect is external-input-induced overthinking paralleling internal overthinking
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets
- Single-agent or Multi-agent Systems? Why Not Both?
- Scaling Behavior of Single LLM-Driven Multi-Agent Systems
- Why Do Multi-agent LLM Systems Fail?
- Towards a Science of Scaling Agent Systems
- LLMs Corrupt Your Documents When You Delegate
- How we built our multi-agent research system
- Drop the Hierarchy and Roles: How Self-Organizing LLM Agents Outperform Designed Structures
Original note title
multi-agent system advantages diminish as single-agent LLM capabilities improve — three defect types in MAS dependency graphs explain when single beats multi