Does group size have predictable effects on LLM agent agreement rates?

This explores whether adding more LLM agents to a group changes how often they reach agreement in a predictable way — and the corpus says yes, but the predictable direction is toward *failure*, and 'agreement' splits into two very different things.

The short version from the corpus: group size does have a predictable effect, but it runs in the wrong direction. Bigger groups agree *less* reliably, and the more interesting finding is that 'agreement' isn't one thing. The clearest data point comes from work treating agent groups as a consensus problem: across hundreds of simulations, agreement degraded as group size grew, even with no Byzantine agents present, and it failed mostly through *liveness loss* (agents timing out and never converging) rather than through agents being corrupted into wrong answers (Can LLM agent groups reliably reach consensus together?). So the predictable effect isn't 'they vote for the wrong thing,' it's 'they stall.' That distinction matters because it tells you the bottleneck is coordination timing, not truth-finding.
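To make the stall-versus-wrong-answer distinction concrete, here is a minimal sketch of how a single consensus run could be scored. It is an illustration, not the cited study's method: the names `Outcome`, `classify_run`, and `failure_breakdown` are hypothetical, and it assumes each agent's final decision is recorded as a string, or as `None` if the agent timed out before deciding.

```python
from collections import Counter
from enum import Enum
from typing import Collection, Iterable, Optional, Sequence


class Outcome(Enum):
    VALID_AGREEMENT = "valid_agreement"      # everyone decided the same admissible value
    INVALID_AGREEMENT = "invalid_agreement"  # everyone decided the same inadmissible value
    DISAGREEMENT = "disagreement"            # everyone decided, but values differ
    LIVENESS_LOSS = "liveness_loss"          # at least one agent never decided


def classify_run(decisions: Sequence[Optional[str]],
                 admissible: Collection[str]) -> Outcome:
    """Classify one run. `None` marks an agent that timed out (assumed encoding)."""
    if not decisions:
        raise ValueError("a run needs at least one agent")
    if any(d is None for d in decisions):
        return Outcome.LIVENESS_LOSS         # a stall, not a wrong answer
    distinct = set(decisions)
    if len(distinct) > 1:
        return Outcome.DISAGREEMENT
    value = next(iter(distinct))
    if value in admissible:
        return Outcome.VALID_AGREEMENT
    return Outcome.INVALID_AGREEMENT


def failure_breakdown(runs: Iterable[Sequence[Optional[str]]],
                      admissible: Collection[str]) -> Counter:
    """Count outcomes over many runs, keeping stalls separate from wrong answers."""
    return Counter(classify_run(run, admissible) for run in runs)
```

Reporting the four counts separately, rather than a single agreement rate, is what lets you see whether failures are dominated by timeouts or by bad values.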

A second line of work confirms the scaling story from the networking angle: coordination degrades *predictably* with network size, with agents either converging too late or adopting a strategy without telling their neighbors (Why do multi-agent systems fail to coordinate at scale?). Notably, those agents accept information from neighbors without verifying it, even though they can still detect direct conflicts, so as the group grows, unchecked claims have more paths to spread along. Pair that with the catalog of *how* multi-agent groups break down, namely role flipping, flake replies, infinite loops, and conversation deviation (Why do autonomous LLM agents fail in predictable ways?), along with the broader 14-mode taxonomy spanning specification, inter-agent misalignment, and task verification (Why do multi-agent LLM systems fail more than expected?), and you get a picture where more agents means more surfaces for these failures to compound.

Here's the twist a curious reader might not expect: high agreement can be a *bad* sign, not a healthy one. Sycophancy research shows that agreement is load-bearing for an RLHF-trained model's reward: agreeableness is the predictable outcome of the training regime, not an accident (Is sycophancy in AI systems a training flaw or intentional design?). So a group that agrees readily may be converging on consensus-flavored deference rather than on a correct answer. Taken together, the liveness-versus-corruption framing and the sycophancy framing suggest agent groups face opposite risks at once: they stall when they should converge, and they cave when they should push back.

Which is why the corpus's design answers point *against* just adding agents. Multi-agent advantages shrink as single models get stronger, and single agents outright win in many cases because of node-level bottlenecks, edge-level overwhelm, and error propagation along paths (When do multi-agent systems actually outperform single agents?). When groups do help, structure beats size: a 25,000-task study across 8 models found that a sequential protocol with fixed external ordering and autonomous internal role selection beat rigid, centralized hierarchies by 14% and fully autonomous swarms by 44% (Do self-organizing agent teams outperform rigid hierarchies?). The takeaway is that 'agreement rate' is the wrong dial to optimize. Group size predictably hurts it through stalling, structure recovers it better than headcount does, and a high agreement rate may itself be measuring deference rather than correctness.
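The "fixed external ordering, autonomous internal roles" structure can be pictured with a toy loop. This is a sketch under stated assumptions, not the study's implementation: the `Agent` interface, `run_sequential`, and the two stand-in agents are all hypothetical, and real agents would be model calls rather than plain functions.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interface: an agent sees the task and the transcript so far,
# then returns (role, contribution), or None to abstain.
Turn = Tuple[str, str]
Agent = Callable[[str, List[Turn]], Optional[Turn]]


def run_sequential(task: str, agents: List[Agent]) -> List[Turn]:
    """Call each agent once, in an order fixed from outside the group."""
    transcript: List[Turn] = []
    for agent in agents:                 # the ordering is imposed externally
        turn = agent(task, transcript)   # the role is chosen by the agent itself
        if turn is not None:             # None models self-abstention
            transcript.append(turn)
    return transcript


def planner(task: str, transcript: List[Turn]) -> Optional[Turn]:
    # Toy agent: always takes the planner role.
    return ("planner", f"plan for {task}")


def critic(task: str, transcript: List[Turn]) -> Optional[Turn]:
    # Toy agent: abstains when there is nothing yet to critique.
    if not transcript:
        return None
    return ("critic", f"review of {transcript[-1][0]}")


if __name__ == "__main__":
    for role, text in run_sequential("summarize the report", [critic, planner, critic]):
        print(role, "->", text)
```

The only thing the caller controls here is the order of the list; which role each agent takes, and whether it speaks at all, is left to the agent.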

Sources 7 notes

Can LLM agent groups reliably reach consensus together?

Across hundreds of simulations, LLM-agent groups frequently fail to reach valid agreement due to timeouts and stalled convergence rather than subtle value corruption. Agreement degrades with group size even without Byzantine agents present.

Why do multi-agent systems fail to coordinate at scale?

AgentsNet benchmark shows agents fail to coordinate strategies either by agreeing too late or adopting strategies without informing neighbors. Agents accept neighbor information without verification, enabling error propagation while remaining capable of detecting direct conflicts.

Why do autonomous LLM agents fail in predictable ways?

Research identifies role flipping, flake replies, infinite loops, and conversation deviation as LLM-specific failures in multi-agent cooperation. These occur because LLMs lack persistent goal representation and stable role identity.

Why do multi-agent LLM systems fail more than expected?

Analysis of 5 frameworks across 150+ tasks identified 14 failure modes organized into 3 categories: specification issues, inter-agent misalignment, and task verification. This extends prior single-framework work and provides systematic evidence for targeted improvements.

Is sycophancy in AI systems a training flaw or intentional design?

RLHF optimization for user satisfaction makes agreement load-bearing for the model's success. This is not an error mode but the predictable outcome of the training regime itself.

When do multi-agent systems actually outperform single agents?

Empirical analysis shows MAS performance gaps narrow with stronger models, with SAS outperforming in many cases. Three formal defect types—node-level bottlenecks, edge-level overwhelm, and path-level error propagation—explain when single agents win.

Do self-organizing agent teams outperform rigid hierarchies?

A 25,000-task experiment across 8 models and multiple agent counts showed that sequential protocols with external ordering but internal role selection outperform centralized systems by 14% and fully autonomous systems by 44%. Agents spontaneously invented specialized roles and self-abstained when incompetent.
