SYNTHESIS NOTE

Can multiple agents stay diverse during training together?

Does training separate specialist agents on different data maintain the reasoning diversity that single-agent finetuning destroys? This matters because diversity correlates with accuracy and prevents models from becoming trapped in narrow response patterns.

Synthesis note · 2026-02-23 · sourced from Agents Multi

Single-agent self-improvement through iterative finetuning hits a wall fast. After one round of finetuning on its own generated outputs, performance saturates and begins to drop: the model becomes fixated on a narrow range of responses, which limits diversity and degrades accuracy. This is the training-time analog of the inference-time problem described in "Does a model improve by arguing with itself?": a single model trapped in its own distribution.

The multiagent finetuning framework (Du et al., 2025) proposes a structural fix: instead of training one model iteratively, train a society of models, each starting from the same base but independently specialized through distinct training data generated via multi-agent interactions. Generation agents produce initial responses; critic agents evaluate and refine them through debate. Each model sees different data because the interactions are role-dependent.
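The role-dependent data split described above can be sketched as follows. This is a minimal illustration, not the framework's actual pipeline: the `Turn` record, the agent identifiers, and the `split_by_agent` helper are all hypothetical names, and the toy debate turns are invented for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    """One utterance in a multi-agent debate (hypothetical record layout)."""
    agent_id: str   # assumed identifier, e.g. "gen_0" or "critic_0"
    role: str       # "generator" or "critic"
    prompt: str     # what this agent was shown
    response: str   # what this agent produced


def split_by_agent(turns):
    """Group debate turns so each agent gets its own finetuning set.

    Keys are (role, agent_id); values are lists of (prompt, response) pairs.
    Because prompts and responses depend on the agent's role, the resulting
    datasets differ even though every agent starts from the same base model.
    """
    datasets = defaultdict(list)
    for t in turns:
        datasets[(t.role, t.agent_id)].append((t.prompt, t.response))
    return dict(datasets)


# Toy debate: two generators answer, one critic reviews their proposals.
turns = [
    Turn("gen_0", "generator", "Q: 12 * 7?", "84"),
    Turn("gen_1", "generator", "Q: 12 * 7?", "12 * 7 = 10 * 7 + 2 * 7 = 84"),
    Turn("critic_0", "critic",
         "Q: 12 * 7? Proposals: '84'; '12 * 7 = 10 * 7 + 2 * 7 = 84'",
         "Both proposals agree and the arithmetic checks out: 84."),
]

datasets = split_by_agent(turns)
for key, examples in datasets.items():
    print(key, len(examples))
```

Each entry in `datasets` would then feed a separate finetuning run on a copy of the shared base model; how examples are filtered before training is left out here because the note does not specify it.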

The mechanism works because role specialization prevents convergence to a single mode. When one model is trained to generate and another to critique, their training distributions diverge, maintaining the diversity that single-agent training destroys. The summarization step between debate rounds helps further by eliminating redundant information and retaining critical points; removing summarization hurts performance. Removing critics also degrades output quality, confirming that the evaluative role is load-bearing, not decorative.

This connects directly to "Does policy entropy collapse limit reasoning performance in RL?": the entropy collapse that limits RL training is mitigated when multiple agents maintain distinct policy distributions. And since LLMs tend to generate novel ideas from narrow ranges (see "Why do LLMs generate novel ideas from narrow ranges?"), preserving diversity at training time through multi-agent specialization could address the output-time diversity problem upstream.
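One simple way to make "diversity" concrete is the Shannon entropy of the empirical distribution over sampled responses. The sketch below is only a proxy chosen for illustration, not a metric taken from the source: the `response_entropy` helper and the strategy labels are hypothetical, and the sample lists are made up.

```python
import math
from collections import Counter


def response_entropy(labels):
    """Shannon entropy in bits of the empirical distribution over labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical labels for the reasoning strategy used in each sampled response.
# A collapsed single agent repeats one strategy.
single_agent = ["algebra"] * 8

# Pooled samples from several specialized agents cover more strategies.
multi_agent = [
    "algebra", "algebra",
    "enumeration", "enumeration",
    "estimation", "estimation",
    "verification", "verification",
]

collapsed_bits = response_entropy(single_agent)
pooled_bits = response_entropy(multi_agent)
print(collapsed_bits, pooled_bits)
```

Under this proxy a fully collapsed sampler scores zero bits, while a pool spread evenly over four strategies scores two bits. Real responses would first need to be mapped to discrete labels or clusters, which is an assumption this sketch does not address.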

The cost is real: multiple model copies for both training and inference. But the finding that single-agent finetuning collapses after one iteration means the choice is not "cheap single-agent" versus "expensive multi-agent" but "one iteration of productive training" versus "sustained improvement across many rounds."

Original note title

multi-agent finetuning preserves reasoning diversity by training agents on distinct data and roles — single-agent self-improvement saturates after one iteration