Does training on self-play disagreement data improve multi-agent reasoning outcomes?

This explores whether letting agents argue with each other — and training on the friction those disagreements produce — actually makes multi-agent reasoning better, or whether disagreement is just noise.

Put differently, the question is whether disagreement between agents is a *training signal* worth harvesting, rather than a coordination problem to suppress. The corpus doesn't contain a single paper that bolts "self-play + disagreement data" together under that name, but it has the component parts, and read laterally they point toward a qualified yes, with sharp caveats.

The strongest evidence that disagreement helps reasoning is structural. Can dialogue format help models reason more diversely? shows that forcing a *single* model to reason as a back-and-forth between distinct agents beats monologue reasoning on diversity and coherence — the disagreement format itself breaks the fixed-strategy rut that solo reasoning falls into. Can branching prompts replicate what multi-agent systems do? pushes this further: structured multi-persona prompting is functionally equivalent to multi-agent debate, suggesting the *gains come from the adversarial structure*, not from spinning up separate models. And Can formal argumentation make AI decisions truly contestable? gives the cleanest version of why disagreement is informative — attack/defense graphs make explicit which premises are actually contested, which is exactly the data a training signal would want to capture.
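To make the attack/defense idea concrete, here is a minimal sketch of how a Dung-style graph separates settled arguments from contested ones. It computes the standard grounded extension (arguments that are defended all the way down) and treats whatever is neither accepted nor defeated as contested. The argument names and the `contested` helper are illustrative assumptions, not taken from the cited note.

```python
from typing import Dict, Set, Tuple

Attack = Tuple[str, str]  # (attacker, target)


def grounded_extension(arguments: Set[str], attacks: Set[Attack]) -> Set[str]:
    """Least fixed point of the 'defended by' operator in a Dung-style framework."""
    attackers: Dict[str, Set[str]] = {a: set() for a in arguments}
    for src, dst in attacks:
        attackers[dst].add(src)

    accepted: Set[str] = set()
    while True:
        # Anything attacked by an accepted argument is defeated.
        defeated = {dst for src, dst in attacks if src in accepted}
        # An argument is defended when all of its attackers are defeated.
        defended = {a for a in arguments if attackers[a] <= defeated}
        if defended == accepted:
            return accepted
        accepted = defended


def contested(arguments: Set[str], attacks: Set[Attack]) -> Set[str]:
    """Arguments that are neither accepted nor defeated (hypothetical helper)."""
    accepted = grounded_extension(arguments, attacks)
    defeated = {dst for src, dst in attacks if src in accepted}
    return arguments - accepted - defeated


# Hypothetical example: one resolved chain and one unresolved mutual attack.
args = {"claim", "objection", "rebuttal", "x", "y"}
atts = {("objection", "claim"), ("rebuttal", "objection"), ("x", "y"), ("y", "x")}
print(sorted(grounded_extension(args, atts)))  # ['claim', 'rebuttal']
print(sorted(contested(args, atts)))           # ['x', 'y']
```

In this toy case the mutual attack between `x` and `y` is the only live disagreement, which is the kind of pinpointed conflict a training signal could be built around.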

The self-play side is where it gets interesting. Can agents learn beyond what their training data shows? supplies the motive: agents trained only on static expert data can't learn from their own failures and are capped by what curators imagined. Self-play is the escape hatch. Can adversarial critics replace task-specific verifiers for reasoning? is the closest thing in the corpus to your literal question: RARO runs an adversarial game in which a critic learns to discriminate expert answers from the policy's own, and that disagreement signal *replaces* hand-built verifiers while matching the scaling properties of verifier-based RL. That's training-on-disagreement in everything but name. Relatedly, Can model confidence work as a reward signal for reasoning? shows that internally generated preference signals (here, RLSF's confidence-based ranking of traces) can strengthen reasoning without external labels, following the same "mine your own disagreement" logic.
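The "mine your own disagreement" idea can be sketched as turning confidence gaps between a model's own traces into preference pairs. This is a rough illustration, not the RLSF implementation: the `Trace` type, the geometric-mean confidence measure, and the `min_gap` threshold are all assumptions made for the example.

```python
import math
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple


@dataclass
class Trace:
    text: str
    # Per-token log-probabilities over the final answer span (assumed available).
    answer_logprobs: List[float]


def answer_confidence(trace: Trace) -> float:
    """Geometric-mean token probability over the answer span (an assumed measure)."""
    if not trace.answer_logprobs:
        return 0.0
    return math.exp(sum(trace.answer_logprobs) / len(trace.answer_logprobs))


def preference_pairs(traces: List[Trace], min_gap: float = 0.1) -> List[Tuple[Trace, Trace]]:
    """Return (preferred, rejected) pairs whose confidence gap is at least min_gap."""
    scored = sorted(
        ((answer_confidence(t), t) for t in traces),
        key=lambda pair: pair[0],
        reverse=True,
    )
    pairs = []
    # combinations() keeps sorted order, so the first element is the more confident one.
    for (c_hi, t_hi), (c_lo, t_lo) in combinations(scored, 2):
        if c_hi - c_lo >= min_gap:
            pairs.append((t_hi, t_lo))
    return pairs
```

The threshold matters in this sketch: pairs with a tiny confidence gap carry little preference information, so they are dropped rather than treated as signal.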

The counterweight: more disagreeing agents do *not* reliably produce better outcomes, and the failure isn't subtle. Why do multi-agent systems fail to coordinate at scale? shows agents agreeing too late or adopting strategies without informing their neighbors, and Can LLM agent groups reliably reach consensus together? shows agreement degrading as groups grow, not because agents get corrupted, but because they *time out and stall before converging*. Disagreement that never resolves is dead weight, not training signal. So the answer hinges on whether the disagreement is *resolved into a learnable preference* (RARO's critic, confidence ranking, argumentation graphs) or just left as unresolved conflict (raw consensus failure). Can RL agents learn to reason better, not just succeed? hints at the bridge: rewarding the *process* of reflection and monitoring, not just outcomes, is how you'd turn disagreement into a signal the model can actually train on.
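The resolved-versus-unresolved distinction can be illustrated with a small data filter. The sketch below keeps a disagreement only when some resolver (a critic, a confidence ranking, a verifier) has picked a winner, and merely counts debates that stalled. The `DebateRecord` shape and field names are hypothetical and not drawn from any of the cited systems.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class DebateRecord:
    prompt: str
    positions: List[str]          # final answer from each agent
    winner: Optional[str] = None  # None if the debate timed out or stalled


def to_training_pairs(
    records: List[DebateRecord],
) -> Tuple[List[Tuple[str, str, str]], int]:
    """Return (prompt, chosen, rejected) triples plus a count of unresolved debates."""
    pairs: List[Tuple[str, str, str]] = []
    unresolved = 0
    for record in records:
        distinct = set(record.positions)
        if len(distinct) < 2:
            continue  # unanimous: no disagreement to learn from
        if record.winner is None or record.winner not in distinct:
            unresolved += 1  # disagreement without resolution yields no pair
            continue
        for loser in sorted(distinct - {record.winner}):
            pairs.append((record.prompt, record.winner, loser))
    return pairs, unresolved
```

Tracking the unresolved count alongside the pairs is a deliberate choice here: a rising share of stalled debates would be an early sign that adding agents is producing dead weight rather than data.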

One grounding caution worth carrying: Does chain-of-thought reasoning actually generalize beyond training data? shows reasoning that *looks* valid can be logically hollow outside the training distribution. Self-play disagreement that rewards persuasive-sounding traces rather than correct ones risks amplifying exactly that — fluent reasoning that doesn't generalize.

Sources (10 notes)

Can dialogue format help models reason more diversely?

DialogueReason, which structures a single model's internal reasoning as dialogue between distinct agents in separate scenes, overcomes monologue reasoning's fixed-strategy and fragmented-attention weaknesses, especially on tasks requiring multiple problem-solving approaches.

Can branching prompts replicate what multi-agent systems do?

Solo Performance Prompting shows that a single LLM using dynamic persona simulation can achieve multi-agent cognitive synergy without multiple model instances. Its structured prompting techniques map directly onto multi-agent debate architectures, so the two can reach equivalent outcomes.

Can formal argumentation make AI decisions truly contestable?

Dung-style argumentation structures AI outputs as traversable attack/defense graphs, allowing users to identify and contest specific premises. Standard LLM outputs lack this structure, making it impossible to pinpoint which claims users actually reject.

Can agents learn beyond what their training data shows?

Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.

Can adversarial critics replace task-specific verifiers for reasoning?

RARO uses an adversarial game where a critic discriminates expert from policy answers, eliminating the need for domain-specific verifiers while matching the scaling properties of verifier-based RL. The approach works across Countdown, DeepMath, and Poetry Writing tasks.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Why do multi-agent systems fail to coordinate at scale?

AgentsNet benchmark shows agents fail to coordinate strategies either by agreeing too late or adopting strategies without informing neighbors. Agents accept neighbor information without verification, enabling error propagation while remaining capable of detecting direct conflicts.

Can LLM agent groups reliably reach consensus together?

Across hundreds of simulations, LLM-agent groups frequently fail to reach valid agreement due to timeouts and stalled convergence rather than subtle value corruption. Agreement degrades with group size even without Byzantine agents present.

Can RL agents learn to reason better, not just succeed?

RLVMR uses structured meta-reasoning tags (planning, exploration, reflection, monitoring) with programmatic rewards to train agentic RL. This reduces repetitive actions by 31% compared to outcome-only methods while maintaining better generalization than supervised fine-tuning alone.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst testing whether training on self-play disagreement data truly improves multi-agent reasoning. This question remains open despite recent work.

What a curated library found — and when (dated claims, not current truth):
A library spanning 2023–2026 identified these components:
• Dialogue-based reasoning (agents reasoning back-and-forth) outperforms monologue on diversity and coherence; multi-persona prompting is functionally equivalent to multi-agent debate (~2025).
• Adversarial feedback loops (critic learning to discriminate expert from policy outputs) match hand-built verifiers while scaling; confidence gaps between traces can serve as intrinsic preference signals without external labels (~2025).
• Structured argumentation (attack/defense graphs) makes contested premises explicit — the ideal training signal (~2024).
• **Critical constraint:** Multi-agent groups degrade predictably as they grow; unresolved disagreement that times out is dead weight, not signal (~2025). Reasoning that looks valid can be logically hollow outside training distribution (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2305.19118 (2023) — Divergent Thinking via Multi-Agent Debate
• arXiv:2405.02079 (2024) — Argumentative LLMs for Contestable Decisions
• arXiv:2507.21931 (2025) — RL from Self-Feedback
• arXiv:2511.21667 (2025) — Escaping the Verifier via Demonstrations

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For unresolved disagreement and scaling degradation, probe whether orchestration advances (memory caching, hierarchical multi-agent frameworks, timeout-aware consensus) or newer critic designs have *relaxed* the liveness/convergence bottleneck. For distribution-bounded reasoning, check whether recent self-play formulations explicitly optimize for out-of-distribution robustness (e.g., adversarial curriculum, meta-reasoning rewards). Distinguish the durable problem (disagreement must resolve into learnable signal) from possibly-solved limitations (group coordination, distribution shift).
(2) **Surface contradicting or superseding work from the last 6 months.** Look for papers claiming multi-agent disagreement *without* resolution still yields gains, or showing self-play converges faster than the 2025 literature suggests.
(3) **Propose 2 research questions assuming the regime has moved:** (a) Can verifiable meta-reasoning rewards operationalize *which disagreements matter* in self-play, avoiding fluent-but-hollow traces? (b) Does hierarchical or temporally-gated multi-agent disagreement (resolve locally, escalate globally) outperform flat consensus on reasoning tasks?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
