Can messy multi-agent transcripts become better training data than clean outputs?

This note explores whether the noise in multi-agent transcripts (dead-ends, corrections, failed attempts, disagreements) carries more useful training signal than polished final answers. The corpus has surprisingly strong evidence on both sides.

This reads the question as a bet against the instinct to clean up your data: are the messy intermediate steps of agents arguing, failing, and backtracking actually richer to learn from than the tidy outputs we usually keep? The corpus suggests the answer is a qualified yes — with one sharp caveat about what kind of mess.

The strongest case for mess comes from two findings that attack the value of "clean" from opposite ends. First, curated expert demonstrations turn out to be a ceiling, not a floor: agents trained only on polished static datasets can't learn from their own failures and never generalize past what the curator already imagined, so competence is capped by the dataset author rather than the agent (see "Can agents learn beyond what their training data shows?"). Clean outputs, in other words, encode the answer but throw away the search that found it. Second, and more startling, models trained on deliberately corrupted reasoning traces perform about as well as models trained on correct ones, and sometimes generalize *better* out of distribution, which implies the trace works as computational scaffolding rather than as a transcript of correct thinking (see "Do reasoning traces need to be semantically correct?"). If even broken reasoning teaches, then the polish we prize may be cosmetic.

There's also a diversity argument hiding here. RL post-training tends to collapse onto a single dominant format from pretraining within the first epoch, suppressing the alternatives, and the winning format depends on model scale rather than necessarily on which format performs best (see "Does RL training collapse format diversity in pretrained models?"). Messy multi-agent transcripts preserve exactly the variety that this collapse destroys: multiple solution paths, competing framings, the record of roads not taken. The mess is the diversity.
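One hedged way to make this diversity claim measurable: tag each sampled solution with a format label and compute the Shannon entropy of those labels before and after post-training. The sketch below is illustrative only. The label names and the two sample lists are invented here and do not come from the cited experiments.

```python
import math
from collections import Counter

def format_entropy(labels):
    """Shannon entropy, in bits, of a list of format labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    # sum of p * log2(1/p), written with total/c to avoid a negative zero
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Hypothetical format labels for eight sampled solutions (assumed data).
before_rl = ["code", "prose", "latex", "code", "prose", "latex", "code", "prose"]
after_rl = ["code"] * 8  # every sample has collapsed onto one format

print(format_entropy(before_rl))  # well above zero: several formats survive
print(format_entropy(after_rl))   # 0.0: a single format remains
```

Entropy near zero means outputs have collapsed onto one format. A transcript corpus whose labels keep entropy well above zero is preserving the alternatives that the collapse removes.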

But the mess only pays off when failure is *legible*. Reflexion shows agents improving across episodes by storing verbal self-diagnoses in memory, and it works because the feedback signal is unambiguous (success/failure), which stops the agent from rationalizing its mistakes; failures have to be clearly labeled as failures to become useful (see "Can agents learn from failure without updating their weights?"). The counter-case sharpens this: MetaGPT found that agents coordinating through standardized structured artifacts beat agents exchanging free-form natural language, precisely because conversational noise had to be eliminated for coordination to work (see "Does structured artifact sharing outperform conversational coordination?"). And agentic evaluation systems show how unstructured intermediate state turns toxic: errors in the memory module cascaded into downstream modules (see "Can agents evaluate AI outputs more reliably than language models?").
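The legibility point can be sketched as a small retry loop in the spirit of Reflexion. This is not the paper's implementation: `EpisodicMemory`, `run_episodes`, and `toy_attempt` are hypothetical names introduced here, and the task is a toy. The sketch assumes the environment returns a binary success flag alongside the agent's own verbal diagnosis.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

@dataclass
class EpisodicMemory:
    """Stores verbal self-diagnoses verbatim (no summarizing or compression)."""
    reflections: List[str] = field(default_factory=list)

    def add(self, text: str) -> None:
        self.reflections.append(text)

def run_episodes(
    attempt: Callable[[List[str]], Tuple[bool, str]],
    max_episodes: int = 3,
) -> Tuple[Optional[int], EpisodicMemory]:
    """Retry a task, feeding earlier diagnoses back in; no weights are updated."""
    memory = EpisodicMemory()
    for episode in range(1, max_episodes + 1):
        success, diagnosis = attempt(list(memory.reflections))
        if success:
            return episode, memory
        # The failure is labeled as a failure, so it cannot be rationalized away.
        memory.add(f"episode {episode} FAILED: {diagnosis}")
    return None, memory

def toy_attempt(reflections: List[str]) -> Tuple[bool, str]:
    """Hypothetical task: succeeds only once a prior diagnosis mentions units."""
    if any("check units" in r for r in reflections):
        return True, ""
    return False, "answer off by 1000x; check units next time"

solved_at, memory = run_episodes(toy_attempt)
print(solved_at, memory.reflections)
```

The design choice worth noticing is that the only thing carried between episodes is the labeled diagnosis. If the success flag were ambiguous, nothing in the loop would tell the agent which reflections to trust.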

So the corpus resolves the question into a distinction the question itself hides: messy-as-*exploratory* (failed paths, alternative solutions, diagnosed errors) is valuable training signal that clean outputs strip away; messy-as-*unstructured* (ambiguous coordination chatter, uncontained errors) actively degrades both training and runtime. The win condition for using transcripts as training data isn't cleaning them — it's instrumenting them, so that every dead-end is labeled as a dead-end and every failure is legible as a failure.
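As a rough illustration of what "instrumenting" a transcript could mean, the sketch below attaches an explicit outcome label to each step and separates legible steps from unlabeled chatter. The schema (`Step`, `Outcome`, `split_transcript`) and the example transcript are assumptions made for this sketch, not a format described in the sources.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple

class Outcome(Enum):
    SUCCESS = "success"
    DEAD_END = "dead_end"
    FAILURE = "failure"

@dataclass
class Step:
    agent: str
    text: str
    outcome: Optional[Outcome] = None  # None means nobody labeled this step
    diagnosis: Optional[str] = None    # why it failed, if known

def split_transcript(steps: List[Step]) -> Tuple[List[Step], List[Step]]:
    """Return (legible, unlabeled): labeled steps are kept, including failures."""
    legible = [s for s in steps if s.outcome is not None]
    unlabeled = [s for s in steps if s.outcome is None]
    return legible, unlabeled

# Hypothetical transcript: one labeled dead-end, one vague remark, one success.
transcript = [
    Step("planner", "Sort the whole log first.", Outcome.DEAD_END,
         "too slow on large input"),
    Step("critic", "hmm, not sure, maybe?"),
    Step("coder", "Stream the log and keep a running count.", Outcome.SUCCESS),
]

legible, unlabeled = split_transcript(transcript)
print([s.outcome.value for s in legible])  # ['dead_end', 'success']
print([s.agent for s in unlabeled])        # ['critic']
```

Note that the dead-end is kept, not filtered out: the filter removes only steps whose status is unknown, which is the distinction between exploratory mess and unstructured mess drawn above.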

Sources (6 notes)

Can agents learn beyond what their training data shows?

Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Can agents learn from failure without updating their weights?

Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.

Does structured artifact sharing outperform conversational coordination?

MetaGPT demonstrates that agents producing standardized engineering documents achieve superior coordination compared to conversational exchange. Active information pulling from shared environments eliminates noise and mirrors efficient human workplace infrastructure.

Can agents evaluate AI outputs more reliably than language models?

Eight-module agentic evaluation achieved 0.27% judge shift versus 31% for LLM-as-a-Judge on complex tasks. However, the memory module cascaded errors, revealing that agentic systems need error isolation mechanisms to maintain gains.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether messy multi-agent transcripts outperform clean outputs as training data. The question remains open; treat the findings below as dated claims to be verified against current models and methods.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat all as perishable:
• Expert demonstrations cap agent generalization; agents trained only on polished datasets cannot learn from their own failures (~2023–25).
• Models trained on deliberately corrupted reasoning traces perform comparably to models trained on correct traces and sometimes generalize better out-of-distribution, suggesting the trace acts as computational scaffolding rather than a record of correct thinking (~2025).
• RL post-training collapses onto a single dominant pretraining format within one epoch, suppressing solution-path diversity; the winning format depends on model scale, not performance (~2025, arXiv:2504.07912).
• Verbal self-diagnosis stored as episodic memory improves agent performance across episodes, but ONLY when feedback is unambiguous (success/failure clearly labeled) (~2023).
• Unstructured conversational noise degrades multi-agent coordination; standardized structured artifacts outperform free-form natural language; errors in one module cascade downstream (~2023–25).

Anchor papers (verify; mind their dates):
• arXiv:2305.19118 (2023) — Multi-Agent Debate
• arXiv:2504.07912 (2025) — Echo Chamber: RL Post-training Amplifies Behaviors
• arXiv:2511.20639 (2025) — Latent Collaboration in Multi-Agent Systems
• arXiv:2510.20733 (2025) — Thought Communication in Multiagent Collaboration

Your task:
(1) RE-TEST THE LEGIBILITY CONSTRAINT. The library claims messy transcripts only pay off when failure is *legible* (unambiguously labeled). Test whether newer training methods (e.g., outcome-based filtering, self-play with contrastive labeling, graph-based error attribution, or automated failure categorization) have relaxed the legibility requirement — i.e., can models now *learn to extract signal from noise* without explicit labels? Separately, does the collapse behavior in arXiv:2504.07912 still hold under current multi-objective RL or mixture-of-experts post-training?
(2) Surface contradicting work from the last 6 months: find papers arguing that *structured* standardization (templates, schemas) outweighs exploratory diversity, or that unstructured mess actively improves generalization in ways the library missed.
(3) Propose two successor questions: (a) Can intermediate-token legibility (arXiv:2505.13775's claim) replace explicit failure labels, allowing raw transcripts to train without human annotation? (b) Does continuous memory update (arXiv:2605.12978) mean agents trained on evolving multi-agent transcripts suffer degradation that clean static datasets avoid, and if so, can version-aware memory fix it?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
