Why do frontier models corrupt more documents than weaker models during workflows?
This explores a counterintuitive result from long delegated-editing benchmarks: it's not that frontier models necessarily touch more documents, but that their failures shift from visible deletion to silent, plausible-looking corruption that surface competence hides.
The claim sounds backwards at first: that the smartest models damage your documents more than weaker ones during long workflows. The sharper version is about *how* they fail, not *how much*. Across 19 models and 52 domains, even advanced systems quietly corrupt roughly a quarter of document content over extended relay tasks, and the errors keep compounding through 50 round-trips without plateauing (see "Do frontier LLMs silently corrupt documents in long workflows?"). The damage doesn't announce itself.
The key is that failure has a model-tier signature. Weaker models fail loudly: they delete content, which is jarring and easy to catch. Frontier models fail quietly: they rewrite, paraphrase, and 'improve', producing edits that read fluently and look competent but drift away from the source (see "Do frontier models fail differently than weaker models?"). So the apparent paradox is really a detectability problem. A missing paragraph trips an alarm; a confidently reworded-but-wrong sentence sails through review. The frontier model's strength, generating polished and plausible text, is exactly what makes its corruption invisible.
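To make the detectability gap concrete, here is a minimal sketch that separates sentences deleted outright from sentences rewritten in place. It is an illustration only, not the benchmark's scoring method: the function names, the naive sentence splitter, and the sample text are all assumptions, and the comparison uses Python's standard `difflib`.

```python
import difflib


def split_sentences(text: str) -> list[str]:
    # Naive splitter for illustration: breaks on "." and drops empty pieces.
    return [s.strip() for s in text.split(".") if s.strip()]


def classify_changes(source: str, edited: str) -> dict[str, int]:
    """Count sentences deleted outright versus rewritten in place."""
    src, out = split_sentences(source), split_sentences(edited)
    counts = {"unchanged": 0, "deleted": 0, "rewritten": 0, "inserted": 0}
    matcher = difflib.SequenceMatcher(a=src, b=out, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        removed, added = i2 - i1, j2 - j1
        if tag == "equal":
            counts["unchanged"] += removed
        else:
            # A "replace" pairs old sentences with new ones; any surplus on
            # either side is a pure deletion or a pure insertion.
            paired = min(removed, added)
            counts["rewritten"] += paired
            counts["deleted"] += removed - paired
            counts["inserted"] += added - paired
    return counts


# Hypothetical sample text, invented for this example.
source = "Revenue rose 4 percent. Costs fell 2 percent. Headcount was flat."
loud = "Revenue rose 4 percent. Headcount was flat."
quiet = "Revenue rose 4 percent. Costs rose 2 percent. Headcount was flat."

print(classify_changes(source, loud))   # one sentence deleted
print(classify_changes(source, quiet))  # one sentence rewritten, none deleted
```

A sentence-count check flags the loud version (two sentences instead of three) but passes the quiet one, even though the quiet edit reverses a fact. Catching it requires comparing content against the source, not just checking that nothing went missing.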
If you suspect the tools are to blame, the benchmark evidence closes that door. Giving models richer agentic tool access doesn't fix this; the degradation originates upstream, in the model's *judgment about what to change*, not in the editing interface (see "Can better tools fix LLM document editing errors?"). The model decides an edit is warranted when it isn't. That's a reasoning failure dressed up as a capability.
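If the failure is a judgment about what to change, one way to surface it is to check whether an edit stayed inside the scope it was given. The sketch below is an assumption-laden illustration, not something the benchmark proposes: it supposes a document is stored as named sections, and `out_of_scope_changes` is a hypothetical helper name.

```python
def out_of_scope_changes(
    before: dict[str, str],
    after: dict[str, str],
    allowed: set[str],
) -> list[str]:
    """Return the names of sections that changed although the task did not target them."""
    changed = []
    for name in sorted(before.keys() | after.keys()):
        if name in allowed:
            continue  # edits here were requested, so they are not flagged
        if before.get(name) != after.get(name):
            # Covers rewritten, removed, and newly added sections alike.
            changed.append(name)
    return changed


# Hypothetical example: the task asked for changes to "methods" only.
before = {"intro": "Background.", "methods": "Old steps.", "results": "Accuracy was 80%."}
after = {"intro": "Background.", "methods": "New steps.", "results": "Accuracy was about 80%."}

print(out_of_scope_changes(before, after, allowed={"methods"}))  # ['results']
```

A check like this only detects unrequested edits. It says nothing about whether the requested edit itself is faithful, so it complements a content comparison rather than replacing one.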
Why does it compound rather than settle? Because the model's own earlier mistakes become part of the context it reads on the next pass. Prior errors in the history bias future reasoning, and performance degrades non-linearly as the contaminated context grows. Notably, *scaling the model doesn't rescue you*. Only spending more test-time compute (thinking models that reason before acting) blunts the effect, by keeping error-poisoned context from steering the next step (see "Do models fail worse when their own errors fill the context?"). This is the same self-reinforcing trap seen elsewhere: reasoning models that wander into invalid explorations and abandon promising paths too early do so because of structural disorganization, not lack of horsepower (see "Why do reasoning models abandon promising solution paths?").
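Because each pass builds on the previous pass's output, reviewing only the latest step understates the damage. The sketch below illustrates that with toy data: it reports, for each round, the change relative to the previous version and the cumulative drift from the original. The function name, the word-level comparison, and the sample versions are assumptions made for this example and are not drawn from the benchmark.

```python
import difflib


def drift_report(original: str, versions: list[str]) -> list[tuple[float, float]]:
    """For each round, return (change vs. previous version, drift vs. original).

    Both values are 1 minus difflib's similarity ratio over word sequences,
    so 0.0 means identical and 1.0 means nothing in common.
    """
    base = original.split()
    previous = base
    rows = []
    for text in versions:
        words = text.split()
        step = 1 - difflib.SequenceMatcher(None, previous, words, autojunk=False).ratio()
        total = 1 - difflib.SequenceMatcher(None, base, words, autojunk=False).ratio()
        rows.append((step, total))
        previous = words
    return rows


# Toy relay, invented for illustration: one word is swapped per round.
original = "alpha beta gamma delta"
versions = [
    "alpha beta gamma omega",
    "alpha beta sigma omega",
    "alpha kappa sigma omega",
]

for round_no, (step, total) in enumerate(drift_report(original, versions), start=1):
    print(f"round {round_no}: step change {step:.2f}, drift from original {total:.2f}")
```

In this toy relay every round looks like the same small change against its predecessor, while the drift from the original keeps growing. The practical point is to keep the original available and compare against it, not only against the last hand-off.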
The useful takeaway is that 'more capable' and 'more trustworthy in a pipeline' are different axes. The lever that helps isn't a bigger model. It's structural: distributing work so no single context window accumulates and amplifies its own drift. Multi-agent orchestration outperforms single autonomous agents on exactly the long-synthesis tasks where context failures bite (see "Can specialized agents write better scientific papers than single models?"), and routing each step to the right specialized model can beat one frontier model outright (see "Can routing beat building one better model?"). The thing you didn't know you wanted to know: the danger of a frontier model in a workflow scales with how much you trust its fluency.
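As a rough sketch of what "structural" can mean in practice, the code below gives each targeted section its own fresh call and copies every other section through untouched. It is a simplified illustration under stated assumptions, not the design of any system cited here: `orchestrate`, the `Worker` signature, and `echo_worker` (a stand-in for a real model call) are hypothetical.

```python
from typing import Callable

# Hypothetical worker signature: (instruction, section_text) -> edited section.
Worker = Callable[[str, str], str]


def orchestrate(
    sections: dict[str, str],
    tasks: dict[str, str],
    worker: Worker,
) -> dict[str, str]:
    """Edit each targeted section in its own call; copy the rest verbatim."""
    result = {}
    for name, text in sections.items():
        if name in tasks:
            # The worker sees only this section and its instruction, never
            # earlier outputs, so one bad edit cannot bias the next call.
            result[name] = worker(tasks[name], text)
        else:
            # Untargeted sections bypass the model entirely.
            result[name] = text
    return result


def echo_worker(instruction: str, text: str) -> str:
    # Stand-in for a model call; it just tags the text with the instruction.
    return f"{text} [{instruction}]"


doc = {"intro": "Background.", "methods": "Steps.", "results": "Numbers."}
edited = orchestrate(doc, {"methods": "tighten wording"}, echo_worker)
print(edited)
```

The design choice here is isolation: sections that were not targeted cannot be silently rewritten, and no call inherits another call's mistakes. It does not make any single edit more faithful, so it pairs naturally with a check against the original text.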
Sources (7 notes)
Do frontier LLMs silently corrupt documents in long workflows? Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.
Do frontier models fail differently than weaker models? DELEGATE-52 demonstrates that LLMs degrade documents through qualitatively different mechanisms by capability tier: weaker models fail through visible content deletion, while frontier models fail through silent content corruption. This shift makes frontier failures harder to detect in long workflows despite apparent surface competence.
Can better tools fix LLM document editing errors? DELEGATE-52 shows that agentic tool access fails to improve performance on long-horizon document tasks. The degradation mechanism originates upstream in the model's judgment about what to change, not in editing interface limitations.
Do models fail worse when their own errors fill the context? Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.
Why do reasoning models abandon promising solution paths? Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.
Can specialized agents write better scientific papers than single models? PaperOrchestra's specialized agents achieved 50-68% absolute win margins on literature review quality and 14-38% on overall manuscript quality versus autonomous baselines in human evaluation. Distributed coordination prevents single-model context window failures on complex synthesis tasks.
Can routing beat building one better model? Avengers-Pro achieves 7% higher accuracy than GPT-5-medium by routing queries to optimal models per semantic cluster, or matches its performance at 27% lower cost. Ten 7B models with routing previously surpassed GPT-4.1 and 4.5, suggesting selection is a stronger lever than scaling.