Do frontier LLMs silently corrupt documents in long workflows?

Explores whether advanced language models introduce undetectable errors when delegated multi-step tasks, and whether degradation continues accumulating beyond initial rounds of processing.

Synthesis note · 2026-05-18 · sourced from Flaws

Delegation requires trust: the expectation that an LLM will execute a task without introducing errors. DELEGATE-52 stress-tests that expectation with 310 work environments across 52 domains (including coding, crystallography, music notation, and genealogy) and a round-trip relay protocol in which each task is paired with its inverse, so a perfect model would recover the original document exactly.
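The round-trip idea can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual harness: the function names (`run_relay`, `similarity`), the toy "tasks", and the use of `difflib` as a similarity score are all assumptions made for the example. In a real setup each forward and inverse step would be a model call.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple

# A task is any text-to-text transformation; in practice this would be an LLM call.
Edit = Callable[[str], str]


def similarity(a: str, b: str) -> float:
    """Stand-in score in [0, 1]; the benchmark's own metric is not reproduced here."""
    return SequenceMatcher(None, a, b).ratio()


def run_relay(original: str, task_pairs: List[Tuple[Edit, Edit]], rounds: int) -> List[float]:
    """Apply forward+inverse pairs repeatedly and score each result against the original."""
    doc = original
    scores = []
    for i in range(rounds):
        forward, inverse = task_pairs[i % len(task_pairs)]
        doc = inverse(forward(doc))  # one round trip
        scores.append(similarity(original, doc))
    return scores


# Hypothetical toy tasks standing in for model behaviour.
def to_upper(text: str) -> str:
    return text.upper()


def to_lower_exact(text: str) -> str:
    return text.lower()  # exact inverse for an all-lowercase document


def to_lower_lossy(text: str) -> str:
    lowered = text.lower()
    return lowered[:-1] if len(lowered) > 1 else lowered  # silently drops one character


original = "alpha beta gamma delta epsilon"
perfect = run_relay(original, [(to_upper, to_lower_exact)], rounds=5)
lossy = run_relay(original, [(to_upper, to_lower_lossy)], rounds=5)
print(perfect)
print([round(s, 3) for s in lossy])
```

With an exact inverse the score stays at 1.0 on every round; with the lossy inverse each round trip looks nearly intact on its own while the score against the original keeps falling.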

Across 19 LLMs, even frontier systems (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows. Weaker models fail more severely. The degradation curve decelerates but does not plateau: the first half of an extended relay accounts for 2-3x more loss than the second half, yet the strongest model still drops below 60% accuracy by round-trip 50. Distractor files, longer documents, and longer interactions all increase the corruption rate.

The structural problem: errors are sparse but severe, and they compound silently. A user reviewing one or two outputs sees competent work. A user delegating an end-to-end workflow gets a document that looks intact but contains accumulated drift in places they did not check. The trust assumption that holds at single-step interaction collapses at the timescale where delegation is actually valuable.
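A back-of-the-envelope calculation shows why spot checks mislead. The sketch below assumes, purely for illustration, a constant per-round-trip retention rate and uses 60% at round-trip 50 as the target. The reported curve decelerates, so a constant rate is a simplification rather than a fit, and the helper names are hypothetical.

```python
def retention_after(rounds: int, per_round_retention: float) -> float:
    """Fraction of content still intact after `rounds` round trips at a constant rate."""
    return per_round_retention ** rounds


def constant_rate_for(target: float, rounds: int) -> float:
    """Constant per-round retention that would land exactly on `target` after `rounds`."""
    return target ** (1.0 / rounds)


# Assumption: treat "60% by round-trip 50" as an illustrative end point.
rate = constant_rate_for(0.60, 50)
print(f"per-round retention: {rate:.4f}")  # roughly 0.99, i.e. about 1% lost per round trip
print(f"after 2 rounds:  {retention_after(2, rate):.3f}")
print(f"after 50 rounds: {retention_after(50, rate):.3f}")
```

Under this assumption a reviewer who checks one or two round trips sees about 98% of the document intact, which is the same process that ends at 60% by round-trip 50.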

This is not a "weak model" finding. It is a ceiling on delegated work at the current frontier, one that scales unfavorably with exactly the workflow length that makes delegation attractive.
