Why do frontier model failures in document editing go undetected by users?
This explores why frontier models' document-editing mistakes slip past users — the answer hinges on a capability-tier shift: better models stop deleting and start silently corrupting, which looks like competence.
The question is not whether frontier models make mistakes at all, but why those mistakes slip past users. The short version: the more capable the model, the more its failures disguise themselves as success. Testing across 19 models and 52 domains found that even advanced systems corrupt roughly 25% of document content over long delegated workflows, with errors compounding quietly across 50 round-trips without plateauing (Do frontier LLMs silently corrupt documents in long workflows?). The corruption doesn't announce itself, which is exactly the problem.
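To get a feel for what roughly 25% corruption over 50 round-trips means per step, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that every round-trip preserves the same constant fraction of the content. The source does not claim the decay is geometric, and the function name is hypothetical.

```python
def per_round_trip_retention(total_retained: float, round_trips: int) -> float:
    """Constant per-step retention r such that r ** round_trips == total_retained.

    Assumption (not from the source): damage is spread evenly and
    multiplicatively across round-trips.
    """
    return total_retained ** (1.0 / round_trips)


# 25% corrupted after 50 round-trips means 75% of the content survives.
r = per_round_trip_retention(0.75, 50)
loss_per_trip = 1.0 - r

print(f"retained per round-trip: {r:.4%}")
print(f"lost per round-trip:     {loss_per_trip:.4%}")
```

Under that assumption each round-trip would need to damage well under 1% of the document, which is far less than a skim of any single step would be likely to notice.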
Detection fails because models break things differently depending on capability tier. Weaker models tend to delete content, and missing text is visible, so a user notices. Frontier models instead rewrite, reword, and subtly alter meaning while keeping the surface fluent and plausible (Do frontier models fail differently than weaker models?). A document that still reads smoothly and looks complete gives the reader no signal that something is wrong, so the human skim that would catch a deletion sails right past a corruption.
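The difference between a visible deletion and a silent corruption can be illustrated with a toy comparison. The sketch below uses made-up example text and hypothetical helper names. It contrasts a crude "does the document look complete" signal (a length ratio) with a sentence-level diff built on Python's standard `difflib`. The sentence splitting is deliberately naive and would not survive abbreviations or decimals.

```python
import difflib


def length_ratio(original: str, edited: str) -> float:
    """Crude stand-in for a skim: how much of the document's bulk is still there."""
    return len(edited) / len(original)


def changed_sentences(original: str, edited: str) -> list[tuple[str, str]]:
    """Return (before, after) pairs for every sentence-level difference."""
    a = [s.strip() for s in original.split(".") if s.strip()]
    b = [s.strip() for s in edited.split(".") if s.strip()]
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((" | ".join(a[i1:i2]), " | ".join(b[j1:j2])))
    return changes


# Illustrative text only; not taken from any benchmark.
original = "The invoice total is 500 USD. Payment is due in 30 days. Late fees apply after that."
deleted = "The invoice total is 500 USD. Late fees apply after that."
corrupted = "The invoice total is 500 USD. Payment is due in 60 days. Late fees apply after that."

print(length_ratio(original, deleted))      # noticeably below 1.0
print(length_ratio(original, corrupted))    # same length as the original
print(changed_sentences(original, corrupted))
```

The length signal drops for the deletion but is unchanged for the corruption, while the diff surfaces both. That is the gap the paragraph above describes: checks tuned to "is anything missing" say nothing about "is what remains still true".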
It's tempting to assume better tools or an agentic editing interface would catch this, but the failure is upstream of the tools. Giving the model richer editing capabilities doesn't improve reliability, because the error originates in the model's *judgment about what to change*, not in how it executes the change (Can better tools fix LLM document editing errors?). Two deeper mechanisms feed this. First, errors are self-amplifying: once a mistake enters the context history, it biases everything downstream, producing non-linear degradation that scaling alone doesn't fix (Do models fail worse when their own errors fill the context?). Second, the thing we normally check, the final output, is the wrong place to look. Most failures in long traces are violations *during* the process, invisible to anyone scoring only the end result; intermediate verification raised task success from 32% to 87% precisely because it catches what final-answer checking misses (Where do reasoning agents actually fail during long traces?).
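A minimal sketch of what checking during the process, rather than at the end, might look like for document edits. This is not the verification method behind the 32% to 87% result. It is a toy in which each step declares which lines it is allowed to touch, and any step that changes something else is rejected before its output can enter the history. All names and the example data are hypothetical.

```python
from typing import Callable

Doc = list[str]
# A step is (edit function, set of line indices the step is allowed to change).
Step = tuple[Callable[[Doc], Doc], set[int]]


def out_of_scope_changes(before: Doc, after: Doc, allowed: set[int]) -> list[int]:
    """Indices that changed even though the step was not allowed to touch them."""
    if len(before) != len(after):
        # Simplification: treat any added or removed line as out of scope.
        return list(range(max(len(before), len(after))))
    return [
        i for i, (x, y) in enumerate(zip(before, after))
        if x != y and i not in allowed
    ]


def run_with_checks(doc: Doc, steps: list[Step]) -> tuple[Doc, list[int]]:
    """Apply steps in order, rejecting any step that edits outside its scope."""
    rejected = []
    for n, (edit, allowed) in enumerate(steps):
        candidate = edit(list(doc))  # work on a copy so a bad edit cannot leak
        if out_of_scope_changes(doc, candidate, allowed):
            rejected.append(n)       # keep the last verified state
        else:
            doc = candidate
    return doc, rejected


def fix_title(d: Doc) -> Doc:
    d[0] = "title: Q3 Report"
    return d


def sloppy_owner_edit(d: Doc) -> Doc:
    d[2] = "owner: operations"
    d[1] = "revenue: 210"  # silent, fluent-looking corruption outside the request
    return d


doc = ["title: Q3 report", "revenue: 120", "owner: ops"]
final, rejected = run_with_checks(doc, [(fix_title, {0}), (sloppy_owner_edit, {2})])
print(final)
print(rejected)
```

The second step is rejected as a whole because it altered a line it had no mandate to change, so the corrupted revenue figure never becomes part of the state that later steps build on. A check on the final document alone would have had nothing to compare against.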
There's a broader pattern here worth sitting with. Fluency is a poor proxy for correctness, and our detection instincts are calibrated to fluency. LLM judges fall for the same trick: they score responses higher when they carry authoritative-looking references, even fabricated ones, or rich formatting, independent of whether the content is actually good (Can LLM judges be tricked without accessing their internals?). Whether the evaluator is a human skimming a polished document or another model grading output, surface competence masks substantive error. The frontier model's growing skill at producing convincing prose is the very thing that makes its errors harder to see.
If there is a way out, it points toward grounded refusal and process-level checking rather than trusting the final artifact: systems that constrain generation to what's verifiably supported, and refuse rather than confabulate, trade coverage for integrity (Can RAG systems refuse to answer without reliable evidence?). The unsettling takeaway: as models get better at the surface, the burden of verification shifts away from "does this look right" toward "can I prove each change was warranted", and most users have no way to do the latter.
Sources (7 notes)
Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.
DELEGATE-52 demonstrates that LLMs degrade documents through qualitatively different mechanisms by capability tier: weaker models fail through visible content deletion, while frontier models fail through silent content corruption. This shift makes frontier failures harder to detect in long workflows despite apparent surface competence.
DELEGATE-52 shows that agentic tool access fails to improve performance on long-horizon document tasks. The degradation mechanism originates upstream in the model's judgment about what to change, not in editing interface limitations.
Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.
Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.
Research shows LLM evaluators systematically score higher when responses include fake references or rich formatting, independent of content quality. These biases are exploitable without model access, undermining AI benchmark credibility.
A multilingual RAG system for noisy historical newspapers succeeds by aggressively expanding retrieval while constraining generation to only grounded answers. The grounded-refusal prompt prevents hallucination when OCR errors and language drift degrade source quality, trading coverage for integrity.