INQUIRING LINE

How does error accumulation in workflows scale across multiple model calls?

This explores what happens to errors when a task is split across many model calls — whether mistakes stay isolated or compound as the chain gets longer.


The corpus is fairly blunt about chained model calls: errors don't just add up, they feed on themselves. The starkest finding is that frontier models silently corrupt about 25% of document content across long delegated relay tasks, and crucially the damage *doesn't plateau*: it keeps compounding through 50 round-trips without the model noticing (see "Do frontier LLMs silently corrupt documents in long workflows?"). The mechanism behind this is what one note calls the self-conditioning effect: once a model's own earlier mistakes are sitting in its context window, they bias the next step, producing non-linear degradation rather than a steady drip (see "Do models fail worse when their own errors fill the context?"). So the answer to 'how does it scale' is: worse than linearly, because each call inherits the contaminated output of the last.
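To make "worse than linearly" concrete, here is a small arithmetic sketch in Python. It is a toy model, not something taken from the notes: `implied_per_trip_loss` assumes a constant, independent loss per round-trip (which the self-conditioning note suggests is too optimistic), and `expected_errors` assumes a made-up `feedback` parameter that raises each step's error rate in proportion to the errors already in context.

```python
def implied_per_trip_loss(total_loss: float, trips: int) -> float:
    """Constant per-trip loss rate that would produce `total_loss` after `trips` trips.

    Assumes every trip independently destroys the same fraction of what remains.
    """
    return 1.0 - (1.0 - total_loss) ** (1.0 / trips)


def intact_fraction(per_trip_loss: float, trips: int) -> float:
    """Fraction of content still intact after `trips` trips at a constant loss rate."""
    return (1.0 - per_trip_loss) ** trips


def expected_errors(base_rate: float, feedback: float, steps: int) -> float:
    """Toy mean-field model of self-conditioning (hypothetical, for illustration).

    Each step's error rate is base_rate * (1 + feedback * errors_so_far),
    so earlier errors make later errors more likely.
    """
    errors = 0.0
    for _ in range(steps):
        rate = min(1.0, base_rate * (1.0 + feedback * errors))
        errors += rate
    return errors


if __name__ == "__main__":
    p = implied_per_trip_loss(0.25, 50)
    print(f"implied loss per trip: {p:.4%}")
    print(f"no feedback, 100 steps:   {expected_errors(0.01, 0.0, 100):.2f} errors")
    print(f"with feedback, 100 steps: {expected_errors(0.01, 1.0, 100):.2f} errors")
```

Under the constant-rate assumption, 25% corruption over 50 round-trips corresponds to losing roughly 0.57% per trip, small enough to go unnoticed on any single hop. With `feedback` set to zero the toy error count grows linearly; with any positive feedback it grows geometrically, which is the qualitative shape the self-conditioning note describes.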

What's surprising is what *doesn't* fix it. Making the model bigger doesn't help: scaling fails to address self-conditioning, and only test-time compute (thinking models that reason before committing) reduces the effect by keeping error-poisoned context from steering the next move (see "Do models fail worse when their own errors fill the context?"). There's even a counterintuitive failure where training on *correct* code trajectories teaches models to tolerate the errors they passed through along the way, which is why GRPO-RoC filters trajectories asymmetrically: positive trajectories are filtered for quality, while diverse failures are kept as negative signal (see "Why do correct code trajectories teach models to tolerate errors?").

The most interesting cross-current is that the architecture of the workflow matters more than the strength of any single call. One line of work (MAKER) shows you can run *million-step* tasks with zero errors, but only by decomposing into the smallest possible subtasks, voting at each step to catch mistakes before they propagate, and flagging correlated errors. The twist: small, non-reasoning models suffice when the decomposition is extreme enough, which inverts the usual instinct to throw a bigger model at a hard problem (see "Can extreme task decomposition enable reliable execution at million-step scale?"). The corpus also favors small models doing narrow, well-defined steps over one large model carrying a long chain, in part because they handle such steps at 10–30× lower cost (see "Can small language models handle most agent tasks?").
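The per-step voting idea can be sketched in a few lines of Python. This is a minimal illustration using plain majority voting, not the voting scheme of the cited work, and every name in it is hypothetical. `make_flaky_increment` stands in for a model call whose correct answer is `state + 1` and which deterministically returns a wrong answer on every `fail_every`-th call, so the behavior is reproducible.

```python
from collections import Counter
from itertools import count
from typing import Callable, Tuple

Step = Callable[[int], int]


def make_flaky_increment(fail_every: int) -> Step:
    """Hypothetical stand-in for one model call.

    The correct answer is state + 1; every `fail_every`-th call returns state + 2.
    """
    calls = count(1)

    def step(state: int) -> int:
        return state + 2 if next(calls) % fail_every == 0 else state + 1

    return step


def voted_step(step: Step, state: int, samples: int) -> int:
    """Call `step` several times on the same state and keep the most common answer."""
    votes = Counter(step(state) for _ in range(samples))
    return votes.most_common(1)[0][0]


def run_chain(step: Step, steps: int, samples: int) -> Tuple[int, int]:
    """Run `steps` hops and return (final_state, number_of_wrong_hops)."""
    state, wrong = 0, 0
    for _ in range(steps):
        nxt = voted_step(step, state, samples)
        if nxt != state + 1:
            wrong += 1
        state = nxt  # a wrong answer is carried forward, as in an unchecked relay
    return state, wrong


if __name__ == "__main__":
    print(run_chain(make_flaky_increment(4), steps=100, samples=1))  # (125, 25)
    print(run_chain(make_flaky_increment(4), steps=100, samples=3))  # (100, 0)
```

With one sample per hop, a quarter of the hops go wrong and the drift is carried into every later step. With three samples per hop, at most one vote in any group of three is wrong, so the majority is always correct. The toy only works because the failures are spread out; if wrong answers cluster on the same hop, a simple majority can lose, which is presumably why the cited work also flags correlated errors.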

If the disease is silent compounding, the treatment the corpus points to is verification *during* the chain, not after it. Checking intermediate reasoning states rather than just the final answer lifted task success from 32% to 87%, because most failures turn out to be process violations that final-answer scoring never sees (see "Where do reasoning agents actually fail during long traces?"). Step-level confidence catches breakdowns that averaging across the whole trace masks, and lets you stop early before a bad trace burns more calls (see "Does step-level confidence outperform global averaging for trace filtering?"). And in multi-agent setups the errors take on distinct shapes (role flipping, flake replies, infinite loops, conversation deviation) that stem from agents lacking a persistent goal and a stable role across turns (see "Why do autonomous LLM agents fail in predictable ways?").
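A short Python sketch shows why a global average can hide a local breakdown. It assumes each step of a trace already comes with a confidence score between 0 and 1 (for example, derived from token probabilities); how those scores are produced, the window size, and the threshold are all assumptions made for illustration, not details from the notes.

```python
from typing import List, Optional


def global_confidence(conf: List[float]) -> float:
    """Mean confidence over the whole trace."""
    return sum(conf) / len(conf)


def first_low_window(conf: List[float], window: int, threshold: float) -> Optional[int]:
    """Index of the step where the sliding-window mean first falls below `threshold`.

    Returns None if no window is that weak. A caller could stop generating
    at the returned index instead of paying for the rest of the trace.
    """
    for end in range(window, len(conf) + 1):
        if sum(conf[end - window:end]) / window < threshold:
            return end - 1
    return None


if __name__ == "__main__":
    # Hypothetical trace: confident, a two-step breakdown, then confident again.
    trace = [0.9] * 8 + [0.3, 0.3] + [0.9] * 10
    print(round(global_confidence(trace), 2))                # 0.84
    print(first_low_window(trace, window=2, threshold=0.5))  # 9
```

The trace as a whole averages 0.84 and would pass a global filter, yet the windowed check flags the breakdown at step index 9, ten steps before the trace would have finished.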

The thing you might not have known you wanted to know: error accumulation is less a property of the model than of how you wire the calls together. Frontier models quietly corrupt a quarter of a document in a long relay, yet small non-reasoning models can run a million steps without error; the difference is whether the workflow lets errors re-enter the context or catches and votes them out at each hop.


Sources (8 notes)

Do frontier LLMs silently corrupt documents in long workflows?

Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.

Do models fail worse when their own errors fill the context?

Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.

Why do correct code trajectories teach models to tolerate errors?

GRPO-RoC filters positive trajectories for quality while preserving diverse failures as negative signal, allowing a 14B model to reach frontier math performance in 510 RL steps, surpassing much larger models with cleaner reasoning.

Can extreme task decomposition enable reliable execution at million-step scale?

MAKER solves million-step tasks with zero errors by decomposing into minimal subtasks, applying voting at each step, and flagging correlated errors. Surprisingly, small non-reasoning models suffice when decomposition is extreme enough, inverting the standard approach to hard problems.

Can small language models handle most agent tasks?

SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.

Where do reasoning agents actually fail during long traces?

Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.

Does step-level confidence outperform global averaging for trace filtering?

Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves comparable accuracy gains to naive majority voting with far fewer generated traces, proving trace quality matters more than quantity.

Why do autonomous LLM agents fail in predictable ways?

Research identifies role flipping, flake replies, infinite loops, and conversation deviation as LLM-specific failures in multi-agent cooperation. These occur because LLMs lack persistent goal representation and stable role identity.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains open: **How does error accumulation in workflows scale across multiple model calls, and what architectural choices prevent or amplify it?**

What a curated library found — and when (dated claims, not current truth):
Findings span May 2025–April 2026. A library of agentic AI papers reports:
- Frontier models silently corrupt ~25% of document content over long relay tasks; damage keeps compounding through 50 round-trips without plateauing and without the model detecting it (2026-04).
- Self-conditioning effect: prior errors in context history amplify future error rates; scaling model size does not mitigate this; only test-time reasoning (thinking models) reduces compounding by decoupling error-poisoned context from next steps (2025–2026).
- Extreme task decomposition into microagents with per-step voting enables million-step execution with zero accumulated error; small non-reasoning models suffice when subtask boundaries are tight enough (2025-11).
- Checking intermediate states during generation lifts task success from 32% to 87% by catching process violations before the final answer; separately, step-level confidence catches breakdowns that global confidence averaging masks (2025-08).
- Multi-agent failures take distinct shapes (role flipping, infinite loops, goal drift) stemming from unstable objectives across turns, not inherent model weakness (2025-08).

Anchor papers (verify; mind their dates):
- 2604.15597: LLMs Corrupt Your Documents When You Delegate
- 2511.09030: Solving a Million-Step LLM Task with Zero Errors
- 2508.13143: Exploring Autonomous Agents: A Closer Look at Why They Fail
- 2508.15260: Deep Think with Confidence

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For the 25% corruption claim, self-conditioning amplification, and the million-step zero-error finding: has newer work (last 6 months) reported stronger baseline models, architectural changes (e.g., improved memory/caching, native long-context handling), or orchestration patterns that relax these limits? Separate the durable tension (does error feedback loop remain real?) from the perishable limitation (is decomposition + voting still necessary, or have foundation models developed inherent self-correction?). Cite what changed.
(2) **Surface the strongest CONTRADICTING work.** Find papers arguing compounding is NOT the dominant failure mode, or that larger models actually *do* resist self-conditioning, or that end-to-end training on agentic tasks eliminates the need for intermediate verification. Flag disagreement explicitly.
(3) **Propose 2 research questions assuming the regime may have shifted:** e.g., "If newer models show robustness to error feedback loops, is the gain from test-time reasoning now only marginal?" or "Does native long-context (e.g., 1M tokens) replace voting as the error-control method of choice?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
