SYNTHESIS NOTE
Agentic Systems and Tool Use · Training, RL, and Test-Time Scaling

Do short benchmarks predict how models perform over long workflows?

Most LLM benchmarks measure single-turn or short multi-turn performance, but real workflows involve sustained delegation across many turns. The question is whether top benchmark performers maintain accuracy through longer interaction chains.

Synthesis note · 2026-05-18 · sourced from Flaws

Most LLM benchmarks evaluate single-turn or short multi-turn interaction. DELEGATE-52 extends evaluation to 50-round-trip relays and finds that short-interaction performance does not predict how the same model behaves under sustained delegation. Models that perform comparably on a single edit can diverge dramatically by relay 25.

This is a methodological finding, not a model finding. The standard practice (pick the top scorer on benchmark X, deploy it in workflow Y) implicitly assumes that capability is roughly stationary across interaction lengths. The relay results show the assumption fails. Models exhibit a degradation curve, and that curve has its own shape parameters (slope, decay rate, recovery behavior under interrupted sessions) that benchmarks built for short tasks cannot expose.
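To make the idea of a degradation curve with shape parameters concrete, here is a minimal sketch that summarizes a per-relay accuracy series with two simple statistics: a least-squares slope and a retention ratio. The function names, the choice of statistics, and the numbers are illustrative assumptions. None of them come from DELEGATE-52.

```python
from statistics import mean

def linear_slope(scores):
    """Least-squares slope of score against relay index (1-based)."""
    xs = list(range(1, len(scores) + 1))
    x_bar = mean(xs)
    y_bar = mean(scores)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, scores))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den

def retention(scores):
    """Last-relay score as a fraction of the first-relay score."""
    return scores[-1] / scores[0]

# Hypothetical per-relay accuracies (made up for illustration only).
# Both models start at the same single-step score but decay differently.
model_a = [0.95, 0.90, 0.85, 0.80, 0.75]
model_b = [0.95, 0.94, 0.93, 0.92, 0.91]

for name, scores in [("model_a", model_a), ("model_b", model_b)]:
    print(name, round(linear_slope(scores), 3), round(retention(scores), 3))
```

In this toy data the two models are indistinguishable at relay 1, yet their slopes and retention differ, which is the kind of information a short benchmark would not surface.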

The implication is that "long-horizon performance" deserves status as a distinct evaluation axis, not as a property to be inferred from single-step competence. A model with strong relay-50 retention but mediocre single-turn polish may be more useful for delegated work than the inverse. The paper argues this directly: capability research has been investing heavily in memory management while leaving the underlying long-interaction degradation profile under-measured.

For practitioners, this changes the deployment question from "which model scores highest on X" to "which model maintains accuracy through the interaction length my workflow requires." For benchmark designers, it argues for relay-style evaluations as a default rather than an add-on.
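The reframed deployment question can be sketched as a selection rule that ranks models at the interaction length a workflow actually needs, rather than at relay 1. The helper names and the accuracy values below are hypothetical and chosen only to show how the ranking can flip.

```python
def accuracy_at(curve, horizon):
    """Return the score at a given relay number (1-based)."""
    if horizon < 1 or horizon > len(curve):
        raise ValueError("horizon is outside the measured range")
    return curve[horizon - 1]

def pick_model(curves, horizon):
    """Pick the model name with the highest score at the required horizon."""
    return max(curves, key=lambda name: accuracy_at(curves[name], horizon))

# Hypothetical accuracy-by-relay curves (illustrative values, not measured).
curves = {
    "model_a": [0.96, 0.88, 0.79, 0.70],  # best single-step score, steep decay
    "model_b": [0.92, 0.90, 0.88, 0.86],  # lower single-step score, flat decay
}

print(pick_model(curves, horizon=1))
print(pick_model(curves, horizon=4))
```

Under these assumed numbers the single-step winner is not the winner at the longer horizon, so the selection depends on the horizon passed in.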

Original note title

short-interaction LLM benchmarks do not predict long-horizon delegated-workflow performance