Should long-horizon performance be measured as a separate evaluation axis?

This explores whether 'can the model hold up over long, multi-step tasks' deserves its own slot in evaluation — separate from the usual single-task success score — rather than being folded into one number.

Long-horizon performance here means how well a model holds together across many steps or a sustained delegated task. The corpus answers fairly emphatically that it should get its own axis, because short interactions simply don't predict long ones. DELEGATE-52 ran models across 50-round-trip relays and found that single-turn rankings collapsed by relay 25: models that looked equivalent on standard benchmarks diverged into wildly different degradation curves (see 'Do short benchmarks predict how models perform over long workflows?'). If short-task scores can't forecast sustained performance, then long-horizon ability isn't a finer-grained version of the same thing; it's a different quantity that needs its own measurement.
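One way to make "short scores don't predict long ones" checkable is to compare how models rank on a single turn with how they rank deep into a relay. The sketch below computes a Spearman rank correlation between the two. The model names and accuracy numbers are invented purely for illustration and are not DELEGATE-52 results; the helper names `ranks` and `spearman` are likewise hypothetical.

```python
from typing import List


def ranks(values: List[float]) -> List[float]:
    """Return 1-based ranks (highest value gets rank 1); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(a: List[float], b: List[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


# ILLUSTRATIVE, MADE-UP accuracies for four hypothetical models.
models = ["model_a", "model_b", "model_c", "model_d"]
short_scores = [0.92, 0.91, 0.90, 0.89]    # single-turn accuracy
relay25_scores = [0.40, 0.85, 0.60, 0.88]  # accuracy at relay 25

rho = spearman(short_scores, relay25_scores)
print(f"rank correlation, single turn vs relay 25: {rho:.2f}")
```

With these toy numbers the correlation comes out negative: the single-turn ordering is nearly inverted by relay 25. A value near 1 would instead suggest the short benchmark is a usable proxy and a separate axis adds little.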

The stronger version of the argument is that capability isn't a scalar at all; it's a vector. One note decomposes agent capability into at least five separable axes (task success, privacy compliance, long-horizon retention, mode-shift behavior, ecosystem readiness) and observes that models topping one axis routinely rank low on others, which makes any single composite score systematically misleading (see 'Does a single benchmark score actually predict agent readiness?'). Long-horizon retention is one named coordinate in that vector, so the question 'should it be separate?' is really a special case of 'should we stop collapsing multi-dimensional behavior into one number?' A companion note pushes the same way, arguing that evaluation should measure trajectory quality, memory hygiene, context efficiency, and verification cost rather than just whether the final answer was right (see 'What should we actually measure in agent evaluation?').
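The vector framing can be sketched in a few lines. The example below stores the five axes named above as a record and shows two agents whose composite scores are identical even though they lead on different axes. The agent names, the 0-100 scale, the scores, and the unweighted-mean composite are all assumptions made for illustration; none of them come from the cited notes.

```python
from dataclasses import dataclass, fields
from typing import Dict


@dataclass(frozen=True)
class CapabilityVector:
    """One score per axis, on a hypothetical 0-100 scale."""
    task_success: int
    privacy_compliance: int
    long_horizon_retention: int
    mode_shift_behavior: int
    ecosystem_readiness: int


def composite(vector: CapabilityVector) -> float:
    """Collapse the vector to one number (unweighted mean; an assumed choice)."""
    values = [getattr(vector, f.name) for f in fields(vector)]
    return sum(values) / len(values)


def axis_leaders(agents: Dict[str, CapabilityVector]) -> Dict[str, str]:
    """For each axis, return the name of the agent with the highest score."""
    return {
        f.name: max(agents, key=lambda name: getattr(agents[name], f.name))
        for f in fields(CapabilityVector)
    }


# ILLUSTRATIVE, MADE-UP scores for two hypothetical agents.
agents = {
    "agent_x": CapabilityVector(90, 50, 40, 70, 50),
    "agent_y": CapabilityVector(50, 70, 80, 40, 60),
}

for name, vector in agents.items():
    print(name, "composite:", composite(vector))
print(axis_leaders(agents))
```

Both toy agents get the same composite, yet `agent_x` leads on task success and mode-shift behavior while `agent_y` leads on the other three axes, including long-horizon retention. That is the sense in which a single number can hide exactly the difference a deployment decision depends on.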

Here's the twist a curious reader might not expect: making long-horizon a separate axis doesn't solve the measurement problem, it relocates it. Once you score whole trajectories instead of one-shot answers, the old headaches (comparability, reproducibility, mapping evidence to a judgment) don't disappear; they reappear in a higher-dimensional space and arguably get harder (see 'Do interactive evaluations actually solve the benchmark comparison problem?'). So 'add an axis' is necessary but not sufficient; without shared protocols, trajectory scores are just noisier numbers.

There's also a deeper challenge to the whole framing. One single-investigator case study built on 75,671 telemetry records argues that the real unit of evaluation isn't the model or even the episode but the coupled human-agent-environment system, because the capability gains that matter accumulate across sessions through reusable procedures and built-up context that no single-trajectory test can see (see 'Should we evaluate deployed agents as whole environments instead?'). By that logic, 'long horizon' might not be one axis to bolt on but the thing that dissolves the episode as a unit entirely. This connects to how the field is rethinking memory itself: rather than the old short-term/long-term split, a 2025 survey reframes agent memory along forms, functions, and dynamics, treating temporal span as an emergent property rather than an architectural category (see 'Can three axes replace the short-term long-term memory split?').

It's worth flagging a measurement trap from an adjacent corner: the exploration-exploitation 'trade-off' in RLVR turned out to be an artifact of measuring at the token level, and it vanished under hidden-state analysis (see 'Is the exploration-exploitation trade-off actually fundamental?'). The cautionary lesson for long-horizon eval is that the axis you add is only as honest as the unit you measure at: pick the wrong granularity and you'll manufacture a phenomenon that isn't there. The corpus's consensus is that long-horizon performance does deserve a separate axis, but the harder, more interesting work is agreeing on what unit and what protocol you measure it with.

Sources (7 notes)

Do short benchmarks predict how models perform over long workflows?

DELEGATE-52 evaluated models across 50-round-trip relays and found short-interaction performance does not predict sustained delegation accuracy. Models ranking similarly on single-turn tasks diverged dramatically by relay 25, revealing degradation curves invisible to standard benchmarks.

Does a single benchmark score actually predict agent readiness?

Capability decomposes into task success, privacy compliance, long-horizon retention, mode-shift behavior, and ecosystem readiness. Models ranked highest on one axis often rank lower on others, making single-score evaluations systematically misleading for real deployment.

What should we actually measure in agent evaluation?

Single-score evaluation collapses multi-dimensional agent behavior and creates false confidence in deployment readiness. Research shows agents need benchmarks for trajectory quality, memory hygiene, context efficiency, and verification cost to reflect actual system performance.

Do interactive evaluations actually solve the benchmark comparison problem?

Interactive evaluation relocates core problems (comparability, reproducibility, evidence-to-judgment mapping) into a higher-dimensional space rather than solving them. The field needs design protocols and shared standards, not format adoption, to make trajectory scoring interpretable.

Should we evaluate deployed agents as whole environments instead?

A single-investigator case study with 75,671 telemetry records shows that capability gains come from accumulated context and reusable procedures that only exist across sessions with human direction. Model-level and episode-level evaluation cannot measure these cross-session variables.

Can three axes replace the short-term long-term memory split?

A 2025 survey reframes agent memory along forms (token/parametric/latent), functions (factual/experiential/working), and dynamics (formation/evolution/retrieval), showing that short/long-term phenomena emerge from temporal patterns rather than architectural separation. This enables precise system comparison and replaces vague implementation-based claims.

Is the exploration-exploitation trade-off actually fundamental?

Hidden-state analysis using Effective Rank metrics shows near-zero correlation between exploration and exploitation, revealing that the trade-off emerges only at the token level. VERL demonstrates that both can be enhanced simultaneously, achieving 21.4% accuracy gains on Gaokao 2024.
