SYNTHESIS NOTE
Agentic Systems and Tool Use

What should we actually measure in agent evaluation?

Current agent benchmarks reduce performance to a single success metric, potentially hiding critical differences in how agents operate. What dimensions beyond task accuracy should evaluation frameworks capture?

Synthesis note · 2026-05-28 · sourced from Agents

Agent evaluation has inherited the model-centric habit of reducing performance to a single number: final-task success or benchmark accuracy. The "system scaling" framing argues that this reduction is increasingly inadequate, because agent behavior emerges from the interaction of the foundation model with a memory substrate, a context constructor, a skill-routing layer, an orchestration loop, and a verification-and-governance layer. A one-shot success score collapses all of this into a binary that hides how the agent got there. Two agents with identical task-success rates can differ enormously in how much they spent, how much context they wasted, how clean their memory stayed, and how reliably they verified their own actions.

The proposed alternative is a research agenda for harness-level benchmarks that measure trajectory quality, memory hygiene, context efficiency, communication fidelity, verification cost, and safe evolution over time. The point is that the same model "projected onto different harnesses produce qualitatively different agents", so evaluation must measure the system, not just the model. The counterpoint is that multi-dimensional metrics are harder to optimize and compare, and task success remains the outcome users ultimately care about. But success-only scores create false confidence in deployment readiness. The harness-level view matters because it tells builders what to instrument: the process, not only the outcome.
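
To make "instrument the process" concrete, here is a minimal Python sketch of a per-run record and a multi-dimensional summary. Everything in it is an assumption for illustration: the `RunRecord` fields, the `summarize` function, the metric definitions, and the sample numbers are hypothetical and do not come from any benchmark the note cites. Only three of the process dimensions are covered (cost, context efficiency, verification rate); memory hygiene, communication fidelity, and safe evolution have no obvious scalar form and are left out.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RunRecord:
    """One agent run. All field names are illustrative assumptions."""
    success: bool               # final-task outcome
    cost_usd: float             # total spend for the run
    context_tokens: int         # tokens placed into context during the run
    useful_context_tokens: int  # tokens later judged relevant (judging method left open)
    verification_calls: int     # self-checks the agent performed
    total_steps: int            # actions taken in the trajectory


def summarize(runs: list[RunRecord]) -> dict[str, float]:
    """Aggregate runs into a small report instead of a single success number."""
    total_context = sum(r.context_tokens for r in runs)
    total_steps = sum(r.total_steps for r in runs)
    return {
        "success_rate": mean(1.0 if r.success else 0.0 for r in runs),
        "mean_cost_usd": mean(r.cost_usd for r in runs),
        # Share of context tokens that were judged useful.
        "context_efficiency": sum(r.useful_context_tokens for r in runs) / total_context,
        # Self-checks per action taken.
        "verifications_per_step": sum(r.verification_calls for r in runs) / total_steps,
    }


# Hypothetical data: same model, two harnesses, identical success rates.
harness_a = [
    RunRecord(True, 0.10, 4000, 3000, 2, 10),
    RunRecord(False, 0.10, 4000, 3000, 2, 10),
]
harness_b = [
    RunRecord(True, 0.50, 20000, 5000, 0, 40),
    RunRecord(False, 0.50, 20000, 5000, 0, 40),
]

report_a = summarize(harness_a)
report_b = summarize(harness_b)

for name, report in (("A", report_a), ("B", report_b)):
    print(name, report)
```

In this made-up data both harnesses succeed on half of their runs, yet harness B spends more, fills its context mostly with unused tokens, and never verifies its actions. A success-only score would rank the two as equal.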

Original note title

agent evaluation must move beyond one-shot task success to trajectory quality memory hygiene context efficiency and verification cost