SYNTHESIS NOTE

How should we evaluate agent behavior beyond final answers?

Current agent evaluation focuses on endpoint correctness, but agentic systems unfold over time through interaction trajectories. What evidence and scoring methods should we use to capture process quality, recovery, and coordination?

Synthesis note · 2026-05-28 · sourced from Evaluations

If evaluation is a map E: X → Y from admissible evidence X to judgments Y, then the shift to agentic systems expands both the evidence and the procedure, and it does so in the same way across benchmarks. On the evidence side (X), the unit grows from a single final response to a full interaction-generated trajectory: the sequence of states, actions, tool calls, and environment responses produced as the system acts in a closed loop. On the procedure side (E), final correctness is no longer sufficient; the evaluator must also score process quality, recoverability (can the agent get back on track after an error?), coordination (across tools, environments, and other agents), robustness, efficiency, and system-level performance.
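
To make the change in X and E concrete, here is a minimal Python sketch. Every name in it (`Step`, `Trajectory`, `endpoint_eval`, `trajectory_eval`, the `error` flag) is a hypothetical illustration rather than an interface from any benchmark, and the recovery rule is one simple assumption among many possible definitions.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class Step:
    """One closed-loop turn: what the agent saw, did, and got back."""
    state: Any
    action: str
    tool_call: Optional[dict] = None  # hypothetical shape: {"name": ..., "args": {...}}
    response: Any = None              # environment or tool response
    error: bool = False               # hypothetical flag: this step failed


@dataclass
class Trajectory:
    steps: List[Step] = field(default_factory=list)
    final_answer: Any = None


def endpoint_eval(traj: Trajectory, expected: Any) -> dict:
    """Endpoint-only E: the evidence X is just the final response."""
    return {"correct": traj.final_answer == expected}


def trajectory_eval(traj: Trajectory, expected: Any) -> dict:
    """Expanded E: the evidence X is the whole trajectory."""
    errors = [i for i, s in enumerate(traj.steps) if s.error]
    correct = traj.final_answer == expected
    # Assumed recovery rule: some step failed, a later step succeeded,
    # and the run still ended with the expected answer.
    recovered = (
        bool(errors)
        and correct
        and any(not s.error for s in traj.steps[errors[-1] + 1:])
    )
    return {
        "correct": correct,
        "num_steps": len(traj.steps),  # crude efficiency proxy
        "num_errors": len(errors),
        "recovered": recovered,
    }
```

Both functions receive the same run, but only the second can tell a clean success from a success that followed a failed tool call, which is the kind of distinction the expanded procedure is meant to capture.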

This is a pattern, not a single metric, because the same expansion recurs across otherwise unrelated agent benchmarks. T-Eval scores whether each predicted tool call matches the expected one; AgentBoard's Progress Rate compares the actual trajectory against the expected trajectory; multi-agent frameworks score collaborative efficiency and how well agents distribute tasks dynamically. Each is an instance of "stop scoring the endpoint, start scoring the path." The trajectory becomes the evidence, and the qualities that only exist over time — recovery, coordination, partial progress — become the things judged.
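
The two path-level ideas named above can be sketched in a few lines of Python. These are simplified stand-ins written for illustration: neither function reproduces the actual scoring rules of T-Eval or AgentBoard, and the tool-call shape and subgoal labels are assumptions.

```python
from typing import Dict, Sequence

ToolCall = Dict[str, object]  # hypothetical shape: {"name": ..., "args": {...}}


def step_match_rate(predicted: Sequence[ToolCall],
                    expected: Sequence[ToolCall]) -> float:
    """Fraction of expected steps whose predicted call at the same
    position matches exactly (a stand-in for per-step tool-call scoring)."""
    if not expected:
        return 1.0
    hits = sum(1 for p, e in zip(predicted, expected) if p == e)
    return hits / len(expected)


def progress_rate(reached: Sequence[str], subgoals: Sequence[str]) -> float:
    """Fraction of expected subgoals achieved in order (a stand-in for
    comparing the actual trajectory against the expected one)."""
    if not subgoals:
        return 1.0
    achieved = 0
    for event in reached:
        if achieved < len(subgoals) and event == subgoals[achieved]:
            achieved += 1
    return achieved / len(subgoals)
```

An agent that never reaches the final answer still earns partial credit under either function, which an endpoint-only score cannot express.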

Why it matters: this reframes a scattered set of agent metrics as a coherent move. Once you see the scoring of process, recoverability, and coordination as the trajectory-level analogue of final-answer scoring, you can ask the design-science questions (which artifacts to admit, how to map them to judgments) systematically rather than benchmark by benchmark. The counterpoint: richer evidence is also noisier and harder to standardize, which is why the expansion creates new evaluation challenges rather than dissolving the old ones.

Related concepts in this collection 3

Should interactive evaluation be designed as a unified paradigm?
As AI systems increasingly interact over time with tools and environments, evaluation practice must evolve. Should interactive evaluation be treated as a principled design science with shared protocols, or adopted incrementally as new benchmarks?
the paradigm whose evidence-and-procedure expansion this pattern describes concretely

Can trajectory structure replace hand-annotated process rewards?
Recent methods extract step-level supervision directly from how agent trajectories are structured (trees, expert alignments, tool calls) rather than training separate reward models. Can this structural approach consistently avoid annotation costs?
operationalizes trajectory-as-evidence for training, complementing trajectory-as-evidence for evaluation

Does agent interaction time scale separately from reasoning depth?
Can agents improve by taking more environment steps rather than thinking harder per step? This matters because partially observable tasks like web navigation may need exploration and backtracking that deeper reasoning alone cannot provide.
the capability side: interaction-horizon abilities are exactly what trajectory-level evaluation is needed to measure
