SYNTHESIS NOTE
Agentic Systems and Tool Use

How should we evaluate agent behavior beyond final answers?

Current agent evaluation focuses on endpoint correctness, but agentic systems unfold over time through interaction trajectories. What evidence and scoring methods should we use to capture process quality, recovery, and coordination?

Synthesis note · 2026-05-28 · sourced from Evaluations

If evaluation is a map E: X → Y from admissible evidence to judgments, then the shift to agentic systems changes both the evidence and the procedure, and it does so in the same way across benchmarks. On the evidence side (X), the unit expands from a single final response to a full interaction-generated trajectory: the sequence of states, actions, tool calls, and environment responses produced as the system acts in a closed loop. On the procedure side (E), final correctness is no longer sufficient; the evaluator must also score process quality, recoverability (can the agent get back on track after an error?), coordination (across tools, environments, and other agents), robustness, efficiency, and system-level performance.
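
To make the change in X and E concrete, here is a minimal Python sketch. Every name in it (Step, Trajectory, recovery_rate, evaluate, step_budget) is hypothetical and chosen for illustration; none comes from a specific benchmark. It assumes each step carries a simple error flag, and it treats "recovery" as "some later step succeeded", which is only one of several reasonable definitions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One closed-loop step: what the agent did and what came back."""
    action: str          # e.g. a tool name (hypothetical label)
    observation: str     # environment or tool response
    error: bool = False  # True if this step failed


@dataclass
class Trajectory:
    """The evidence unit X: the whole interaction, not just the endpoint."""
    steps: List[Step] = field(default_factory=list)
    final_answer: Optional[str] = None


def recovery_rate(traj: Trajectory) -> Optional[float]:
    """Fraction of failed steps that are followed by at least one successful step.

    Returns None when the trajectory contains no errors, because
    recovery is undefined in that case.
    """
    error_idx = [i for i, s in enumerate(traj.steps) if s.error]
    if not error_idx:
        return None
    recovered = sum(
        1 for i in error_idx
        if any(not s.error for s in traj.steps[i + 1:])
    )
    return recovered / len(error_idx)


def evaluate(traj: Trajectory, expected_answer: str, step_budget: int) -> dict:
    """A toy E: X -> Y that returns several judgments instead of one."""
    n = len(traj.steps)
    return {
        "final_correct": traj.final_answer == expected_answer,  # endpoint
        "recovery_rate": recovery_rate(traj),                   # recoverability
        "within_budget": n <= step_budget,                      # efficiency
        "num_steps": n,
    }
```

The design point is that evaluate returns a dictionary of judgments rather than a single score: final correctness stays in Y, and trajectory-level qualities sit alongside it. Coordination and robustness are left out here because they need richer evidence than a single error flag.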

This is a pattern, not a single metric, because the same expansion recurs across otherwise unrelated agent benchmarks. T-Eval scores whether each predicted tool call matches the expected one; AgentBoard's Progress Rate compares the actual trajectory against the expected trajectory; multi-agent frameworks score collaborative efficiency and how well agents distribute tasks dynamically. Each is an instance of "stop scoring the endpoint, start scoring the path." The trajectory becomes the evidence, and the qualities that only exist over time — recovery, coordination, partial progress — become the things judged.
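
The two path-level ideas named above can be sketched in a few lines of Python. These are simplified stand-ins written for illustration, not the actual T-Eval or AgentBoard definitions: step_match_rate assumes tool calls are plain dictionaries compared position by position for exact equality, and progress_rate assumes the expected trajectory is given as an ordered list of subgoal states. Both function names and the example state labels are hypothetical.

```python
from typing import Any, Dict, Sequence

ToolCall = Dict[str, Any]  # e.g. {"name": "search", "args": {"q": "..."}}


def step_match_rate(predicted: Sequence[ToolCall],
                    expected: Sequence[ToolCall]) -> float:
    """Share of expected tool calls matched exactly at the same position."""
    if not expected:
        # With nothing expected, only an empty prediction counts as a match.
        return 1.0 if not predicted else 0.0
    # zip stops at the shorter sequence, so missing calls count as misses.
    matches = sum(1 for p, e in zip(predicted, expected) if p == e)
    return matches / len(expected)


def progress_rate(states: Sequence[str], subgoals: Sequence[str]) -> float:
    """Share of ordered subgoals reached, in order, along the actual path.

    Extra or irrelevant states are ignored; a subgoal only counts once
    all earlier subgoals have been reached.
    """
    if not subgoals:
        return 1.0
    next_goal = 0
    for state in states:
        if next_goal < len(subgoals) and state == subgoals[next_goal]:
            next_goal += 1
    return next_goal / len(subgoals)
```

For example, an agent that visits the states open, search, noise, select against the subgoals open, search, select, checkout reaches three of four subgoals. A final-answer metric would record that run as a plain failure, while a path-level score credits the partial progress.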

Why it matters: this reframes a scattered set of agent metrics as a coherent move. Once you see the scoring of process, recoverability, and coordination as the trajectory-level analogue of final-answer scoring, you can ask the design-science questions (which artifacts to admit, and how to map them to judgments) systematically rather than benchmark by benchmark. The counterpoint: richer evidence is also noisier and harder to standardize, which is precisely why the expansion creates new evaluation challenges rather than dissolving the old ones.

Original note title

agent evaluation expands evidence from final responses to interaction trajectories scoring process recoverability and coordination