How do trajectory quality and memory hygiene differ as evaluation metrics?

This explores two evaluation lenses that get blurred together once agents run long, multi-step tasks: trajectory quality asks whether the *path* an agent takes is sound, while memory hygiene asks whether the *information* the agent is holding stays uncorrupted as it moves through that path. They fail in different places, so measuring one tells you little about the other.

Trajectory quality is fundamentally a judgment-about-process problem, and it's harder than it looks. Moving to interactive, trajectory-level evaluation doesn't dissolve the old benchmark headaches (comparability, reproducibility, mapping evidence to a score); it just relocates them into a higher-dimensional space where two runs are never quite alike (see "Do interactive evaluations actually solve the benchmark comparison problem?"). The most useful work here gets *local*: scoring confidence step-by-step rather than averaging across the whole trace catches reasoning breakdowns that a global average smooths over, and it lets you stop a bad trajectory early (see "Does step-level confidence outperform global averaging for trace filtering?"). The lesson is that trajectory quality is about resolution, meaning where in the path things went right or wrong, not a single headline number.
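
To make the local-versus-global contrast concrete, here is a minimal Python sketch. It assumes each step of a trace already carries a confidence score in [0, 1]; the function names, the window size of 3, and the 0.5 threshold are illustrative assumptions, not values taken from the cited note.

```python
from statistics import mean


def global_confidence(step_conf):
    """Headline number: mean confidence over the whole trace."""
    return mean(step_conf)


def lowest_window_confidence(step_conf, window=3):
    """Local score: the lowest mean confidence over any sliding window."""
    if len(step_conf) <= window:
        return mean(step_conf)
    return min(
        mean(step_conf[i:i + window])
        for i in range(len(step_conf) - window + 1)
    )


def first_breakdown(step_conf, window=3, threshold=0.5):
    """Index of the step where a window first falls below threshold, else None.

    Scanning left to right means a caller could stop the trajectory here
    instead of letting it run to completion.
    """
    for end in range(window, len(step_conf) + 1):
        if mean(step_conf[end - window:end]) < threshold:
            return end - 1  # last step of the offending window
    return None


# Hypothetical trace: confident start, a dip in the middle, confident finish.
trace = [0.9, 0.9, 0.9, 0.2, 0.3, 0.2, 0.9, 0.9, 0.9, 0.9]

print(global_confidence(trace))         # the dip is averaged away
print(lowest_window_confidence(trace))  # the dip dominates the local score
print(first_breakdown(trace))           # 4
```

On this toy trace the global mean stays comfortably high while the worst window is far lower, and the scan flags step index 4, well before the trace ends.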

Memory hygiene is a different kind of measurement: it's about *drift and decay* over time, and it's often silent. Frontier models corrupt roughly a quarter of document content across long delegated relay tasks, with errors compounding round after round and showing no plateau through 50 round-trips, a failure you'd never catch by scoring whether each individual step "looked reasonable" (see "Do frontier LLMs silently corrupt documents in long workflows?"). A trajectory can look locally competent at every step while the underlying state quietly rots. That's the crux of the difference: trajectory quality evaluates decisions; memory hygiene evaluates preservation. A clean trajectory over corrupted memory still produces a wrong answer.
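
One way to see why per-step checks miss this kind of decay is to compare each hand-off against the previous round and against the original. The sketch below is a toy under stated assumptions: `toy_relay` is a hypothetical stand-in for a model hand-off that overwrites one line per round, and the similarity measure is the ratio from Python's standard `difflib.SequenceMatcher`. It does not reproduce the cited study's setup or numbers.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Fraction of matching content between two lists of lines (0.0 to 1.0)."""
    return SequenceMatcher(None, a, b).ratio()


def toy_relay(lines, round_idx):
    """Hypothetical hand-off: silently overwrite one line per round."""
    out = list(lines)
    out[round_idx % len(out)] = "[lost]"
    return out


def drift_report(original, relay, rounds):
    """Track similarity to the previous round and to the original document."""
    current = list(original)
    report = []
    for r in range(rounds):
        nxt = relay(current, r)
        report.append({
            "round": r + 1,
            "vs_previous": similarity(current, nxt),   # the step-level view
            "vs_original": similarity(original, nxt),  # the hygiene view
        })
        current = nxt
    return report


doc = [f"fact {i}" for i in range(10)]
for row in drift_report(doc, toy_relay, rounds=5):
    print(row)
```

In this toy, every round stays close to the one before it, so a step-level check would pass each hand-off, while similarity to the original falls with every round.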

Where the two lenses meet is in how systems *treat* their own history. SkillRL shows you can't process all past episodes the same way: keeping successes as concrete demonstrations but compressing failures into abstracted lessons is both a trajectory-learning choice and a memory-hygiene choice, and uniform consolidation degrades performance (see "Should successful and failed episodes be processed differently?"). And the long-context bottleneck reframes hygiene as a *compute* cost, not a storage one: keeping memory healthy means spending compute to consolidate evicted context into internal state, with quality rising as you do more consolidation passes (see "Is long-context bottleneck really about memory or compute?").
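
The asymmetric treatment of successes and failures can be sketched as a simple routing rule. This is an illustration only, not SkillRL's actual implementation: the `Episode` type, `abstract_lesson`, and `consolidate` are hypothetical names, and the "lesson" here is a naive one-line template where a real system would presumably use a learned or model-written summary.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    task: str
    steps: list
    success: bool


def abstract_lesson(episode):
    """Hypothetical compressor: keep only the task and the final failing step."""
    return f"On '{episode.task}', avoid ending with: {episode.steps[-1]}"


def consolidate(episodes):
    """Keep successes verbatim as demonstrations; compress failures to lessons."""
    memory = {"demonstrations": [], "lessons": []}
    for ep in episodes:
        if ep.success:
            memory["demonstrations"].append((ep.task, list(ep.steps)))
        else:
            memory["lessons"].append(abstract_lesson(ep))
    return memory


episodes = [
    Episode("book flight", ["search", "select", "pay"], success=True),
    Episode("book hotel", ["search", "select", "timeout"], success=False),
]
print(consolidate(episodes))
```

The successful episode keeps its full step list, while the failed one shrinks to a single sentence, which is where the context savings in this kind of scheme would come from.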

The surprise worth taking away: memory isn't always something a system explicitly stores and can be audited for cleanliness. RL agents have been shown to offload memory into their environment, using spatial artifacts as external scratch space, without any memory objective at all (see "Do RL agents accidentally use environments as memory?"). When memory leaks out into the environment like that, hygiene metrics that only inspect the model's internal state will miss it entirely, while a trajectory-level view might be the only place it shows up. The two metrics aren't just different; they catch each other's blind spots.

Sources (6 notes)

Do interactive evaluations actually solve the benchmark comparison problem?

Interactive evaluation relocates core problems—comparability, reproducibility, evidence-to-judgment mapping—into higher-dimensional space rather than solving them. The field needs design protocols and shared standards, not format adoption, to make trajectory scoring interpretable.

Does step-level confidence outperform global averaging for trace filtering?

Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves comparable accuracy gains to naive majority voting with far fewer generated traces, proving trace quality matters more than quantity.

Do frontier LLMs silently corrupt documents in long workflows?

Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Do RL agents accidentally use environments as memory?

A mathematical proof shows that environmental artifacts reduce the information needed to represent history in RL agents. Path-following agents naturally develop memory-like behavior through standard reward optimization, satisfying situated cognition criteria without explicit memory objectives.
