Should artifact-level benchmarks replace token counts for agent evaluation?
This explores whether agent evaluation should measure completed work-products (artifacts, trajectories, task quality) rather than raw token consumption — and the corpus reframes the question: tokens aren't just a bad metric, they're an increasingly meaningless one as agents persist and reuse context.
Judging agents by what they produce rather than by how many tokens they burn is the right direction, the corpus says, but for a deeper reason than 'tokens are crude.' Token counts are quietly becoming meaningless as the unit of account. A 115-day case study found that 82.9% of tokens were cache reads, which means the honest cost denominator shifts from individual tokens to completed artifacts (Do persistent agents really cost less per token?). Once context persists and gets reused, counting tokens is like billing a library by how many times pages are turned rather than by books finished.
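The shift in denominator can be made concrete with a little arithmetic. The sketch below is illustrative only: the 82.9% cache-read share comes from the case study, while the token totals, the artifact count, the two prices, and every name (`UsageLog`, `cost_per_artifact`, and so on) are assumptions chosen to make the numbers easy to follow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageLog:
    """Hypothetical usage record for one agent over some period."""
    fresh_tokens: int         # tokens processed at the full (uncached) price
    cache_read_tokens: int    # tokens served from a persistent context cache
    artifacts_completed: int  # finished work-products in the same period


def cache_read_share(log: UsageLog) -> float:
    """Fraction of all tokens that were cache reads."""
    total = log.fresh_tokens + log.cache_read_tokens
    return log.cache_read_tokens / total


def total_cost(log: UsageLog, fresh_price: int, cache_price: int) -> int:
    """Total spend, with cached tokens billed at their own (assumed) price."""
    return log.fresh_tokens * fresh_price + log.cache_read_tokens * cache_price


def cost_per_artifact(log: UsageLog, fresh_price: int, cache_price: int) -> float:
    """Spend divided by completed artifacts, the denominator argued for above."""
    if log.artifacts_completed == 0:
        raise ValueError("no artifacts completed; cost per artifact is undefined")
    return total_cost(log, fresh_price, cache_price) / log.artifacts_completed


# Assumed numbers: 1,000,000 tokens total, 82.9% of them cache reads,
# 40 finished artifacts, and prices in made-up "credits" per token.
log = UsageLog(fresh_tokens=171_000, cache_read_tokens=829_000, artifacts_completed=40)
share = cache_read_share(log)
per_artifact = cost_per_artifact(log, fresh_price=10, cache_price=1)
```

Under these assumed prices a raw token count treats all 1,000,000 tokens alike, even though most of the spend comes from the 171,000 fresh ones. Dividing by artifacts instead of tokens sidesteps that distortion.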
But a twist complicates a clean 'replace tokens' story: token budget turns out to be the single best *predictor* of multi-agent performance. Roughly 80% of the variance in multi-agent results traces to token budget, not coordination intelligence (How does test-time scaling work at the agent level?). Search budget in deep-research agents scales like reasoning tokens, with answer quality rising monotonically and then hitting diminishing returns (Does search budget scale like reasoning tokens for answer quality?). So tokens aren't noise; they're a real performance axis. The problem is that a single token (or token-cost) number collapses a multi-dimensional system into one figure, and that figure creates false confidence about deployment readiness. The fix isn't to delete tokens from the scorecard; it's to stop letting them *be* the scorecard.
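One way to see whether a budget axis behaves like this is to check measured (budget, score) pairs for the monotonic-then-diminishing shape. The following is a minimal sketch, not a method from the cited notes: the function names and the sample data points are hypothetical, and it assumes every budget value is distinct.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (budget, score); both units are assumptions


def marginal_gains(points: Sequence[Point]) -> List[float]:
    """Score gained per extra unit of budget between consecutive points."""
    pts = sorted(points)
    return [(s1 - s0) / (b1 - b0) for (b0, s0), (b1, s1) in zip(pts, pts[1:])]


def is_monotonic_with_diminishing_returns(points: Sequence[Point],
                                          tol: float = 1e-9) -> bool:
    """True if the score never drops and each marginal gain is no larger than the last."""
    gains = marginal_gains(points)
    monotonic = all(g >= -tol for g in gains)
    diminishing = all(later <= earlier + tol
                      for earlier, later in zip(gains, gains[1:]))
    return monotonic and diminishing


# Hypothetical measurements: doubling the budget helps less each time.
saturating = [(1, 0.40), (2, 0.55), (4, 0.65), (8, 0.70)]
# Hypothetical counter-example: gains accelerate instead of flattening.
accelerating = [(1, 0.40), (2, 0.45), (4, 0.70)]

saturating_ok = is_monotonic_with_diminishing_returns(saturating)
accelerating_ok = is_monotonic_with_diminishing_returns(accelerating)
```

A curve that passes this check supports treating budget as one performance axis. It says nothing about the other dimensions that a single number hides.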
What would replace them? The corpus argues for measuring trajectory quality, memory hygiene, context efficiency, and verification cost, the things that actually determine whether an agent works in production (What should we actually measure in agent evaluation?). This matters because reliability comes less from model scale than from how well the harness externalizes memory, skills, and protocols (Where does agent reliability actually come from?). If reliability lives in the harness, evaluation has to measure the harness, and a token count can't see any of that. Note too that lower token use can signal *better* design: small models handle most agentic subtasks at 10–30× lower cost (Can small language models handle most agent tasks?), and autonomous memory folding cuts token overhead while letting agents pause to reconsider strategy (Can agents compress their own memory without losing critical details?). A benchmark that treats token volume as a proxy for capability would penalize exactly the architectures we want to reward.
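A small data structure shows why a single collapsed score hides what the four dimensions reveal. This is a sketch under stated assumptions: the corpus names the four dimensions, but the 0-to-1 scale, the "higher is better" normalization (including for verification cost), the class and function names, and the example agents are all hypothetical.

```python
from dataclasses import dataclass, asdict
from typing import Tuple


@dataclass(frozen=True)
class AgentScorecard:
    """Assumed 0..1 scores, each normalized so that higher is better."""
    trajectory_quality: float
    memory_hygiene: float
    context_efficiency: float
    verification_cost_score: float  # assumption: 1.0 means cheapest to verify


def collapsed_score(card: AgentScorecard) -> float:
    """The single-number view: a plain average across all dimensions."""
    values = list(asdict(card).values())
    return sum(values) / len(values)


def weakest_dimension(card: AgentScorecard) -> Tuple[str, float]:
    """The dimension most likely to cause trouble in production."""
    return min(asdict(card).items(), key=lambda item: item[1])


# Two hypothetical agents with the same average but very different profiles.
balanced = AgentScorecard(0.70, 0.70, 0.70, 0.70)
lopsided = AgentScorecard(0.95, 0.20, 0.95, 0.70)

same_average = abs(collapsed_score(balanced) - collapsed_score(lopsided)) < 1e-9
weak_name, weak_value = weakest_dimension(lopsided)
```

Both agents average the same, yet the second one has a memory-hygiene score that a single figure would never surface. Reporting the full card, with token or compute budget as one more field if desired, keeps that weakness visible.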
The corpus also carries a sharp warning about how artifact-level evaluation itself should be built. Agent-as-a-judge with dynamic evidence collection cut 'judge shift' to 0.27%, versus 31% for a plain LLM-as-a-judge on complex tasks. That is more than a 100-fold reduction, which suggests artifact evaluation done right is far more reliable than scoring outputs in one shot (Can agents evaluate AI outputs more reliably than language models?). But the same study found that the evaluator's memory module cascaded errors: the thing you build to judge artifacts is itself an agent with its own failure modes, and it needs error isolation to keep those gains. So the honest answer is that artifact- and trajectory-level benchmarks *should* become the primary lens, with token and compute budgets demoted to one efficiency dimension among several, not because tokens lie but because they answer the wrong question. The reader's takeaway: the most interesting metric isn't 'how cheap per token' but 'how many trustworthy artifacts per unit of context reused', a denominator the field is only just learning to count.
Sources (8 notes)
A 115-day case study found 82.9% of tokens were cache reads. When context persists and is reused, the meaningful cost denominator becomes completed artifacts, not individual tokens.
Research shows 80% of multi-agent performance variance comes from token budget, not coordination intelligence. LatentMAS and shared-KV-cache approaches offer ways to decouple performance gains from token costs.
Agentic deep research shows monotonic-to-diminishing-returns curves for search iterations, matching reasoning token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.
Single-score evaluation collapses multi-dimensional agent behavior and creates false confidence in deployment readiness. Research shows agents need benchmarks for trajectory quality, memory hygiene, context efficiency, and verification cost to reflect actual system performance.
Research shows reliable LLM agents externalize three cognitive burdens into a harness layer rather than relying on model scale alone: memory (state persistence), skills (procedural components), and protocols (structured interaction). The harness unifies these externalized components and eliminates the need for the model to solve the same problems repeatedly.
SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.
DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.
Eight-module agentic evaluation achieved 0.27% judge shift versus 31% for LLM-as-a-Judge on complex tasks. However, the memory module cascaded errors, revealing that agentic systems need error isolation mechanisms to maintain gains.