Should artifact-level benchmarks replace token counts for agent evaluation?

This explores whether agent evaluation should measure completed work-products (artifacts, trajectories, task quality) rather than raw token consumption — and the corpus reframes the question: tokens aren't just a bad metric, they're an increasingly meaningless one as agents persist and reuse context.

Should we judge agents by what they produce rather than by how many tokens they burn? The corpus says yes, but for a deeper reason than 'tokens are crude': token counts are quietly losing their meaning as the unit of account. A 115-day case study found that 82.9% of tokens were cache reads, which shifts the honest cost denominator from individual tokens to completed artifacts (see: Do persistent agents really cost less per token?). Once context persists and gets reused, counting tokens is like billing a library by how many times pages are turned rather than by books finished.
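To make the change of denominator concrete, here is a minimal sketch of per-artifact accounting. Only the 82.9% cache-read share comes from the case study; the `UsageLog` type, the token totals, the artifact count, and both prices are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageLog:
    """Hypothetical usage record for one persistent-agent deployment."""
    fresh_tokens: int         # tokens processed at the full (uncached) price
    cache_read_tokens: int    # tokens served from a persisted context cache
    artifacts_completed: int  # finished work-products (reports, patches, ...)


def cache_read_share(log: UsageLog) -> float:
    """Fraction of all tokens that were cache reads."""
    total = log.fresh_tokens + log.cache_read_tokens
    return log.cache_read_tokens / total


def cost_per_artifact(log: UsageLog, fresh_price: float, cache_price: float) -> float:
    """Total spend divided by completed artifacts, pricing cache reads separately."""
    if log.artifacts_completed == 0:
        raise ValueError("no completed artifacts to divide by")
    spend = log.fresh_tokens * fresh_price + log.cache_read_tokens * cache_price
    return spend / log.artifacts_completed


# Illustrative numbers only: the totals are picked so the share matches 82.9%.
example = UsageLog(fresh_tokens=171_000, cache_read_tokens=829_000,
                   artifacts_completed=40)
# Assumed prices per token; real pricing varies by provider and model.
FRESH_PRICE = 3e-6
CACHE_PRICE = 0.3e-6

print(f"cache-read share: {cache_read_share(example):.1%}")
print(f"cost per artifact: {cost_per_artifact(example, FRESH_PRICE, CACHE_PRICE):.4f}")
```

Under these assumed prices, a raw token count treats all one million tokens alike, while the per-artifact figure reflects both the cheaper cache reads and how much finished work the spend actually bought.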

But there's a twist that complicates a clean 'replace tokens' story: token budget turns out to be the strongest single predictor of multi-agent performance. Roughly 80% of the variance in multi-agent results traces to token budget, not coordination intelligence (see: How does test-time scaling work at the agent level?). Search budget in deep-research agents scales the same way reasoning tokens do, rising monotonically and then flattening into diminishing returns (see: Does search budget scale like reasoning tokens for answer quality?). So tokens aren't noise; they're a real performance axis. The problem is that a single token (or token-cost) number collapses a multi-dimensional system into one figure and creates false confidence about deployment readiness. The fix isn't to delete tokens from the scorecard; it's to stop letting them *be* the scorecard.
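For readers who want to see what "share of variance traced to token budget" means operationally, the sketch below computes the R² of a one-variable linear fit of task score on log token budget. This is an assumed, simplified reading of the claim, not the cited study's method, and the budgets and scores are toy values invented for illustration.

```python
import math


def variance_explained(xs: list[float], ys: list[float]) -> float:
    """R^2 of a simple linear fit of ys on xs (squared Pearson correlation)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


# Toy data (hypothetical): token budgets per run and the score each run achieved.
token_budgets = [1e4, 3e4, 1e5, 3e5, 1e6, 3e6]
scores = [0.21, 0.35, 0.44, 0.61, 0.66, 0.80]

log_budgets = [math.log10(b) for b in token_budgets]
r2 = variance_explained(log_budgets, scores)
print(f"share of score variance explained by log token budget: {r2:.2f}")
```

A high R² here says budget predicts the score; it says nothing about whether the resulting artifacts are trustworthy, which is why the figure belongs on the scorecard without being the scorecard.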

What would replace them? The corpus argues for measuring trajectory quality, memory hygiene, context efficiency, and verification cost, the things that actually determine whether an agent works in production (see: What should we actually measure in agent evaluation?). This matters because reliability comes less from model scale than from how well the harness externalizes memory, skills, and protocols (see: Where does agent reliability actually come from?). If reliability lives in the harness, evaluation has to measure the harness, and a token count can't see any of that. Note too that lower token use can signal *better* design: small models handle most agentic subtasks at 10–30× lower cost (see: Can small language models handle most agent tasks?), and autonomous memory folding cuts token overhead while letting agents pause to reconsider strategy (see: Can agents compress their own memory without losing critical details?). A benchmark that reads token volume as a proxy for capability would penalize exactly the architectures we want to reward.
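One way to picture a multi-dimensional scorecard is the sketch below. The four dimension names follow the corpus; the `AgentScorecard` type, the 0-to-1 scaling, the 0.6 floor, and the example scores are all hypothetical. Verification cost is stored inverted as `verification_score` so that higher is better on every field.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AgentScorecard:
    """Hypothetical per-dimension scores, each normalized to the range 0..1."""
    trajectory_quality: float
    memory_hygiene: float
    context_efficiency: float
    verification_score: float  # assumed inversion of verification cost


def collapsed_score(card: AgentScorecard) -> float:
    """The single-number view: an unweighted mean of all dimensions."""
    values = list(asdict(card).values())
    return sum(values) / len(values)


def failing_dimensions(card: AgentScorecard, floor: float = 0.6) -> list[str]:
    """Names of dimensions below the floor, in field order."""
    return [name for name, value in asdict(card).items() if value < floor]


# Illustrative agent: strong trajectories, but memory that accumulates junk.
card = AgentScorecard(
    trajectory_quality=0.95,
    memory_hygiene=0.30,
    context_efficiency=0.90,
    verification_score=0.85,
)

print(f"collapsed score: {collapsed_score(card):.2f}")
print(f"failing dimensions: {failing_dimensions(card)}")
```

The mean looks comfortable while one dimension is clearly below the floor, which is the false confidence a single score creates. Gating on each dimension separately is one possible design choice; weighted schemes are equally plausible.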

The corpus also carries a sharp warning about how artifact-level evaluation itself should be built. Agent-as-a-judge with dynamic evidence collection cut 'judge shift' to 0.27%, versus 31% for a plain LLM-as-a-judge, a gain of roughly 100×, which suggests artifact evaluation done right is far more reliable than scoring outputs in one shot (see: Can agents evaluate AI outputs more reliably than language models?). But the same study found that the evaluator's memory module cascaded errors: the thing you build to judge artifacts is itself an agent with its own failure modes. So the honest answer is that artifact- and trajectory-level benchmarks *should* become the primary lens, with token and compute budgets demoted to one efficiency dimension among several, not because tokens lie but because they answer the wrong question. The takeaway: the most interesting metric isn't 'how cheap per token' but 'how many trustworthy artifacts per unit of context reused', a denominator the field is only just learning to count.

Sources (8 notes)

Do persistent agents really cost less per token?

A 115-day case study found 82.9% of tokens were cache reads. When context persists and reuses, the meaningful cost denominator becomes completed artifacts, not individual tokens.

How does test-time scaling work at the agent level?

Research shows 80% of multi-agent performance variance comes from token budget, not coordination intelligence. LatentMAS and shared-KV-cache approaches offer ways to decouple performance gains from token costs.

Does search budget scale like reasoning tokens for answer quality?

Agentic deep research shows monotonic-to-diminishing-returns curves for search iterations, matching reasoning token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.

What should we actually measure in agent evaluation?

Single-score evaluation collapses multi-dimensional agent behavior and creates false confidence in deployment readiness. Research shows agents need benchmarks for trajectory quality, memory hygiene, context efficiency, and verification cost to reflect actual system performance.

Where does agent reliability actually come from?

Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.

Can small language models handle most agent tasks?

SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.

Can agents compress their own memory without losing critical details?

DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.

Can agents evaluate AI outputs more reliably than language models?

Eight-module agentic evaluation achieved 0.27% judge shift versus 31% for LLM-as-a-Judge on complex tasks. However, the memory module cascaded errors, revealing that agentic systems need error isolation mechanisms to maintain gains.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an agent-evaluation researcher. The question: Should artifact-level benchmarks replace token counts for agent evaluation? A curated library (2023–2026) offers dated claims; your job is to test whether they still hold.

What a curated library found — and when:
• Token counts are becoming a poor denominator: in persistent agentic environments, 82.9% of tokens were cache reads, shifting the true cost unit from per-token to per-completed-artifact (~2026).
• Yet tokens remain the single best predictor of multi-agent performance: ~80% of variance in multi-agent results traces to token budget, not coordination intelligence (~2025).
• Search budget in deep-research agents exhibits test-time scaling laws identical to reasoning tokens — monotonic then diminishing returns (~2025).
• Artifact/trajectory evaluation done right (agent-as-a-judge with dynamic evidence collection) cuts evaluator drift to 0.27% vs. 31% for static LLM-as-judge — a 100× gain (~2026).
• Reliability is a property of the harness (memory, skills, protocols), not the model; token counts cannot measure harness quality (~2026).

Anchor papers (verify; mind their dates):
• arXiv:2506.02153 (2025-06) Small Language Models are the Future of Agentic AI
• arXiv:2605.26870 (2026-05) Persistent AI Agents in Academic Research: A Single-Investigator Implementation Case Study
• arXiv:2604.08224 (2026-04) Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness E
• arXiv:2604.02460 (2026-04) Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinki

Your task:
(1) RE-TEST EACH CONSTRAINT. For the 80%-of-variance claim, does it hold with newer scaling laws, test-time compute budgets, or multi-modal agents? Does the 82.9% cache-read finding generalize beyond the single 115-day study, or are there architectures that break this pattern? Does harness externalization really account for most reliability variance, or has model capability caught up? Separate what's durable (tokens correlate with performance) from what may be resolved (tokens are the *true* denominator).
(2) Surface the strongest CONTRADICTING or SUPERSEDING work in the last ~6 months: any paper showing tokens remain a robust efficiency metric, or that artifact evaluation alone misses critical failure modes?
(3) Propose 2 research questions that assume the regime may have shifted: e.g., "If harness-level externalization becomes commodified, is the bottleneck now data quality in memory modules?" or "Does multi-modal artifact evaluation (code + reasoning trace + memory state) outperform token-only baselines in predicting real-world reliability?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
