What reporting standards would make interactive evaluation scores comparable across benchmarks?

This explores what would have to be written down and reported — protocols, axes, evidence trails — for interactive evaluation scores to mean the same thing from one benchmark to the next, rather than each benchmark inventing its own incomparable yardstick.

The corpus's blunt starting point: interactive evaluation doesn't solve the comparison problem, it relocates it. The old headaches (comparability, reproducibility, the chain from evidence to judgment) don't vanish when you move from single answers to full trajectories; they reappear in higher-dimensional form, where there is even more room for two benchmarks to disagree about what they are scoring [Do interactive evaluations actually solve the benchmark comparison problem?]. So the reporting-standards question is really: what do you have to disclose to make a trajectory score interpretable by someone who didn't design the benchmark?

The strongest answer in the corpus is that interactive evaluation has to be treated as a designed paradigm with explicit protocols and shared reporting standards, not a pile of disconnected benchmarks each adopting the format their own way. The distinction is load-bearing: design it as one system and you prevent fragmentation; let benchmarks proliferate and incomparability is baked in from the start [Should interactive evaluation be designed as a unified paradigm?]. Part of that standard is widening what counts as reportable evidence beyond the final response, so that the steps are captured and not just the verdict.

What would actually go in such a report? The corpus suggests a single number is the core problem, not the solution. Agent capability is a vector across separable axes (task success, privacy compliance, long-horizon retention, mode-shift behavior, ecosystem readiness), and models that top one axis often rank lower on another, so any benchmark reporting a single score is systematically misleading about deployment readiness [Does a single benchmark score actually predict agent readiness?]. The complementary view says to report the dimensions that a one-shot success rate collapses: trajectory quality, memory hygiene, context efficiency, verification cost [What should we actually measure in agent evaluation?]. Comparability, on this reading, comes from agreeing on a shared set of axes to report, not from agreeing on one headline metric.
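One way to picture a multi-axis report is as a record that cannot be built unless every agreed axis is filled in. The sketch below is a minimal illustration, not a proposed standard: the axis names come from the notes above, while the `AxisReport` class, the 0 to 100 scale, and the two models with their numbers are invented for the example.

```python
from dataclasses import dataclass

# Axis names are taken from the text; the schema around them is hypothetical.
AXES = (
    "task_success",
    "privacy_compliance",
    "long_horizon_retention",
    "mode_shift_behavior",
    "ecosystem_readiness",
)


@dataclass(frozen=True)
class AxisReport:
    model: str
    scores: dict  # axis name -> integer score on an assumed 0-100 scale

    def __post_init__(self):
        # Refuse to construct a report that silently omits an axis.
        missing = [axis for axis in AXES if axis not in self.scores]
        if missing:
            raise ValueError(f"report for {self.model} omits axes: {missing}")


def ranking(reports, axis):
    """Return model names ordered best-first on one axis."""
    ordered = sorted(reports, key=lambda r: r.scores[axis], reverse=True)
    return [r.model for r in ordered]


def headline(report):
    """The single averaged number that the text argues is misleading."""
    return sum(report.scores[axis] for axis in AXES) / len(AXES)


# Made-up numbers, for illustration only.
reports = [
    AxisReport("model_a", {
        "task_success": 90, "privacy_compliance": 40,
        "long_horizon_retention": 70, "mode_shift_behavior": 60,
        "ecosystem_readiness": 50,
    }),
    AxisReport("model_b", {
        "task_success": 70, "privacy_compliance": 90,
        "long_horizon_retention": 60, "mode_shift_behavior": 50,
        "ecosystem_readiness": 40,
    }),
]

print(ranking(reports, "task_success"))        # ['model_a', 'model_b']
print(ranking(reports, "privacy_compliance"))  # ['model_b', 'model_a']
print([headline(r) for r in reports])          # [62.0, 62.0]
```

In this toy case the two models tie on the averaged headline number yet swap places between the task-success and privacy axes, which is exactly the information a single score hides.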

Two more standards fall out of the corpus once you ask what makes a reported score trustworthy. First, disclose the judge and its stability: an eight-module agentic evaluator that collects evidence as it judges cut "judge shift" to 0.27%, versus 31% for a plain LLM-as-a-Judge on complex tasks. The same study found that its memory module quietly cascaded errors, so a reporting standard has to surface judge variance and failure isolation, not just the headline accuracy [Can agents evaluate AI outputs more reliably than language models?]. Second, disclose contamination provenance. Qwen2.5-Math-7B reconstructed 54.6% of MATH-500 from partial prompts yet scored 0.0% on LiveMathBench, a benchmark released after its training cutoff, which means a score is uninterpretable without knowing whether the test data leaked into training [Does RLVR success on math benchmarks reflect genuine reasoning improvement?]. Reporting the clean-versus-contaminated split is itself a comparability standard.
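A clean-versus-contaminated split can be reported with very little machinery. The sketch below is one hedged way to do it, assuming each test item carries a release date and the model's training cutoff is known. The function name `split_accuracy`, the dates, and the outcomes are all hypothetical, and treating every pre-cutoff item as "possibly contaminated" is a conservative assumption rather than proof of leakage.

```python
from datetime import date


def split_accuracy(items, training_cutoff):
    """Report accuracy separately for items released before and after the cutoff.

    items: iterable of (release_date, is_correct) pairs.
    Items released on or before the cutoff are labeled possibly contaminated.
    """
    buckets = {"possibly_contaminated": [], "clean": []}
    for release_date, is_correct in items:
        key = "clean" if release_date > training_cutoff else "possibly_contaminated"
        buckets[key].append(bool(is_correct))

    report = {}
    for key, outcomes in buckets.items():
        report[key] = {
            "n": len(outcomes),
            # None (not 0.0) when a bucket is empty, so absence is visible.
            "accuracy": sum(outcomes) / len(outcomes) if outcomes else None,
        }
    return report


# Made-up items and cutoff, for illustration only.
cutoff = date(2024, 1, 1)
items = [
    (date(2023, 5, 1), True),
    (date(2023, 6, 1), True),
    (date(2023, 7, 1), True),
    (date(2023, 8, 1), False),
    (date(2024, 3, 1), True),
    (date(2024, 4, 1), False),
]

print(split_accuracy(items, cutoff))
```

Publishing both buckets with their item counts lets a reader see whether a headline accuracy is carried by items the model could have seen during training.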

The through-line you might not have expected: comparability is a property of the rubric and the protocol, not of the score. The work on rubrics shows why. Using a rubric as a hard accept/reject gate behaves very differently from converting the same rubric into a dense numeric reward, and conflating the two invites reward hacking [Can rubrics and dense rewards work together without hacking?]. The lesson generalizes to evaluation reporting: two benchmarks can use "the same" rubric and still be incomparable if one gates and the other scores. The standards that would make interactive scores comparable, then, are mostly about disclosure (the axes measured, the judge and its drift, the contamination status, and exactly how the rubric was applied) rather than about everyone converging on one magic number.
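The gap between gating and scoring is easy to see on a toy example. The sketch below applies one rubric to the same two responses in both ways. It is an illustration of the reporting point only, not the DRO method from the note; the function names and the per-criterion outcomes are invented.

```python
def gate_pass_rate(results):
    """Rubric as a hard gate: a response counts only if every criterion passes."""
    return sum(all(criteria) for criteria in results) / len(results)


def dense_mean_score(results):
    """Rubric as a dense score: each response earns the fraction of criteria met."""
    return sum(sum(criteria) / len(criteria) for criteria in results) / len(results)


# Hypothetical per-criterion outcomes for two responses under a 4-item rubric.
results = [
    [True, True, True, False],  # misses one criterion
    [True, True, True, True],   # meets all four
]

print(gate_pass_rate(results))    # 0.5
print(dense_mean_score(results))  # 0.875
```

Same rubric, same responses, and the reported number moves from 0.5 to 0.875 depending on how the rubric was applied, which is why the application mode belongs in the report.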

Sources (7 notes)

Do interactive evaluations actually solve the benchmark comparison problem?

Interactive evaluation relocates core problems—comparability, reproducibility, evidence-to-judgment mapping—into higher-dimensional space rather than solving them. The field needs design protocols and shared standards, not format adoption, to make trajectory scoring interpretable.

Should interactive evaluation be designed as a unified paradigm?

Interactive evaluation should be treated as a principled paradigm with explicit protocols and reporting standards, not adopted as disconnected benchmarks. The distinction matters: designing interactive evaluation as a unified system prevents fragmentation and incomparability, while expanding what counts as evidence beyond final responses.

Does a single benchmark score actually predict agent readiness?

Capability decomposes into task success, privacy compliance, long-horizon retention, mode-shift behavior, and ecosystem readiness. Models ranked highest on one axis often rank lower on others, making single-score evaluations systematically misleading for real deployment.

What should we actually measure in agent evaluation?

Single-score evaluation collapses multi-dimensional agent behavior and creates false confidence in deployment readiness. Research shows agents need benchmarks for trajectory quality, memory hygiene, context efficiency, and verification cost to reflect actual system performance.

Can agents evaluate AI outputs more reliably than language models?

Eight-module agentic evaluation achieved 0.27% judge shift versus 31% for LLM-as-a-Judge on complex tasks. However, the memory module cascaded errors, revealing that agentic systems need error isolation mechanisms to maintain gains.

Does RLVR success on math benchmarks reflect genuine reasoning improvement?

Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.

Can rubrics and dense rewards work together without hacking?

DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.
