Do automated benchmarks hide what frontier AI systems can really do?
Benchmarks optimize for auto-gradable, short, cheap tasks. But real AI capability emerges in long-horizon, messy, open-ended work. How much capability are we missing—or wrongly inflating—by relying on benchmark scores alone?
Benchmark-based evaluation underpins public discussion of AI progress, but it has a structural bias: constructing a benchmark requires tasks that are precisely specified, automatically verifiable, relatively easy to optimize for, and run with low budgets over short horizons. That selection both overstates capability (optimizable, gradable tasks flatter models) and understates it (real tasks that don't fit the mold go unmeasured). Decisions about funding, regulation, and safety are increasingly made on these measurements.
The proposed complement is open-world evaluation: long-horizon, messy, real-world tasks assessed through small-sample qualitative analysis rather than benchmark-scale automation. The concrete instance is an AI agent tasked with developing and publishing an iOS app to the App Store. It completed the task with a single manual intervention, and that intervention was itself unnecessary. The result suggests open-world evals can give early warning of capabilities that are about to become widespread.
Two methodological practices from the example generalize beyond it. First, invest in log analysis: agent logs contain far more than a binary outcome. They show how the agent decomposes problems, recovers from failure, explores the solution space, and sometimes misrepresents its own progress, none of which is recoverable from aggregate scores. Second, report cost as a first-class quantity: capability scales with budget, so a score reported without its cost is uninterpretable. This note sits alongside "Does a single benchmark score actually predict agent readiness?" and "Should interactive evaluation be designed as a unified paradigm?" as part of a broader argument that aggregate benchmark numbers are the wrong instrument for frontier agents.
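As a rough illustration of the two practices (cost reported next to outcome, and log observations kept as structured annotations), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `RunRecord` fields, the `success_by_budget` and `tally_log_annotations` helpers, and the annotation labels are hypothetical and do not come from the note or from any evaluation framework.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """One agent run. All field names are hypothetical."""
    task: str
    succeeded: bool
    cost_usd: float            # total spend for the run
    manual_interventions: int  # human nudges needed along the way


def success_by_budget(runs, budget_caps):
    """Map each budget cap to the fraction of all runs that succeeded
    while spending no more than that cap."""
    if not runs:
        return {cap: 0.0 for cap in budget_caps}
    return {
        cap: sum(r.succeeded and r.cost_usd <= cap for r in runs) / len(runs)
        for cap in sorted(budget_caps)
    }


def tally_log_annotations(annotations):
    """Count reviewer-assigned labels over (step, label) pairs.

    The labels are assumed to come from a human reading the agent log,
    e.g. "decompose", "recover", "explore", "misreport".
    """
    return Counter(label for _step, label in annotations)
```

Reporting the output of `success_by_budget` as a curve over several caps, rather than a single success rate, keeps the score tied to the budget that produced it. The annotation tally is only a summary aid; the qualitative reading of the log remains the primary evidence.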
Inquiring lines that use this note as a source (9)
This note is a source for these synthesized inquiries.
- Why do estimates for task-level performance differ so much from full job automation timelines?
- Why do AI benchmarks show rapid saturation from near-zero to near-perfect?
- What capability dimension does a closed-ended exam actually fail to measure?
- How do open-world evaluations correct distortions that automated benchmarks introduce?
- Can a single axis benchmark ever represent deployment readiness accurately?
- Why do static benchmarks miss frontier capabilities that open-world tasks reveal?
- Where do frontier AI models already exceed safety thresholds in capability areas?
- How should evaluation frameworks account for the computational cost of frontier AI capability?
- What real-world tasks most clearly expose gaps between benchmark performance and actual capability?
Related concepts in this collection (3)
- Does a single benchmark score actually predict agent readiness?
  Single-axis benchmarks rank models by one capability, such as task success, but ignore privacy, duration, operating mode, and ecosystem fit. Can one number really capture what matters for deployment?
  both reject the single-number benchmark for frontier agents
- Should interactive evaluation be designed as a unified paradigm?
  As AI systems increasingly interact over time with tools and environments, evaluation practice must evolve. Should interactive evaluation be treated as a principled design science with shared protocols, or adopted incrementally as new benchmarks?
  open-world evals are a sibling paradigm with explicit reporting norms
- Can frontier exams really measure cutting-edge AI capability?
  Popular benchmarks like MMLU saturate quickly, hiding real capability differences. Can expert-designed closed-ended exams like Humanity's Last Exam discriminate at the frontier, and what would high scores actually tell us about AI systems?
  the other half: open-world evals address the messy side, frontier exams address the saturation side
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Open-World Evaluations for Measuring Frontier AI Capabilities
- FormulaOne: Measuring the Depth of Algorithmic Reasoning Beyond Competitive Programming
- Interactive Evaluation Requires a Design Science
- GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks
- LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries
- Automated Alignment Researchers: Using large language models to scale scalable oversight
- TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
- On the Reasoning Capacity of AI Models and How to Quantify It
Original note title
open-world evaluations of messy long-horizon real tasks correct the distortions automated benchmarks introduce