Open-World Evaluations for Measuring Frontier AI Capabilities

Paper · arXiv 2605.20520 · Published May 19, 2026
LLM Evaluations and Benchmarks

Benchmark-based evaluation remains important for tracking frontier AI progress. But we argue that it can both overstate and understate real-world capability because it privileges tasks that are precisely specified, automatically graded, relatively easy to optimize for, and run with low budgets and short time horizons. We advocate for a complementary class of evaluations, which we term open-world evaluations: long-horizon, messy, real-world tasks assessed through small-sample qualitative analysis rather than benchmark-scale automation. We survey recent open-world evaluations, identify their strengths and limitations, and introduce CRUX (Collaborative Research for Updating AI eXpectations), a project for conducting such evaluations on a regular basis. As a first instance, we task an AI agent with developing and publishing a simple iOS application to the Apple App Store. The agent completed the task with a single unnecessary manual intervention, suggesting that open-world evaluations can provide early warning of capabilities that may soon become widespread. We conclude with recommendations for designing and reporting open-world evaluations.

Introduction. Tracking and predicting the capabilities of frontier AI systems is an open methodological problem, and one whose stakes rise as these systems are deployed in increasingly consequential settings. The predominant approach is benchmarking, in which agents are scored on large suites of automatically graded tasks. Benchmark-based evaluations underpin much of the public discussion about AI progress. For example, METR’s time horizon graph [1] has been widely cited by policy analysts, industry leaders, and safety-focused organizations as evidence of rapid capability growth, and has been used to argue both for accelerated deployment and for tighter oversight. Decisions about funding, regulation, and safety investment are increasingly being made on the basis of such measurements. Benchmarks, however, can both overestimate and underestimate progress. Constructing a benchmark requires tasks that are precisely specified and automatically verifiable.

Discussion / Conclusion. Drawing on the CRUX #1 experiment and related efforts surveyed in Section 2.3, we identify a set of methodological practices that we believe make open-world evaluations more informative. These practices are preliminary and will evolve as the literature matures, but we present them here to support the development of shared evaluation norms in this emerging area.

Invest in log analysis. Agent logs contain substantially more information than a binary outcome, and qualitative analysis of those logs can reveal how agents decompose problems, recover from failures, explore solution spaces, and, in some cases, misrepresent their own progress. Such information is not recoverable from aggregate scores alone, and we consider its systematic analysis to be a defining feature of open-world evaluation.

Measure and report cost. Agent capability on many real-world tasks continues to scale with budget. As a result, cost should be reported as a first-class quantity alongside capability outcomes.
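
As an illustration of the cost-reporting recommendation, the following is a minimal sketch of how a run's API cost might be computed from per-step token counts and reported next to the outcome. It is not taken from the paper: the `StepUsage` log schema, the `run_cost_usd` and `report` helpers, the report fields, and the per-million-token prices are all hypothetical and would need to be replaced with the actual log format and provider pricing.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class StepUsage:
    """Token usage for one agent step (hypothetical log schema)."""
    input_tokens: int
    output_tokens: int


def run_cost_usd(steps: Iterable[StepUsage],
                 usd_per_m_input: float,
                 usd_per_m_output: float) -> float:
    """Sum the API cost of a run in US dollars.

    Prices are given per million tokens and are assumptions supplied
    by the caller, not values from the paper.
    """
    steps = list(steps)
    total_in = sum(s.input_tokens for s in steps)
    total_out = sum(s.output_tokens for s in steps)
    return (total_in * usd_per_m_input + total_out * usd_per_m_output) / 1_000_000


def report(task: str, succeeded: bool, interventions: int,
           steps: Iterable[StepUsage],
           usd_per_m_input: float, usd_per_m_output: float) -> dict:
    """Bundle the outcome and the cost into a single record."""
    return {
        "task": task,
        "succeeded": succeeded,
        "manual_interventions": interventions,
        "cost_usd": round(run_cost_usd(steps, usd_per_m_input, usd_per_m_output), 2),
    }


if __name__ == "__main__":
    # Illustrative numbers only; they do not describe any real run.
    example_steps = [StepUsage(1_000_000, 200_000), StepUsage(500_000, 100_000)]
    print(report("example task", True, 1, example_steps, 2.0, 10.0))
```

Keeping cost in the same record as the outcome makes it harder to quote a success without the budget that produced it. A fuller version might also track wall-clock time and human-intervention time.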