SYNTHESIS NOTE
Reasoning, Retrieval, and Evaluation Training, RL, and Test-Time Scaling Agentic Systems and Tool Use

Can frontier exams really measure cutting-edge AI capability?

Popular benchmarks like MMLU saturate quickly, hiding real capability differences. Can expert-designed closed-ended exams like Humanity's Last Exam discriminate at the frontier, and what would high scores actually tell us about AI systems?

Synthesis note · 2026-06-03 · sourced from Evaluations

When models exceed 90% on popular benchmarks like MMLU, those benchmarks stop measuring anything at the frontier — the ceiling compresses real capability differences into noise. Humanity's Last Exam (HLE) is the explicit response: 3,000 questions across dozens of subjects, built by subject-matter experts, each with an unambiguous verifiable solution that cannot be answered by quick internet retrieval. SOTA models show low accuracy and poor calibration on it, exposing a real gap to the expert human frontier.

Two qualifications make this more than a "harder benchmark" announcement. First, the authors expect rapid saturation again — benchmark history shows models leaping from near-zero to near-perfect quickly, so they anticipate >50% accuracy within a year. Difficulty buys discrimination only temporarily. Second, and more durably interesting: high HLE accuracy would demonstrate expert-level closed-ended knowledge and reasoning, but would not indicate autonomous research or creative open-ended problem-solving. HLE measures structured academic problems, not the open-world capability that actually matters for deployment.

So HLE is best read as the closed-ended counterpart to open-world evaluation. Since Do automated benchmarks hide what frontier AI systems can really do?, the two together bracket the measurement problem: frontier exams restore discrimination on verifiable knowledge, open-world evals capture the messy long-horizon capability that no closed exam can. Neither alone is sufficient, and treating a high exam score as evidence of general capability is exactly the inference the paper warns against.

Inquiring lines that use this note as a source 5

This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.

Related concepts in this collection 3

This note in its neighbourhood — explore the map, then jump to a related concept in the list below.

Concept map
13 direct connections · 117 in 2-hop network ·dense cluster Open in graph ↗

Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph

your link semantically near linked from elsewhere

Related papers in this collection 8

Papers most semantically related to this note, ranked by cosine similarity in the embedding space.

Original note title

benchmark saturation hides frontier capability — only expert-frontier closed-ended exams discriminate yet even they miss autonomous-research ability