Can live benchmarks prevent contamination in prediction tasks?
Real-time benchmarks that continuously gather questions and verify outcomes could solve the data contamination problem in forecasting evaluation. This matters because leaked training data makes it impossible to know if models truly predict or merely retrieve memorized answers.
Future prediction is a hard agent task: it combines analytical thinking, information gathering, and decision-making under uncertainty. Until FutureX there was no large-scale benchmark for it, largely because real-time updates and timely answer retrieval are hard to operate. FutureX's key design choice is that it is live: it continuously collects questions from 195 trusted sites, gathers model predictions at each event's start date, and automatically checks the actual outcomes once they occur. Being live is not a convenience; it is the contamination defense. A benchmark whose answers do not exist yet cannot leak into training data.
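The collect, predict, resolve cycle described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not FutureX's actual implementation: the `Question` class, the function names, and the exact-match scoring are all hypothetical. The point it shows is the ordering constraint: a prediction is accepted only while the outcome is still unknown.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Question:
    """Hypothetical record for a question whose answer does not exist yet."""
    text: str
    start_date: date        # predictions are gathered from this date
    resolution_date: date   # the outcome becomes known on this date
    outcome: Optional[str] = None
    predictions: dict = field(default_factory=dict)  # model name -> answer


def record_prediction(q, model, answer, today):
    """Accept a prediction only inside the window where the outcome is unknown."""
    if today < q.start_date:
        raise ValueError("question is not open for predictions yet")
    if today >= q.resolution_date or q.outcome is not None:
        raise ValueError("outcome may already be known; prediction rejected")
    q.predictions[model] = answer


def resolve(q, outcome, today):
    """Attach the real-world outcome once the resolution date has arrived."""
    if today < q.resolution_date:
        raise ValueError("event has not resolved yet")
    q.outcome = outcome


def accuracy(questions, model):
    """Exact-match accuracy over resolved questions the model answered (assumed metric)."""
    scored = [q for q in questions
              if q.outcome is not None and model in q.predictions]
    if not scored:
        return None
    return sum(q.predictions[model] == q.outcome for q in scored) / len(scored)


# Usage: predict on the start date, score after the event resolves.
q = Question("Which team wins the final?", date(2030, 1, 1), date(2030, 1, 8))
record_prediction(q, "agent-a", "Team X", today=date(2030, 1, 1))
resolve(q, "Team X", today=date(2030, 1, 8))
print(accuracy([q], "agent-a"))
```

Because `record_prediction` refuses anything submitted on or after the resolution date, no answer key exists at prediction time, which is the property that keeps the benchmark out of any training corpus.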
The capability finding across 25 models is equally clean: strong base models (e.g., DouBao-Seed1.6) handle straightforward questions, but hard open-ended prediction requires built-in search and reasoning, and deep-research and Think&Search agents (Grok-4, GPT-o4-mini) lead on the hardest tasks. Hard open-ended forecasting is therefore an agentic capability, not a base-model one.
This pairs directly with Batch 1's evaluation thread. Given "Do automated benchmarks hide what frontier AI systems can really do?", FutureX is a concrete open-world instrument whose live-updating mechanism operationalizes the contamination-free, real-task ideal. It also complements "Can frontier exams really measure cutting-edge AI capability?": where HLE restores discrimination on static knowledge, FutureX restores it on dynamic prediction. And it grounds "Can LLMs actually forecast time series better than we think?": in both cases the gain comes from the search-and-reason workflow rather than the base model.
Related concepts in this collection 3
- Do automated benchmarks hide what frontier AI systems can really do?
  Benchmarks optimize for auto-gradable, short, cheap tasks. But real AI capability emerges in long-horizon, messy, open-ended work. How much capability are we missing, or wrongly inflating, by relying on benchmark scores alone?
  Relation: FutureX is a live, contamination-free instance of the open-world evaluation ideal.
- Can frontier exams really measure cutting-edge AI capability?
  Popular benchmarks like MMLU saturate quickly, hiding real capability differences. Can expert-designed closed-ended exams like Humanity's Last Exam discriminate at the frontier, and what would high scores actually tell us about AI systems?
  Relation: complementary; static frontier exams versus dynamic prediction.
- Can LLMs actually forecast time series better than we think?
  Explores whether language models possess stronger forecasting ability than current benchmarks suggest, and what role workflow design plays in revealing or hiding that capability.
  Relation: both find that forecasting gains come from the agentic workflow, not the base model.
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction
- Survey on Evaluation of LLM-based Agents
- Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models
- Nexus: An Agentic Framework for Time Series Forecasting
- An Investigation of Robustness of LLMs in Mathematical Reasoning: Benchmarking with Mathematically-Equivalent Transformation of Advanced Mathematical Problems
- Approaching Human-Level Forecasting with Language Models
- Interactive Evaluation Requires a Design Science
- KellyBench: Can Language Models Beat the Market?
Original note title
future-prediction benchmarks must be live and contamination-free and open-ended forecasting requires search-and-reasoning agents not base models