SYNTHESIS NOTE
Reasoning, Retrieval, and Evaluation · Training, RL, and Test-Time Scaling · Model Architecture and Internals

Can retrieval-augmented language models forecast like human experts?

Can language models augmented with search and reasoning match or exceed the forecasting accuracy of competitive human crowd forecasters on events beyond their training data? This tests whether AI can handle genuine predictive judgment.

Synthesis note · 2026-06-03 · sourced from Reasoning Logic Internal Rules

Judgmental forecasting means assigning probabilities to future events from judgment, domain knowledge, and reasoning under distributional shift. It is where humans have historically beaten statistical models, and where competitive forecasters set a high bar. This work builds a retrieval-augmented LM system that searches for relevant information, generates forecasts, and aggregates predictions. The system is evaluated on a large dataset of questions from real forecasting competitions and tested only on questions published after the models' knowledge cutoffs, so the answers cannot be memorized. The result: on average the system nears the crowd aggregate of competitive forecasters, and in some settings surpasses it. This makes it the first ML system to forecast at near-human levels. Two design pieces matter: a novel LM-driven retrieval mechanism that decides what to source and how to evaluate relevance, and a self-supervised finetuning method that trains the model to generate reasoning that leads to accurate predictions.
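The aggregate-then-compare step described above can be sketched in a few lines. The note does not say which aggregation rule or scoring metric the work uses, so this sketch assumes a trimmed mean over sampled forecasts and the Brier score for binary questions. All names and numbers here (`trimmed_mean`, `brier_score`, the `questions` records) are hypothetical illustrations, not results from the work.

```python
from statistics import mean


def trimmed_mean(probs, trim=1):
    """Average the forecasts after dropping the `trim` lowest and highest values."""
    ordered = sorted(probs)
    # Only trim when enough samples remain to average afterwards.
    if len(ordered) > 2 * trim:
        ordered = ordered[trim:len(ordered) - trim]
    return mean(ordered)


def brier_score(prob, outcome):
    """Squared error between a forecast probability and a binary outcome (0 or 1)."""
    return (prob - outcome) ** 2


# Hypothetical resolved questions: several sampled system forecasts,
# the human crowd aggregate, and the realized outcome.
questions = [
    {"samples": [0.70, 0.80, 0.75, 0.20], "crowd": 0.80, "outcome": 1},
    {"samples": [0.10, 0.30, 0.20, 0.90], "crowd": 0.30, "outcome": 0},
]

# One aggregated system forecast per question.
system_forecasts = [trimmed_mean(q["samples"]) for q in questions]

# Mean Brier score for each forecaster; lower is better.
system_brier = mean(
    brier_score(p, q["outcome"]) for p, q in zip(system_forecasts, questions)
)
crowd_brier = mean(brier_score(q["crowd"], q["outcome"]) for q in questions)

print(f"system: {system_brier:.4f}  crowd: {crowd_brier:.4f}")
```

Under these assumptions, "nearing the crowd aggregate" means the system's mean score comes close to the crowd's. Trimming is one way to keep a single outlier sample from dominating the aggregated forecast.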

The takeaway is twofold: scalable near-human forecasting is now feasible, and newer model generations forecast better on their own. Capability rises with the base model without forecasting-specific tricks.

This is the foundational human-level-forecasting result that the vault's forecasting cluster builds on. It underpins "Can language models beat human venture capital experts?" (VCBench) and "Can LLMs actually forecast time series better than we think?", and pairs with the contamination defense in "Can live benchmarks prevent contamination in prediction tasks?": post-cutoff test sets apply the same leak-proofing principle.
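The post-cutoff principle amounts to a date filter on the question pool. The sketch below assumes each question record carries a publication date and that the model's knowledge cutoff is known. The field names, dates, and the helper `post_cutoff_questions` are hypothetical.

```python
from datetime import date


def post_cutoff_questions(questions, knowledge_cutoff):
    """Keep only questions published strictly after the model's knowledge cutoff."""
    return [q for q in questions if q["published"] > knowledge_cutoff]


# Hypothetical question pool with publication dates.
questions = [
    {"id": "q1", "published": date(2023, 3, 1)},
    {"id": "q2", "published": date(2023, 9, 15)},
    {"id": "q3", "published": date(2024, 1, 10)},
]

# Assumed cutoff for illustration only.
cutoff = date(2023, 6, 1)

test_set = post_cutoff_questions(questions, cutoff)
test_ids = [q["id"] for q in test_set]
print(test_ids)
```

When several models are evaluated together, a conservative choice is to filter against the latest cutoff among them, so that no model could have seen any test question.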

Original note title

a retrieval-augmented LM system forecasts future events near the level of competitive human crowd forecasters