Can language models match competitive crowd forecasters on real future events?

This note asks whether language models can match the accuracy of skilled human forecasting crowds on genuinely unknown future events: not hindsight tasks, but questions whose answers had not yet been determined when the model was trained. The most direct evidence says yes, nearly: a retrieval-augmented LM system reached near-parity with competitive human forecasters on real questions published *after* its training cutoff, and sometimes beat the crowd outright (see "Can retrieval-augmented language models forecast like human experts?"). The key detail is not the headline result but that newer model generations improved at forecasting with no domain-specific tuning, which suggests the capability rides along with general model improvement rather than requiring a bespoke forecasting model.
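The "published after its training cutoff" condition is what makes the evaluation a test of forecasting rather than recall. A minimal sketch of that filter follows; the `Question` record, its field names, and the dates are hypothetical illustrations, not the cited system's actual data model.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Question:
    """Hypothetical record for one forecasting question."""
    text: str
    published: date


def leakage_free(questions, training_cutoff):
    """Keep only questions published strictly after the model's training cutoff."""
    return [q for q in questions if q.published > training_cutoff]


# Illustrative data only; these are not real benchmark questions.
questions = [
    Question("Will event A occur by year end?", date(2023, 3, 1)),
    Question("Will event B occur by year end?", date(2023, 9, 15)),
    Question("Will event C occur by year end?", date(2024, 1, 10)),
]
cutoff = date(2023, 6, 30)  # assumed training cutoff, for illustration

eligible = leakage_free(questions, cutoff)
print([q.text for q in eligible])
```

A stricter protocol would also check that any retrieved documents predate the question's resolution; that step is omitted here for brevity.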

But the corpus pushes back on reading that as raw model muscle. Several notes argue that the *workflow* matters more than the model. LLMs forecast far better than people realize, but only when the pipeline separates numerical reasoning from contextual reasoning: monolithic prompting hides the ability that structured decomposition reveals (see "Can LLMs actually forecast time series better than we think?"). A multi-stage system that splits forecasting into contextualization, a macro/micro outlook, and synthesis beat both pure time-series models and plain LLMs (see "Can decomposing forecasting into stages unlock numerical and contextual reasoning?"). So "can a language model match the crowd" is really "can the right scaffolding around a language model match the crowd", and the crowd itself is a kind of scaffolding: it aggregates many noisy human guesses into one good one.
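To make the "crowd as scaffolding" point concrete, here is a minimal sketch of aggregating noisy individual forecasts and scoring them with the Brier score (squared error against the 0/1 outcome, lower is better). The data are invented for illustration, and the median is an assumed aggregation rule; the notes do not say how the human crowd in the cited study was aggregated.

```python
from statistics import mean, median


def brier_score(prob, outcome):
    """Squared error between a probability forecast and a 0/1 outcome."""
    return (prob - outcome) ** 2


# Hypothetical data: five forecasters' probabilities on three resolved questions.
forecasts = [
    [0.9, 0.6, 0.7, 0.4, 0.8],  # question 1, resolved YES
    [0.2, 0.5, 0.1, 0.3, 0.6],  # question 2, resolved NO
    [0.7, 0.3, 0.8, 0.6, 0.9],  # question 3, resolved YES
]
outcomes = [1, 0, 1]

# Crowd forecast: median of the individual probabilities (an assumed rule).
crowd_brier = mean(
    brier_score(median(probs), outcome)
    for probs, outcome in zip(forecasts, outcomes)
)

# Each forecaster's own mean Brier score across the three questions.
individual_brier = [
    mean(brier_score(forecasts[q][i], outcomes[q]) for q in range(len(outcomes)))
    for i in range(len(forecasts[0]))
]
avg_individual_brier = mean(individual_brier)

print(f"crowd: {crowd_brier:.3f}, average individual: {avg_individual_brier:.3f}")
```

In this toy data the aggregated forecast scores better than the average individual, although two individuals still beat it. That is the sense in which a crowd is a pipeline rather than a single forecaster.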

Where the human bar is low, the model clears it easily. LLMs beat human venture capitalists at predicting founder success: in sparse-signal domains where even experts barely outperform chance, raw capability suffices (see "Can language models beat human venture capital experts?"). They also out-predict neuroscience experts on which experimental results actually occurred. Intriguingly, the same pattern-blending tendency that produces hallucination on backward-looking lookup tasks becomes genuine *generalization* when the task points forward (see "Can LLMs predict novel scientific results better than experts?"). That reframes hallucination as a forecasting feature rather than only a bug.

The quieter lesson is about calibration. Matching the crowd is not only about being right more often; it is also about knowing when you don't know. Small models trained with uncertainty-aware objectives and the ability to abstain matched models ten times larger on conversation forecasting, which suggests that calibration ability exists in LLMs but is undertrained by default (see "Can models learn to abstain when uncertain about predictions?"). Human forecasting crowds are good partly because they are well-calibrated; an LLM that hedges appropriately is competing on the crowd's own terms. There are also hard ceilings worth respecting: on genuine constrained-optimization problems, models plateau at roughly 55–60% constraint satisfaction regardless of architecture, parameter count, or training regime (see "Do larger language models solve constrained optimization better?"). That is a reminder that "predicting the future" splits into tasks LLMs are quietly excellent at and tasks no amount of scaling has cracked.
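The value of abstention can be sketched as a trade-off between accuracy and coverage (the fraction of questions answered). The example below is an evaluation-side illustration with invented numbers and an assumed confidence threshold; it is not the uncertainty-aware training objective the cited note describes.

```python
def selective_accuracy(probs, outcomes, threshold):
    """Return (accuracy, coverage) when abstaining below a confidence threshold.

    Confidence is max(p, 1 - p); the forecaster answers only when it reaches
    the threshold. Accuracy is None if every question is skipped.
    """
    answered = [
        (p, o) for p, o in zip(probs, outcomes)
        if max(p, 1 - p) >= threshold
    ]
    coverage = len(answered) / len(probs)
    if not answered:
        return None, coverage
    correct = sum((p >= 0.5) == bool(o) for p, o in answered)
    return correct / len(answered), coverage


# Hypothetical forecasts (probability of YES) and resolved 0/1 outcomes.
probs = [0.95, 0.55, 0.10, 0.48, 0.80, 0.60]
outcomes = [1, 0, 0, 1, 1, 0]

acc_all, cov_all = selective_accuracy(probs, outcomes, threshold=0.5)
acc_sel, cov_sel = selective_accuracy(probs, outcomes, threshold=0.75)

print(f"answer everything: accuracy={acc_all}, coverage={cov_all}")
print(f"abstain when unsure: accuracy={acc_sel}, coverage={cov_sel}")
```

In this toy data, answering everything gives 50% accuracy, while abstaining on the three low-confidence questions gives 100% accuracy at 50% coverage. A well-calibrated forecaster is one for whom this trade-off is favorable: its low-confidence answers really are the ones most likely to be wrong.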

One boundary worth carrying away: AI can predict social norms with superhuman accuracy yet cannot participate in *making* them (see "Can AI predict social norms better than humans?"). Forecasting is pattern-extrapolation from the outside, and that is exactly the move LLMs are good at. It is also why the crowd, made of people living inside the events they predict, remains a different kind of forecaster even when the scoreboard says it tied.

Sources 8 notes

Can retrieval-augmented language models forecast like human experts?

A retrieval-augmented LM system achieved near-parity with competitive human forecasters on real forecasting questions published after model training cutoffs, sometimes surpassing human crowds. Newer model generations naturally improved forecasting without domain-specific tuning.

Can LLMs actually forecast time series better than we think?

LLMs have stronger intrinsic forecasting ability than recognized, but only when workflows separate numerical reasoning from contextual reasoning. Monolithic prompting obscures this capability; structured decomposition surfaces it.

Can decomposing forecasting into stages unlock numerical and contextual reasoning?

Nexus outperforms pure TSFM and LLM baselines on real-world datasets by decomposing forecasting into contextualization, dual-resolution macro/micro outlook, and synthesis stages. Separating numerical extrapolation from event-driven contextual reasoning avoids forcing one model to handle both simultaneously.

Can language models beat human venture capital experts?

VCBench shows several LLMs exceed human baselines in founder-success prediction, with DeepSeek-V3 achieving 6× market-index precision. In sparse-signal forecasting where experts only modestly beat chance, even raw LLM capability suffices to clear the human bar.

Can LLMs predict novel scientific results better than experts?

BrainBench benchmarks show fine-tuned LLMs outperform neuroscience experts at predicting which experimental results actually occurred. The same pattern-integration tendency that causes hallucination in retrieval tasks enables genuine prediction in forward-looking scenarios.

Can models learn to abstain when uncertain about predictions?

Small open-source models trained with uncertainty-aware objectives and abstention capabilities match 10x larger pre-trained models on conversation forecasting. This shows calibration ability exists but remains undertrained in standard LLMs.

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.

Can AI predict social norms better than humans?

GPT-4.5 outperforms all individual humans at predicting social appropriateness, yet structurally cannot enter the community processes that establish and validate norms. This reveals a critical gap between pattern-matching and authentic participation in knowledge-making.
