Do newer language model generations improve forecasting ability without additional training?

This explores whether forecasting skill rides along for free as labs ship newer base models — and the corpus says yes, but with a twist that's more interesting than the question expects.

The clearest direct evidence comes from a retrieval-augmented forecasting system that reached near-parity with competitive human forecasters on real questions published *after* the models' training cutoffs, sometimes beating the crowd. In that system, newer model generations improved forecasting accuracy without any domain-specific tuning (Can retrieval-augmented language models forecast like human experts?). So generational lift is real: the same scaffold gets sharper as the underlying model gets better.
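To make "near-parity with the crowd" concrete, here is a minimal sketch of one common way to score probabilistic forecasts, the Brier score. The cited note does not say which metric was used, so the metric choice is an assumption, and every number below is hypothetical.

```python
from typing import Sequence


def brier_score(probabilities: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    Lower is better; always guessing 0.5 scores 0.25.
    """
    if len(probabilities) != len(outcomes):
        raise ValueError("probabilities and outcomes must have the same length")
    if not probabilities:
        raise ValueError("need at least one forecast")
    total = sum((p - o) ** 2 for p, o in zip(probabilities, outcomes))
    return total / len(probabilities)


# Hypothetical data for illustration only, not results from the cited work.
outcomes = [1, 0, 1, 1]               # how four questions resolved
model_probs = [0.8, 0.3, 0.6, 0.7]    # probabilities from a model pipeline
crowd_probs = [0.7, 0.2, 0.7, 0.6]    # probabilities from a human crowd

model_brier = brier_score(model_probs, outcomes)
crowd_brier = brier_score(crowd_probs, outcomes)
print(f"model={model_brier:.3f} crowd={crowd_brier:.3f}")
```

With these made-up inputs the two scores come out about equal, which is what parity looks like on this metric. Rerunning the same scoring with a newer model behind the same scaffold is one way to check for generational lift.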

But the corpus keeps pointing past raw model strength toward *how you structure the task*. One line of work finds that LLMs have far stronger intrinsic forecasting ability than is usually recognized, but only when the workflow separates numerical reasoning from contextual reasoning; ask a model to do both at once in a single prompt and the capability stays hidden (Can LLMs actually forecast time series better than we think?). A related system, Nexus, beats both pure time-series foundation models and plain LLM baselines by splitting forecasting into distinct stages: contextualize, then form a macro/micro outlook, then synthesize (Can decomposing forecasting into stages unlock numerical and contextual reasoning?). The implication: a lot of what looks like "this generation can't forecast" is actually "nobody decomposed the task." Workflow design can unlock more than a model upgrade.
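The staged idea can be sketched as a small pipeline in which each model call sees only one kind of input. This is not Nexus's implementation: the prompts, the `Ask` callable, the `staged_forecast` function, and the mapping of "macro" and "micro" to long-run and recent windows are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical interface: any function that maps a prompt to a text reply.
Ask = Callable[[str], str]


@dataclass
class Forecast:
    context: str
    macro: str
    micro: str
    answer: str


def staged_forecast(series: Sequence[float], news: str, ask: Ask) -> Forecast:
    # Stage 1: contextual reasoning only; the raw numbers are withheld.
    context = ask(f"Summarize the events relevant to this forecast:\n{news}")
    # Stage 2: numerical reasoning only, at two resolutions; the news is withheld.
    macro = ask(f"Describe the long-run trend of this series: {list(series)}")
    micro = ask(f"Describe the most recent movement: {list(series[-3:])}")
    # Stage 3: synthesis sees the intermediate outputs, not the raw inputs.
    answer = ask(
        "Combine these into one forecast.\n"
        f"Context: {context}\nMacro: {macro}\nMicro: {micro}"
    )
    return Forecast(context, macro, micro, answer)


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    first_line = prompt.splitlines()[0]
    return f"<reply to: {first_line}>"


result = staged_forecast([10.0, 11.0, 12.5, 12.0], "Central bank cuts rates.", stub_model)
print(result.answer)
```

Swapping `stub_model` for a real client is the only change needed to try the same structure against an actual model. The monolithic alternative would pass `series` and `news` together in a single prompt, which is the setup the notes describe as hiding the capability.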

There's also a domain-dependence the question hides. In fields where human experts only modestly beat chance (venture capital founder-success prediction, for instance), even raw, untuned LLM capability clears the human bar: on VCBench, several LLMs exceed human baselines, with DeepSeek-V3 hitting 6× the market-index precision (Can language models beat human venture capital experts?). And in forward-looking scientific prediction, the very pattern-completion habit that produces hallucination on backward-looking retrieval becomes genuine foresight: on BrainBench, fine-tuned models out-predicted neuroscience experts on which experimental results actually occurred (Can LLMs predict novel scientific results better than experts?). Forecasting, in other words, may be less a special skill and more a reframing of what these models already do.
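A figure like "6× the market-index precision" is a ratio of two precisions. The arithmetic can be sketched as follows; the counts and the 2% baseline are hypothetical and are not VCBench's actual numbers.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of positive predictions that turned out to be correct."""
    predicted = true_positives + false_positives
    if predicted == 0:
        raise ValueError("no positive predictions to score")
    return true_positives / predicted


def precision_lift(model_precision: float, baseline_precision: float) -> float:
    """How many times more precise the model is than the baseline."""
    if baseline_precision <= 0:
        raise ValueError("baseline precision must be positive")
    return model_precision / baseline_precision


# Hypothetical numbers: the baseline "index" succeeds 2% of the time,
# while 6 of the model's 50 picked founders succeed.
baseline = 0.02
model = precision(true_positives=6, false_positives=44)   # 6 / 50 = 0.12
lift = precision_lift(model, baseline)                     # roughly 6x
print(f"precision={model:.2f} lift={lift:.1f}x")
```

When the baseline rate is this low, a large lift can coexist with a low absolute precision, which is why sparse-signal domains make the human bar easier to clear.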

The counterweight is worth knowing before you bet on "just wait for the next model." Scaling isn't a universal solvent: on genuine constrained-optimization tasks, LLMs plateau at roughly 55–60% constraint satisfaction regardless of architecture, parameter count, or training regime, and reasoning models do not systematically do better, which points to a ceiling rather than a scaling gap (Do larger language models solve constrained optimization better?). Prompting and prompt optimization can only reorganize knowledge already in the training distribution; they can't inject what the model never learned (Can prompt optimization teach models knowledge they lack?). And a persistently undertrained dimension is calibration: small open-source models trained with uncertainty-aware objectives to abstain can match pre-trained models 10× their size on conversation forecasting, which suggests standard generational upgrades don't automatically teach a model *when to shut up* (Can models learn to abstain when uncertain about predictions?).
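The effect of abstention can be illustrated with a simple confidence threshold. The cited work trains models with uncertainty-aware objectives; the post-hoc thresholding below is a swapped-in stand-in, not that method, and the function names, the 0.7 threshold, and the data are all hypothetical.

```python
from typing import Optional, Sequence, Tuple


def predict_or_abstain(prob_yes: float, threshold: float = 0.7) -> Optional[bool]:
    """Return a yes/no prediction when confident enough, else None (abstain)."""
    confidence = max(prob_yes, 1.0 - prob_yes)
    if confidence < threshold:
        return None
    return prob_yes >= 0.5


def coverage_and_accuracy(
    probs: Sequence[float], outcomes: Sequence[bool], threshold: float = 0.7
) -> Tuple[float, Optional[float]]:
    """Fraction of questions answered, and accuracy on the answered ones."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and equal length")
    decisions = [predict_or_abstain(p, threshold) for p in probs]
    answered = [(d, o) for d, o in zip(decisions, outcomes) if d is not None]
    coverage = len(answered) / len(probs)
    if not answered:
        return coverage, None  # abstained on everything
    accuracy = sum(d == o for d, o in answered) / len(answered)
    return coverage, accuracy


# Hypothetical forecasts and resolutions, for illustration only.
probs = [0.9, 0.55, 0.2, 0.6]
outcomes = [True, False, False, True]

always_answer = coverage_and_accuracy(probs, outcomes, threshold=0.5)
selective = coverage_and_accuracy(probs, outcomes, threshold=0.7)
print(always_answer, selective)
```

On this toy data, answering everything gives full coverage at lower accuracy, while abstaining on the two low-confidence questions halves coverage and removes the one error. That trade-off is only useful if the stated probabilities are reasonably calibrated, which is the dimension the note describes as undertrained.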

So the honest synthesis is layered: newer generations do improve forecasting for free where the signal is already latent in their training and the task is framed to surface it — but generational lift, task decomposition, and calibration are three separate levers, and the corpus repeatedly finds the second one (how you structure the workflow) doing more work than the first one (which model you loaded).

Sources (8 notes)

Can retrieval-augmented language models forecast like human experts?

A retrieval-augmented LM system achieved near-parity with competitive human forecasters on real forecasting questions published after model training cutoffs, sometimes surpassing human crowds. Newer model generations naturally improved forecasting without domain-specific tuning.

Can LLMs actually forecast time series better than we think?

LLMs have stronger intrinsic forecasting ability than recognized, but only when workflows separate numerical reasoning from contextual reasoning. Monolithic prompting obscures this capability; structured decomposition surfaces it.

Can decomposing forecasting into stages unlock numerical and contextual reasoning?

Nexus outperforms pure TSFM and LLM baselines on real-world datasets by decomposing forecasting into contextualization, dual-resolution macro/micro outlook, and synthesis stages. Separating numerical extrapolation from event-driven contextual reasoning avoids forcing one model to handle both simultaneously.

Can language models beat human venture capital experts?

VCBench shows several LLMs exceed human baselines in founder-success prediction, with DeepSeek-V3 achieving 6× market-index precision. In sparse-signal forecasting where experts only modestly beat chance, even raw LLM capability suffices to clear the human bar.

Can LLMs predict novel scientific results better than experts?

BrainBench benchmarks show fine-tuned LLMs outperform neuroscience experts at predicting which experimental results actually occurred. The same pattern-integration tendency that causes hallucination in retrieval tasks enables genuine prediction in forward-looking scenarios.

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Can models learn to abstain when uncertain about predictions?

Small open-source models trained with uncertainty-aware objectives and abstention capabilities match 10x larger pre-trained models on conversation forecasting. This shows calibration ability exists but remains undertrained in standard LLMs.
