INQUIRING LINE

How much does domain expertise actually improve human forecasting under uncertainty?

This reads the question backwards through the corpus: the library doesn't measure human expertise directly, but it repeatedly tests where machines beat experts — and those results expose exactly how thin the human expert edge gets under genuine uncertainty.


The collection answers this sideways, by showing where machines now match or beat human experts when the future is genuinely uncertain. The pattern is striking: the expert advantage shrinks fastest in precisely the domains where forecasting is hardest. In founder-success and venture prediction, where signal is sparse and experts only modestly beat chance, even an untuned model clears the human bar; on VCBench, DeepSeek-V3 reached six times market-index precision (see "Can language models beat human venture capital experts?"). The lesson isn't that machines are brilliant; it's that human expertise under high uncertainty was never as decisive as its credentials imply.
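To make "six times market-index precision" concrete, here is a minimal sketch of the arithmetic. The counts are hypothetical and chosen only to produce a 6× ratio; they are not VCBench data, and the function names are illustrative.

```python
def precision(true_positives: int, predicted_positives: int) -> float:
    """Fraction of picked founders who actually succeeded."""
    if predicted_positives <= 0:
        raise ValueError("need at least one positive prediction")
    return true_positives / predicted_positives


def precision_lift(model_precision: float, baseline_precision: float) -> float:
    """How many times more precise the model is than the baseline."""
    if baseline_precision <= 0:
        raise ValueError("baseline precision must be positive")
    return model_precision / baseline_precision


# Hypothetical numbers, for illustration only.
index_precision = precision(true_positives=2, predicted_positives=100)   # 0.02
model_precision = precision(true_positives=12, predicted_positives=100)  # 0.12
lift = precision_lift(model_precision, index_precision)
print(f"lift over index: {lift:.1f}x")
```

In this made-up case a 6× lift still means 88 of 100 picks fail, which is a reminder of how low the bar sits when the base rate is this small.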

Where does that leave the expert? The corpus suggests the human edge is real but conditional: it lives in pattern integration, not raw recall. On BrainBench, fine-tuned models out-predict neuroscience experts on which experimental results actually occurred, and the very tendency that makes them hallucinate on backward-looking lookups becomes genuine foresight on forward-looking ones (see "Can LLMs predict novel scientific results better than experts?"). Forecasting rewards a willingness to integrate scattered cues into a guess, and that's a different muscle from knowing the literature cold. A retrieval-augmented system reached near-parity with competitive human forecasters on real questions published after its training cutoff, sometimes beating the human crowd outright (see "Can retrieval-augmented language models forecast like human experts?"), which suggests that much of what we call expert judgment is recoverable from good evidence-gathering plus calibrated aggregation.
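As a rough illustration of what aggregation can mean in practice, the sketch below pools several probability forecasts by averaging them in log-odds space. This is one common pooling choice, assumed here for illustration; the source notes do not say which aggregation rule the cited system uses, and the function names and sample forecasts are hypothetical.

```python
import math


def logit(p: float) -> float:
    """Convert a probability to log-odds."""
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    """Convert log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def aggregate_log_odds(probabilities: list[float], eps: float = 1e-6) -> float:
    """Pool forecasts by taking the mean of their log-odds."""
    if not probabilities:
        raise ValueError("need at least one forecast")
    # Clip so that forecasts of exactly 0 or 1 do not produce infinite log-odds.
    clipped = [min(max(p, eps), 1.0 - eps) for p in probabilities]
    mean_log_odds = sum(logit(p) for p in clipped) / len(clipped)
    return sigmoid(mean_log_odds)


# Hypothetical forecasts for a single yes/no question.
forecasts = [0.9, 0.9, 0.6]
pooled = aggregate_log_odds(forecasts)
print(f"pooled probability: {pooled:.3f}")
```

Compared with a plain arithmetic mean, log-odds pooling gives confident forecasts more pull, so with these inputs the pooled value lands above the simple average of 0.8. Whether that is desirable depends on how well calibrated the individual forecasters are.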

The more interesting finding is that *how* you reason under uncertainty matters more than how much you know. Forecasting performance jumps when you separate numerical extrapolation from event-driven contextual reasoning rather than forcing one judgment to do both at once (see "Can LLMs actually forecast time series better than we think?"), and decomposing the task into contextualization, a macro/micro outlook, and synthesis beats monolithic approaches (see "Can decomposing forecasting into stages unlock numerical and contextual reasoning?"). This maps onto a known human failure: experts often blur their domain knowledge into their probability estimate and get worse calibration for it. On conversation forecasting, a small model explicitly trained to know when to abstain can match a model ten times its size (see "Can models learn to abstain when uncertain about predictions?"). Calibration, the ability to say 'I don't know,' turns out to be the undervalued skill that raw expertise rarely supplies on its own.
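A small sketch can show why calibration and abstention matter for scoring. It uses the Brier score (mean squared error between forecast probabilities and 0/1 outcomes) and a simple confidence-margin abstention rule. The data, the margin value, and the function names are all hypothetical; the abstention rule is an illustrative stand-in, not the training objective used in the cited work.

```python
def brier_score(probabilities: list[float], outcomes: list[int]) -> float:
    """Mean squared error between forecasts and 0/1 outcomes (lower is better)."""
    if len(probabilities) != len(outcomes) or not outcomes:
        raise ValueError("need equally sized, non-empty inputs")
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)


def selective_accuracy(probabilities, outcomes, margin=0.15):
    """Abstain when a forecast is within `margin` of 0.5.

    Returns (coverage, accuracy on answered questions); accuracy is None
    if every question was abstained on.
    """
    answered = [(p, o) for p, o in zip(probabilities, outcomes)
                if abs(p - 0.5) >= margin]
    if not answered:
        return 0.0, None
    correct = sum((p >= 0.5) == bool(o) for p, o in answered)
    return len(answered) / len(outcomes), correct / len(answered)


# Hypothetical outcomes and two hypothetical forecasters.
outcomes = [1, 0, 1, 0]
overconfident = [0.95, 0.95, 0.95, 0.05]  # one confident miss
hedged = [0.70, 0.40, 0.70, 0.30]         # less sure, never badly wrong

print("overconfident Brier:", round(brier_score(overconfident, outcomes), 4))
print("hedged Brier:       ", round(brier_score(hedged, outcomes), 4))
print("hedged coverage/accuracy:", selective_accuracy(hedged, outcomes))
```

With these made-up numbers a single confident miss costs the overconfident forecaster more than all of the hedged forecaster's uncertainty combined, and the hedged forecaster's one abstention falls on its weakest call.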

There's a deeper warning here about where expertise comes from. Competence that's trained only on expert demonstrations is capped by what the curators could imagine: such systems can't learn from their own failures or generalize past the demonstrated cases (see "Can agents learn beyond what their training data shows?"). That's a mirror for human expertise too: deep domain training can lock you into the scenarios your field has already seen, which is the opposite of what uncertain forecasting demands. And even the act of acquiring domain knowledge carries hidden costs: adaptation methods that boost in-domain performance often quietly degrade reasoning faithfulness and flexibility (see "How do domain training techniques actually reshape model behavior?"). Expertise can buy depth at the price of the adaptability that forecasting the genuinely novel requires.

So the honest synthesis: domain expertise improves forecasting less than we assume, and the shortfall is widest exactly where expertise should matter most, under deep uncertainty with sparse signal. What actually moves the needle is structured reasoning, evidence retrieval, and calibrated humility about what you don't know. If you want to go deeper on the surprising flip where prediction and hallucination are the same mechanism, start with "Can LLMs predict novel scientific results better than experts?"; if you want the calibration angle, "Can models learn to abstain when uncertain about predictions?" is the doorway.


Sources (8 notes)

Can language models beat human venture capital experts?

VCBench shows several LLMs exceed human baselines in founder-success prediction, with DeepSeek-V3 achieving 6× market-index precision. In sparse-signal forecasting where experts only modestly beat chance, even raw LLM capability suffices to clear the human bar.

Can LLMs predict novel scientific results better than experts?

BrainBench benchmarks show fine-tuned LLMs outperform neuroscience experts at predicting which experimental results actually occurred. The same pattern-integration tendency that causes hallucination in retrieval tasks enables genuine prediction in forward-looking scenarios.

Can retrieval-augmented language models forecast like human experts?

A retrieval-augmented LM system achieved near-parity with competitive human forecasters on real forecasting questions published after model training cutoffs, sometimes surpassing human crowds. Newer model generations naturally improved forecasting without domain-specific tuning.

Can LLMs actually forecast time series better than we think?

LLMs have stronger intrinsic forecasting ability than recognized, but only when workflows separate numerical reasoning from contextual reasoning. Monolithic prompting obscures this capability; structured decomposition surfaces it.

Can decomposing forecasting into stages unlock numerical and contextual reasoning?

Nexus outperforms pure TSFM and LLM baselines on real-world datasets by decomposing forecasting into contextualization, dual-resolution macro/micro outlook, and synthesis stages. Separating numerical extrapolation from event-driven contextual reasoning avoids forcing one model to handle both simultaneously.

Can models learn to abstain when uncertain about predictions?

Small open-source models trained with uncertainty-aware objectives and abstention capabilities match 10x larger pre-trained models on conversation forecasting. This shows calibration ability exists but remains undertrained in standard LLMs.

Can agents learn beyond what their training data shows?

Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.

How do domain training techniques actually reshape model behavior?

Research shows every adaptation method—from parameter-efficient tuning to knowledge graph curricula—has optimal conditions tied to specific domains. The key finding: visible benefits like performance gains often come with hidden degradation in reasoning faithfulness, capability transfer, and format flexibility.

Next inquiring lines