Can language models beat human experts in domains with sparse historical signals?
This explores whether LLMs can outperform human experts specifically where the historical record is thin or under-represented — and the corpus suggests the real bottleneck isn't expertise but how densely a domain is represented in training data.
Where signal is dense, models can genuinely surpass experts. LLMs finetuned on decades of psychology experiments predict human decisions more accurately than theory-driven cognitive models built by specialists (see "Can language models learn to model human decision making?"). That is the optimistic pole: abundant, structured signal lets the model out-predict the people who study the phenomenon.
The sparse pole looks very different. On a Supreme Court overruling benchmark, models systematically degrade on older cases for one reason: the training corpus over-represents recent precedent, leaving shallow representations of historical material (see "Why do language models struggle with historical legal cases?"). This is the direct answer to the question: when the historical signal is thin, performance falls, not because the reasoning is harder but because the data was never there. There is a clean theory for why. Framing LLMs as autoregressive probability machines predicts that low-probability targets are systematically harder even when they are logically trivial (see "Can we predict where language models will fail?"). Rare history is, by definition, low-probability.
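The autoregressive framing can be made concrete with a minimal sketch. It assumes only the standard chain rule, under which a sequence's log-probability is the sum of its per-token conditional log-probabilities. The function name `sequence_log_prob` and the per-token probabilities are hypothetical, chosen for illustration; they are not measurements from any model or from the cited notes.

```python
import math
from typing import Sequence


def sequence_log_prob(token_probs: Sequence[float]) -> float:
    """Chain rule: log P(y) = sum over t of log P(y_t | y_<t)."""
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("each token probability must be in (0, 1]")
    return sum(math.log(p) for p in token_probs)


# Toy, made-up per-token probabilities (illustrative only, not model output).
common_target = [0.9, 0.8, 0.9]   # a continuation resembling well-covered data
rare_target = [0.9, 0.05, 0.9]    # same length, but one rarely seen token

common_lp = sequence_log_prob(common_target)
rare_lp = sequence_log_prob(rare_target)

# A single unlikely token lowers the score of the whole sequence,
# regardless of how logically simple the target is.
print(f"common: {common_lp:.3f}  rare: {rare_lp:.3f}")
```

Under this toy setup the two targets differ in only one token, yet the rare one is far less likely overall, which is the sense in which thinly represented historical material is a low-probability target.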
The tempting fix, feeding the model the sparse evidence at inference time, runs into two walls. First, strong parametric priors override supplied context, so the model generates what training taught it rather than what you put in front of it (see "Why do language models ignore information in their context?"). Second, prompt engineering can only reorganize knowledge already in the model; it cannot inject knowledge that was absent from training in the first place (see "Can prompt optimization teach models knowledge they lack?"). Together these set a hard ceiling: in a genuinely under-represented domain, no clever prompting recovers what the corpus never contained.
So the honest answer is conditional. Models beat experts where signal is dense and they can be adapted to it, and even adaptation has a sweet spot, since domain-training techniques buy performance at the cost of hidden degradation in reasoning faithfulness and transfer (see "How do domain training techniques actually reshape model behavior?"). There is also a capacity floor: on harder representational tasks like classifying argument schemes, only larger models exceed F1 0.55, while smaller ones plateau around 0.53 (see "Can large language models classify argument schemes reliably?"). The thing you didn't know you wanted to know: 'sparse historical signal' is not one problem but two stacked ceilings. The data was never sampled, and the model's own priors will quietly fill the gap with the present. Beating experts in those domains may require not a better model but a better corpus.
Sources (7 notes)
LLMs finetuned on psychology experiment data predict human behavior more accurately than theory-driven models in decision tasks, capture individual differences in their embeddings, and transfer learning across tasks without task-specific design.
Supreme Court overruling benchmark (236 pairs) reveals era sensitivity: models perform worse on historical cases than modern ones. Root cause is training corpus over-representation of recent cases, creating shallower representations of older precedent.
By framing LLMs as autoregressive probability machines, researchers predicted that tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed these predictions on tasks such as the backwards alphabet and letter counting.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.
Research shows every adaptation method—from parameter-efficient tuning to knowledge graph curricula—has optimal conditions tied to specific domains. The key finding: visible benefits like performance gains often come with hidden degradation in reasoning faithfulness, capability transfer, and format flexibility.
Zero-shot prompting fails uniformly across models. Few-shot with scheme descriptions helps, but only larger models exceed F1 0.55, with Claude reaching 0.65. Smaller models plateau around 0.53, suggesting a representational capacity threshold.