Can Language Models Represent the Past without Anachronism?

Paper · arXiv 2505.00030 · Published April 28, 2025
Philosophy and Subjectivity

Before researchers can use language models to simulate the past, they need to understand the risk of anachronism. We find that prompting a contemporary model with examples of period prose does not produce output consistent with period style. Fine-tuning produces results that are stylistically convincing enough to fool an automated judge, but human evaluators can still distinguish fine-tuned model outputs from authentic historical text. We tentatively conclude that pretraining on period prose may be required in order to reliably simulate historical perspectives for social research.

Introduction. Since human history cannot be repeated in a lab, historical research generally works through observation rather than experiment. And observation is enough to illuminate many aspects of the past. To understand why a bill failed, we read documents and count votes. Questions about cultural change, however, may hinge on forms of influence that are hard to observe directly. It is plausible that the New Woman novel changed early twentieth-century attitudes about marriage, but the extent of that influence remains a largely speculative question. Artificial intelligence could, in principle, cast new light on this kind of question by enabling researchers to simulate the written culture of past eras. If we want to ask how a particular group of writers affected opinions about marriage, we could fine-tune models with and without their works, and measure changes in model behavior. Sensing an opportunity, behavioral scientists have begun using LLMs to simulate human populations [1–14]. There are reasons to be wary of this approach. Boelaert et al.

Discussion / Conclusion. We regard the results of our preliminary experiments as both promising and sobering. Disappointingly, we do not find evidence that simply prompting state-of-the-art language models to generate text “as if” they were historically situated actors produces convincing results. Prompted generations of this type were judged to be plausible in fewer than two thirds of instances. And that number is likely an overestimate of the models’ performance, since it does not penalize cases in which the outputs included telltale disclaimers such as “in 1914, it is not yet known that x” or “I am not familiar with x as of 1914.” Such disclaimers occurred in as many as 20% of generations, depending on the model used. The poor performance of in-context learning is unfortunate, because these methods are the easiest and cheapest ones for AI-based historical research. We emphasize that we have not explored these approaches exhaustively. It may turn out that in-context learning is adequate—now or in the future—for a subset of research areas. But our initial evidence is not encouraging.
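The overestimate described above can be made concrete with a small sketch. Assuming each generation comes with a boolean plausibility verdict, a simple pattern check can flag disclaimers of the kind quoted and recompute the plausibility rate with those cases counted as failures. The regular expressions, the function names, and the toy records below are illustrative assumptions, not the evaluation code or data used in the study.

```python
import re

# Hypothetical patterns modeled on the two disclaimers quoted in the text;
# a real study would need a broader, validated list.
DISCLAIMER_PATTERNS = [
    re.compile(r"\bin \d{4}, it is not yet known\b", re.IGNORECASE),
    re.compile(r"\bI am not familiar with\b.*\bas of \d{4}\b", re.IGNORECASE),
]


def has_disclaimer(text):
    """Return True if the text contains an anachronism-revealing disclaimer."""
    return any(p.search(text) for p in DISCLAIMER_PATTERNS)


def plausibility_rates(records):
    """Compute raw and disclaimer-penalized plausibility rates.

    `records` is a list of (text, judged_plausible) pairs.
    """
    if not records:
        raise ValueError("records must not be empty")
    n = len(records)
    # Raw rate: every generation the judge accepted counts as a success.
    raw = sum(1 for _, ok in records if ok)
    # Penalized rate: accepted generations with a disclaimer count as failures.
    penalized = sum(1 for text, ok in records if ok and not has_disclaimer(text))
    return raw / n, penalized / n


# Toy, made-up examples for illustration only (not data from the study).
toy_records = [
    ("The suffrage question grows more pressing each year.", True),
    ("In 1914, it is not yet known that the war will last four years.", True),
    ("I am not familiar with radio broadcasting as of 1914.", False),
    ("Motor-cars have quite altered the character of the high road.", True),
]

raw_rate, penalized_rate = plausibility_rates(toy_records)
print(f"raw: {raw_rate:.2f}, penalized: {penalized_rate:.2f}")
# prints "raw: 0.75, penalized: 0.50"
```

On the toy records, one accepted generation contains a disclaimer, so the penalized rate falls below the raw rate. This is the direction of the bias noted above; the size of the gap on real data would depend on the model and on how disclaimers are detected.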