Fine-tuning Language Models for Factuality

The fluency and creativity of large pre-trained language models (LLMs) have led to their widespread use, sometimes even as a replacement for traditional search engines. Yet language models are prone to making convincing but factually inaccurate claims, often referred to as ‘hallucinations.’ These errors can inadvertently spread misinformation or harmfully perpetuate misconceptions. Further, manual fact-checking of model responses is a time-consuming process, making human factuality labels expensive to acquire. In this work, we fine-tune language models to be more factual, without human labeling and targeting more open-ended generation settings than past work. We leverage two key recent innovations in NLP to do so. First, several recent works have proposed methods for judging the factuality of open-ended text by measuring consistency with an external knowledge base or simply a large model’s confidence scores. Second, the direct preference optimization algorithm enables straightforward fine-tuning of language models on objectives other than supervised imitation, using a preference ranking over possible model responses.
Introduction. Recent developments in training large language models (LLMs), particularly methods that learn from rankings over responses such as reinforcement learning from human feedback (RLHF) (Christiano et al., 2017; Ziegler et al., 2020; Ouyang et al., 2022), have enabled the development of powerful, engaging dialogue agents. State-of-the-art LLMs are pre-trained on a vast amount of knowledge in large datasets (Touvron et al., 2023a;b) and further fine-tuned to apply this knowledge to follow diverse instructions or complete more specific tasks (Chung et al., 2022; Chen et al., 2021). However, despite these large language models’ exposure to diverse datasets, they are prone to confidently generating incorrect claims. One recent study shows that GPT-3.5 (ChatGPT) produces false citations more often than not when asked to provide the authors of a given study (Agrawal et al., 2023).
Discussion / Conclusion. In this paper, we show a practical, effective strategy to improve a language model’s ability to generate factual content, specifically focusing on long-form generations. We develop and study two different approaches to estimating the truthfulness of long-form text and optimize for these criteria using preference-based learning. In addition to existing reference-based truthfulness estimators that leverage external knowledge to establish the truth of a particular statement, we introduce a novel reference-free procedure for estimating truthfulness that uses the language model’s own uncertainty as an indication of factuality. Our experiments show that fine-tuning a language model with either criterion reliably reduces the number of incorrect facts (i.e. hallucinations) that the model generates. Reference-free approaches like the one we have introduced provide a particularly scalable self-supervision strategy to improve factuality, eliminating the need for a reference corpus of ‘gold’ texts. The experimental results suggest a number of avenues for future work.