Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?

Paper · arXiv 2405.05904 · Published May 9, 2024
Training and Fine-Tuning

When large language models are aligned via supervised fine-tuning, they may encounter new factual information that was not acquired through pre-training. It is often conjectured that this can teach the model the behavior of hallucinating factually incorrect responses, as the model is trained to generate facts that are not grounded in its pre-existing knowledge. In this work, we study the impact of such exposure to new knowledge on the capability of the fine-tuned model to utilize its pre-existing knowledge. To this end, we design a controlled setup, focused on closed-book QA, where we vary the proportion of the fine-tuning examples that introduce new knowledge. We demonstrate that large language models struggle to acquire new factual knowledge through fine-tuning, as fine-tuning examples that introduce new knowledge are learned significantly slower than those consistent with the model’s knowledge. However, we also find that as the examples with new knowledge are eventually learned, they linearly increase the model’s tendency to hallucinate.
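The controlled setup described above, where the share of fine-tuning examples that introduce new knowledge is varied, can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `mix_dataset`, its parameters, and the premise that examples have already been split into a pool consistent with the model's knowledge and a pool of new knowledge are all assumptions made for the example.

```python
import random
from typing import List, Sequence, Tuple

Example = Tuple[str, str]  # (question, gold answer); a hypothetical record layout


def mix_dataset(
    known: Sequence[Example],
    unknown: Sequence[Example],
    unknown_fraction: float,
    size: int,
    seed: int = 0,
) -> List[Example]:
    """Build a fixed-size fine-tuning set with a chosen share of new-knowledge examples.

    `known` and `unknown` are assumed to be pre-categorized pools; how that
    categorization is done is outside the scope of this sketch.
    """
    if not 0.0 <= unknown_fraction <= 1.0:
        raise ValueError("unknown_fraction must be in [0, 1]")
    n_unknown = round(size * unknown_fraction)
    n_known = size - n_unknown
    if n_unknown > len(unknown) or n_known > len(known):
        raise ValueError("not enough examples in one of the pools")
    rng = random.Random(seed)  # seeded so the mix is reproducible
    mixed = rng.sample(list(unknown), n_unknown) + rng.sample(list(known), n_known)
    rng.shuffle(mixed)  # avoid presenting all new-knowledge examples first
    return mixed
```

Keeping `size` fixed while sweeping `unknown_fraction` means that any change in downstream behaviour can be attributed to the composition of the data rather than to its amount.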

Introduction. Pre-training Large Language Models (LLMs) on textual corpora embeds substantial factual knowledge in their parameters (Petroni et al., 2019; AlKhamissi et al., 2022; Cohen et al., 2023), which is essential for excelling in various downstream applications. These models often require further alignment to desired behaviors, typically achieved through supervised fine-tuning on instruction-following tasks (Wei et al., 2022; Mishra et al., 2022) and preference learning from human feedback (Ouyang et al., 2022; Rafailov et al., 2024). In the fine-tuning phase, the model is usually trained on outputs created by human annotators or other LLMs. As a result, the model may encounter new factual information, extending beyond the knowledge it acquired during pre-training. This raises the question of how LLMs integrate new facts outside of their pre-existing knowledge. One possibility is that the model simply adapts by learning this new factual information.

Discussion / Conclusion. Practical Implications. This work highlights the risk in using supervised fine-tuning to update LLMs’ knowledge, as we present empirical evidence that acquiring new knowledge through fine-tuning is correlated with hallucinations with respect to pre-existing knowledge. Additionally, this work raises important questions for future exploration regarding fine-tuning practices. We saw that Unknown examples are fitted more slowly than Known ones, so their negative effect manifests as a form of overfitting, which emphasizes the importance of using early stopping instead of a fixed number of fine-tuning steps. However, early stopping may be less effective when fine-tuning on numerous tasks with distinct optimal stopping points. An alternative solution can be to align the fine-tuning data with the model’s knowledge by filtering out Unknown examples. We show initial evidence that this can reduce the risk of overfitting without compromising performance. A possible drawback of filtering is that Unknown fine-tuning examples can still be useful to teach LLMs to express uncertainty on Unknown test examples (Zhang et al., 2023).
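The two mitigations discussed here, filtering out Unknown examples and early stopping, can be sketched in a few lines. This is a rough illustration under stated assumptions, not the procedure used in the paper: `filter_unknown`, `best_checkpoint`, the `is_known` predicate, and the patience rule are all hypothetical names and choices, and how an example is judged Known or Unknown is left to the caller.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (question, gold answer); a hypothetical record layout


def filter_unknown(
    examples: Sequence[Example],
    is_known: Callable[[Example], bool],
) -> List[Example]:
    """Keep only examples the caller-supplied predicate marks as Known.

    `is_known` is an assumption of this sketch; it stands in for whatever
    check decides that the model already knows the answer.
    """
    return [ex for ex in examples if is_known(ex)]


def best_checkpoint(dev_scores: Sequence[float], patience: int = 2) -> int:
    """Return the index of the checkpoint to keep under patience-based early stopping.

    `dev_scores[i]` is a development-set score (higher is better) measured
    after the i-th evaluation. Training is treated as stopped once `patience`
    consecutive evaluations fail to beat the best score so far.
    """
    if not dev_scores:
        raise ValueError("dev_scores must not be empty")
    best_idx = 0
    best_score = dev_scores[0]
    for i, score in enumerate(dev_scores):
        if score > best_score:
            best_score, best_idx = score, i
        elif i - best_idx >= patience:
            break  # no improvement for `patience` evaluations: stop here
    return best_idx
```

In practice `is_known` might wrap a lookup into a precomputed set of questions the base model answers correctly, and `dev_scores` would come from periodic evaluation during fine-tuning. As the text notes, a single stopping point may fit poorly when many tasks with different optimal stopping points are trained together.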