Should We Fine-Tune or RAG? Evaluating Different Techniques to Adapt LLMs for Dialogue

Paper · arXiv 2406.06399 · Published June 10, 2024
Training and Fine-Tuning · Retrieval-Augmented Generation (RAG)

We study the limitations of Large Language Models (LLMs) for the task of response generation in human-machine dialogue. Several techniques have been proposed in the literature for different dialogue types (e.g., Open-Domain). However, the evaluations of these techniques have been limited in terms of base LLMs, dialogue types, and evaluation metrics. In this work, we extensively analyze different LLM adaptation techniques when applied to different dialogue types. We have selected two base LLMs, Llama2C and MistralI, and four dialogue types: Open-Domain, Knowledge-Grounded, Task-Oriented, and Question Answering. We evaluate the performance of in-context learning and fine-tuning techniques across datasets selected for each dialogue type. We assess the impact of incorporating external knowledge to ground the generation in two scenarios: Retrieval-Augmented Generation (RAG) and gold knowledge. We adopt consistent evaluation and explainability criteria for automatic metrics and human evaluation protocols. Our analysis shows that there is no universal best technique for adapting large language models, as the efficacy of each technique depends on both the base LLM and the specific type of dialogue.

Introduction. In recent years, Large Language Models (LLMs) have been employed for the task of response generation in human-machine dialogues (Hosseini-Asl et al., 2020a; Izacard and Grave, 2021; Komeili et al., 2022). Such models have been applied to several dialogue types, including Open-Domain Dialogues (i.e., informal conversations about trivial matters), Knowledge-Grounded Dialogues (i.e., conversations with a system that provides factual responses), Task-Oriented Dialogues (i.e., conversations where the system helps a user to achieve a specific goal), and Question Answering (i.e., question-answer exchanges given context). However, recent studies have shown the shortcomings of LLMs as dialogue model surrogates, as they are prone to generating toxic, biased, and irrelevant responses (Zhang et al., 2020; Mousavi et al., 2022, 2023; Lin and Chen, 2023). To adapt LLMs to dialogue types, different techniques have been employed, such as in-context learning (Brown et al., 2020; Chen et al., 2023; Meade et al., 2023) and fine-tuning (Wang et al., 2022; Komeili et al., 2022; Huang et al., 2023).

Discussion / Conclusion. We have conducted an extensive analysis of the efficacy of fine-tuning and in-context learning to adapt LLMs for different dialogue types. We have experimented with Retrieval-Augmented Generation (RAG) and gold knowledge to assess the impact of grounding the response generation on external knowledge. We have studied the models' performance using consistent criteria in both automatic (perplexity, explainability studies) and human evaluations. Our study highlights the limitations of currently available automatic metrics and the necessity of conducting human evaluations to advance human-machine dialogue research, as the evaluations by human judges correlate poorly with automatic metrics. Furthermore, the human evaluations we conducted indicate that there is no universal best technique for adapting LLMs to a dialogue type, and the performance of each technique depends on the base LLM as well as the dialogue type.
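For readers unfamiliar with perplexity, the automatic metric named above, the following is a minimal sketch of the standard definition: the exponential of the average negative log-likelihood a model assigns to the tokens of a response. The function name and the toy log-probabilities are illustrative assumptions and are not taken from the paper, which does not describe its exact computation in this section.

```python
import math


def perplexity(token_logprobs):
    """Return exp of the mean negative log-likelihood.

    token_logprobs: natural-log probabilities that a model assigned to
    each token of a response (hypothetical input, for illustration).
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)


# Toy values, not results from the paper: four tokens, each assigned
# probability 0.25, so the model is as uncertain as a uniform choice
# among four options.
toy_logprobs = [math.log(0.25)] * 4
print(perplexity(toy_logprobs))
```

Lower perplexity means the model finds the reference response more predictable. As the conclusion above notes, such automatic scores correlated poorly with human judgments in this study, so they should not be read as a direct measure of response quality.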