MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries
Retrieval-augmented generation (RAG) augments large language models (LLMs) by retrieving relevant knowledge. It shows promise in mitigating LLM hallucinations and enhancing response quality, thereby facilitating the wider adoption of LLMs in practice. However, we find that existing RAG systems are inadequate in answering multi-hop queries, which require retrieving and reasoning over multiple pieces of supporting evidence. Furthermore, to our knowledge, no existing RAG benchmarking dataset focuses on multi-hop queries. In this paper, we develop a novel dataset, MultiHop-RAG, which consists of a knowledge base, a large collection of multi-hop queries, their ground-truth answers, and the associated supporting evidence. We detail the procedure for building the dataset, which uses an English news article dataset as the underlying RAG knowledge base. We demonstrate the benchmarking utility of MultiHop-RAG in two experiments. The first experiment compares different embedding models for retrieving evidence for multi-hop queries. In the second experiment, we examine the capabilities of various state-of-the-art LLMs, including GPT-4, PaLM, and Llama2-70B, in reasoning over and answering multi-hop queries given the evidence.
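To make the dataset structure and the retrieval experiment concrete, the following is a minimal sketch of one benchmark item and one way to score a retriever against its supporting evidence. It is an illustration under stated assumptions, not the paper's actual schema or metric: the class name `MultiHopExample`, its field names, the chunk identifiers, the toy query, and the `evidence_recall_at_k` function are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MultiHopExample:
    """One benchmark item. Field names are hypothetical, not the released schema."""
    query: str
    answer: str
    evidence_ids: List[str]  # IDs of knowledge-base chunks that support the answer


def evidence_recall_at_k(retrieved_ids: List[str], evidence_ids: List[str], k: int) -> float:
    """Fraction of gold evidence chunks that appear among the top-k retrieved chunks.

    This is an assumed, illustrative metric; the paper may report different ones.
    """
    gold = set(evidence_ids)
    if not gold:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(gold & top_k) / len(gold)


# Toy, made-up data: a query whose answer needs evidence from two documents.
example = MultiHopExample(
    query="Which company reported higher revenue, A or B?",
    answer="A",
    evidence_ids=["doc_a#3", "doc_b#7"],
)
retrieved = ["doc_a#3", "doc_c#1", "doc_b#7", "doc_d#2"]  # ranked retriever output

print(evidence_recall_at_k(retrieved, example.evidence_ids, k=2))  # 0.5
print(evidence_recall_at_k(retrieved, example.evidence_ids, k=4))  # 1.0
```

In this toy case, only one of the two required evidence chunks is in the top 2, so a single-hop-style retriever that stops early would leave the LLM without the second fact it needs.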
Introduction. The emergence of large language models (LLMs), such as ChatGPT, has fostered a wide range of innovations, powering intelligent chatbots and other natural language processing (NLP) applications (OpenAI, 2023). One promising use case is Retrieval-Augmented Generation (RAG) (Asai et al., 2023), which improves the output of a large language model by consulting an external knowledge base, outside the LLM's training data, before generating a response. RAG improves the quality of LLM responses (Borgeaud et al., 2022) and also mitigates hallucinations, thereby enhancing the models’ credibility (Gao et al., 2023). LLM-based frameworks, such as LlamaIndex (Liu, 2022) and LangChain (Chase, 2022), specialize in supporting RAG pipelines. In real-world RAG applications, a user’s query often requires retrieving and reasoning over evidence from multiple documents; such a query is known as a multi-hop query. For instance, consider financial analysis using a database of financial reports.
Discussion / Conclusion. In this work, we introduce MultiHop-RAG, a novel dataset designed for queries that require retrieving and reasoning over multiple pieces of supporting evidence. Such multi-hop queries are representative of user queries commonly encountered in real-world scenarios. MultiHop-RAG consists of a knowledge base, a large collection of multi-hop queries, their ground-truth answers, and the associated supporting evidence. This paper details the creation process of MultiHop-RAG, which employs a hybrid approach that combines human effort with GPT-4. Additionally, we explore two use cases of MultiHop-RAG in benchmarking RAG systems, highlighting potential applications of the dataset. By publicly releasing MultiHop-RAG, we aim to provide a valuable resource to the community and to contribute to the advancement and benchmarking of RAG systems. This work has several limitations that can be addressed in future research.