Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

Paper · arXiv 2403.04132 · Published March 7, 2024

Large Language Models (LLMs) have unlocked new capabilities and applications; however, evaluating the alignment with human preferences still poses significant challenges. To address this issue, we introduce Chatbot Arena, an open platform for evaluating LLMs based on human preferences. Our methodology employs a pairwise comparison approach and leverages input from a diverse user base through crowdsourcing. The platform has been operational for several months, amassing over 240K votes. This paper describes the platform, analyzes the data we have collected so far, and explains the tried-and-true statistical methods we are using for efficient and accurate evaluation and ranking of models. We confirm that the crowdsourced questions are sufficiently diverse and discriminating and that the crowdsourced human votes are in good agreement with those of expert raters. These analyses collectively establish a robust foundation for the credibility of Chatbot Arena. Because of its unique value and openness, Chatbot Arena has emerged as one of the most referenced LLM leaderboards, widely cited by leading LLM developers and companies. Our demo is publicly available at https://chat.lmsys.org.

Introduction. Recent advancements in large language models (LLMs) have significantly expanded their capabilities beyond traditional natural language processing boundaries, addressing a broad array of general tasks (OpenAI, 2023; Gemini et al., 2023; Touvron et al., 2023). These developments underscore the potential of LLMs but also have raised concerns with respect to performance evaluation. Current benchmarks often fail to capture the nuanced and diverse aspects of these models, particularly in assessing their alignment with human To assess the performance of LLMs, the research community has introduced a variety of benchmarks. These benchmarks can be categorized based on two factors: the source of questions (either static or live) and the evaluation metric (either ground truth or human preference). According to these factors, benchmarks can be classified into four categories, as shown in Figure 1.

Discussion / Conclusion. Limitations. Although our user base is extensive, we anticipate that it will primarily consist of LLM hobbyists and researchers who are eager to experiment with and evaluate the latest LLMs. This inclination may result in a biased distribution of users. Additionally, despite the wide array of topics encompassed by the prompts discussed in previous sections, the data predominantly comes from our online chat interface. This source might not accurately reflect the real-world usage of LLMs in production environments or specialized domains, potentially leading to a skewed prompt distribution. Moreover, our study concentrates on assessing the helpfulness of LLMs but overlooks their safety aspects. We recognize the possibility and necessity of a parallel In this paper, we present Chatbot Arena, an open platform for evaluating LLMs through crowdsourced, pairwise human preferences. We conduct an in-depth analysis of the crowdsourced user prompts and preference votes to validate the diversity and quality. We develop an efficient model sampling and ranking algorithm. Our dataset including 100K pairwise preference votes will be released for future research.

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

Synthesis notes that discuss concepts related to this paper