RewardBench: Evaluating Reward Models for Language Modeling
Reward models (RMs) are at the crux of successfully using RLHF to align pretrained models to human preferences, yet relatively little work has focused on evaluating them. Evaluating reward models presents an opportunity to understand the opaque technologies used for alignment of language models and which values are embedded in them. Resources for training and understanding reward models are sparse in the nascent open-source community around them. To enhance scientific understanding of reward models, we present REWARDBENCH, a benchmark dataset and code-base for evaluation. The REWARDBENCH dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety, used to benchmark how reward models perform on challenging, structured, and out-of-distribution queries. We create specific comparison datasets for RMs in which there are subtle but verifiable reasons (e.g. bugs, incorrect facts) why one answer should be preferred to another. On the REWARDBENCH leaderboard, we evaluate reward models trained with a variety of methods, such as the direct MLE training of classifiers and the implicit reward modeling of Direct Preference Optimization (DPO).
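As an illustration of how a prompt-chosen-rejected trio can be scored, the following is a minimal sketch, not the REWARDBENCH code-base itself. It assumes a reward model can be wrapped as a function that maps a prompt and a completion to a scalar, and it counts a trio as correct when the chosen completion scores strictly higher than the rejected one. The names `pairwise_accuracy`, `toy_score`, and `EXAMPLE_TRIOS` are hypothetical, and the word-overlap scorer is only a stand-in for a real model.

```python
from typing import Callable, Iterable, Tuple

# A (prompt, chosen, rejected) trio, mirroring the dataset description.
Trio = Tuple[str, str, str]
# Assumed interface: a reward model reduced to score(prompt, completion) -> float.
ScoreFn = Callable[[str, str], float]


def pairwise_accuracy(trios: Iterable[Trio], score: ScoreFn) -> float:
    """Fraction of trios where the chosen completion outscores the rejected one."""
    trios = list(trios)
    if not trios:
        raise ValueError("need at least one trio")
    # Ties count as failures here; this is a design choice of the sketch.
    wins = sum(
        1
        for prompt, chosen, rejected in trios
        if score(prompt, chosen) > score(prompt, rejected)
    )
    return wins / len(trios)


def toy_score(prompt: str, completion: str) -> float:
    """Hypothetical stand-in scorer: number of distinct prompt words reused."""
    prompt_words = set(prompt.lower().split())
    completion_words = set(completion.lower().split())
    return float(len(prompt_words & completion_words))


# Made-up trios for illustration only; they are not drawn from the benchmark.
EXAMPLE_TRIOS = [
    ("What is 2 + 2?", "2 + 2 is 4", "I like turtles"),
    ("Name a primary color", "Red is a primary color", "A color"),
    ("Is the sky green?", "No, it is usually blue", "Yes the sky is green"),
]

if __name__ == "__main__":
    print(pairwise_accuracy(EXAMPLE_TRIOS, toy_score))
```

The third made-up trio is one the toy scorer gets wrong: the rejected answer echoes the prompt while stating an incorrect fact, which is the kind of subtle but verifiable difference the comparison datasets are meant to probe.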
Introduction. Reinforcement learning from human feedback (RLHF) is a necessary but opaque tool underlying the success of popular language models (LMs) such as OpenAI’s ChatGPT (Schulman et al., 2022) and Anthropic’s Claude (Bai et al., 2022a). The prevalence of RLHF stems from its efficacy at circumventing one of the greatest difficulties in integrating human preferences into language models: specifying an explicit reward (Christiano et al., 2017). Reward models (RMs) are central to this process. They are created by copying the original language model and training it on labeled preference data, producing a model that can predict whether one piece of text is likely to be preferred over another. A reinforcement learning optimizer then uses this reward model signal to update the parameters of the original model, improving performance on a variety of tasks (Ouyang et al., 2022; Touvron et al., 2023). While the post-RLHF model (known as the policy) and even the pretrained model are extensively documented and evaluated, basic components of the RLHF process, such as the RMs, receive far less attention.
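To make the step of training on labeled preference data concrete, the sketch below shows one common formulation, assumed here rather than taken from the text: a Bradley-Terry style model in which the probability that the chosen text is preferred is the sigmoid of the difference between two scalar rewards, and the training loss is the negative log of that probability. The function names are hypothetical, and the rewards are plain floats standing in for model outputs.

```python
import math


def preference_probability(r_chosen: float, r_rejected: float) -> float:
    """P(chosen preferred) = sigmoid(r_chosen - r_rejected), computed stably."""
    diff = r_chosen - r_rejected
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    # For negative differences, rewrite the sigmoid to avoid overflow in exp.
    e = math.exp(diff)
    return e / (1.0 + e)


def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood of the observed preference: -log sigmoid(diff)."""
    diff = r_chosen - r_rejected
    # -log sigmoid(diff) equals softplus(-diff); branch for numerical stability.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


if __name__ == "__main__":
    # The loss shrinks as the chosen reward pulls ahead of the rejected one.
    for margin in (-2.0, 0.0, 2.0):
        print(margin, pairwise_loss(margin, 0.0))
```

Minimizing this loss over many labeled pairs pushes the model to assign higher rewards to preferred text, which is what lets it predict which of two pieces of text is likely to be preferred.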
Discussion / Conclusion. We present REWARDBENCH and show the variety of performance characteristics of current reward models in order to improve understanding of RLHF. While we covered a variety of topics important to the alignment of LMs, a crucial next step is to correlate performance on REWARDBENCH with usefulness in RLHF. Initial experiments that rank RMs using best-of-N sampling and downstream training with PPO are underway. We have taken a first step toward understanding which values are embedded in RLHF training across many base models and preference datasets. The toolkit we have released can easily be expanded to include custom data to audit a specific property of the RLHF process. Scores for RMs from private LM providers appear on the public leaderboard but are omitted from the paper because they are not reproducible. REWARDBENCH is one of many tools that will help us understand the science of whose and what values are embedded in our language models.
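For readers unfamiliar with best-of-N sampling, the following minimal sketch shows the selection step only: given N candidate completions for a prompt, return the one the reward model scores highest. It is an illustration under assumptions, not the experimental setup referred to above; `best_of_n` and the length-based scorer in the usage example are hypothetical.

```python
from typing import Callable, Sequence


def best_of_n(
    prompt: str,
    candidates: Sequence[str],
    score: Callable[[str, str], float],
) -> str:
    """Return the candidate with the highest reward for the given prompt."""
    if not candidates:
        raise ValueError("need at least one candidate")
    # max() keeps the first candidate among ties, so the result is deterministic.
    return max(candidates, key=lambda completion: score(prompt, completion))


if __name__ == "__main__":
    # Hypothetical stand-in scorer that simply prefers longer completions.
    def length_score(prompt: str, completion: str) -> float:
        return float(len(completion))

    samples = ["ok", "a fuller answer", "short"]
    print(best_of_n("Explain RLHF", samples, length_score))
```

Under this procedure, a better reward model should select better completions from the same pool of candidates, which is what makes it a natural way to compare RMs on downstream usefulness.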