Pre-Trained Policy Discriminators are General Reward Models
We offer a novel perspective on reward modeling by formulating it as a policy discriminator, which quantifies the difference between two policies to generate a reward signal, guiding the training policy towards a target policy with desired behaviors. Based on this conceptual insight, we propose a scalable pre-training method named POLicy DiscriminAtive LeaRning (POLAR), which trains a reward model (RM) to discern identical policies and discriminate different ones. Unlike traditional reward modeling methods relying on absolute preferences, POLAR captures the relative difference between one policy and an arbitrary target policy, which is a scalable, high-level optimization objective suitable for modeling generic ranking relationships. Leveraging the POLAR pre-training paradigm, we present a series of RMs with parameter scales from 1.8B to 7B. Empirical results show that POLAR substantially outperforms traditional non-pre-trained methods, significantly enhancing RM performance. For instance, compared to SOTA baselines, POLAR-7B improves preference accuracy from 54.8% to 81.0% on STEM tasks and from 57.9% to 85.5% on creative writing tasks.
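To make the discriminative objective and the reported metric concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a pairwise Bradley-Terry style loss, which this section does not specify, and the function names `pairwise_discrimination_loss` and `preference_accuracy` are hypothetical. The RM is assumed to output a scalar score for each candidate trajectory: a trajectory sampled from the same policy as the reference should score higher than one sampled from a different policy.

```python
import math


def pairwise_discrimination_loss(score_same: float, score_diff: float) -> float:
    """Assumed Bradley-Terry style loss: -log(sigmoid(score_same - score_diff)).

    score_same: RM score for a trajectory from the same policy as the reference.
    score_diff: RM score for a trajectory from a different policy.
    """
    margin = score_same - score_diff
    # Numerically stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def preference_accuracy(pairs: list[tuple[float, float]]) -> float:
    """Fraction of (preferred_score, rejected_score) pairs the RM ranks correctly."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    correct = sum(1 for preferred, rejected in pairs if preferred > rejected)
    return correct / len(pairs)


if __name__ == "__main__":
    # The loss shrinks as the same-policy trajectory is scored further above the other.
    print(pairwise_discrimination_loss(2.0, 0.0))
    print(pairwise_discrimination_loss(0.0, 2.0))
    # Illustrative scores only; these are not results from the paper.
    print(preference_accuracy([(1.0, 0.0), (0.0, 1.0), (2.0, 1.0), (3.0, 0.0)]))
```

Under this assumed form, the loss depends only on the score difference between the two trajectories, which matches the relative (rather than absolute) notion of preference described above.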
Introduction. Reinforcement learning (RL) plays a crucial role in the post-training of large language models (LLMs) [126; 78; 4]. Its success hinges on the reward model’s (RM) ability to provide precise and stable feedback to the policy model [107; 28]. Although recent approaches successfully leverage labeled preference pairs to train RMs for alignment with human preferences, these methods often face challenges in terms of scalability and generalization [116; 64; 72; 49; 105]. The former is limited by the difficulty of acquiring large volumes of high-quality labeled pairs [23; 21], while the latter stems from the fact that this subjective approach to modeling human preferences makes RMs vulnerable to reward hacking [15; 12; 120]. On the other hand, several works, such as DeepSeek’s R1 [34], utilize

Before delving into reward modeling, it is instructive to revisit the widespread success of LLMs.
Discussion / Conclusion.

Pros and Cons of Reference Trajectories. On the one hand, the use of reference trajectories significantly enhances the accuracy and reliability of reward models. Ideally, incorporating multiple

Exploring Scaling Potential of POLAR. Given the observed scaling law behavior, we anticipate that the current POLAR series has substantial room for further performance improvements. In future research, we plan to leverage greater computational resources to train larger-scale POLAR RMs. The data preparation strategy employed by POLAR can be effectively scaled up to extensive pre-training datasets; however, this process inherently demands substantial policy sampling. Compared to traditional LLM data preparation, generating sufficient training data for POLAR is likely to incur considerably higher computational costs. By scaling up model size and computational resources, we aim to thoroughly investigate the limits of POLAR and release stronger, open-source models, thus facilitating continued advancements within the research community.

We propose a novel perspective on reward modeling by reformulating it as a policy discriminator and introduce a scalable approach named Policy Discriminative Learning (POLAR).