A Survey of Reinforcement Learning from Human Feedback

Paper · arXiv 2312.14925 · Published December 22, 2023

Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning (RL) that learns from human feedback instead of relying on an engineered reward function. Building on prior work on the related setting of preference-based reinforcement learning (PbRL), it stands at the intersection of artificial intelligence and human-computer interaction. This positioning offers a promising avenue to enhance the performance and adaptability of intelligent systems while also improving the alignment of their objectives with human values. The training of large language models (LLMs) has impressively demonstrated this potential in recent years, where RLHF played a decisive role in directing the model’s capabilities toward human objectives. This article provides a comprehensive overview of the fundamentals of RLHF, exploring the intricate dynamics between RL agents and human input. While recent focus has been on RLHF for LLMs, our survey adopts a broader perspective, examining the diverse applications and wide-ranging impact of the technique. We delve into the core principles that underpin RLHF, shedding light on the symbiotic relationship between algorithms and human feedback, and discuss the main research trends in the field.

Introduction. In reinforcement learning (RL), an agent traditionally navigates through an environment and attempts to make optimal decisions (i.e., action choices) through a process of trial and error. Whether a decision is optimal is determined solely by reward signals. These signals have to be defined by a system designer based on measurements of the agent’s performance, ensuring that the learning agent receives the feedback it needs to learn the correct behavior. Designing a reward function, however, is challenging. In many applications, success is hard to define formally and to measure. Beyond that, a sparse signal of success may not be well suited for agent learning, resulting in the need for reward shaping (Ng et al., 1999), where the reward signal is transformed into one that is more suitable for learning. Shaping, however, often makes the reward signal more susceptible to spurious correlations: behaviors are rewarded because they usually correlate with the true objective, even though they are not valuable in themselves.
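To make the idea of reward shaping concrete, the following minimal sketch assumes the potential-based form studied by Ng et al. (1999), in which the term γΦ(s') − Φ(s) is added to the original reward. The one-dimensional corridor, the distance-based potential, and all function names are hypothetical choices made for illustration; they are not taken from this survey.

```python
from typing import Callable, List

GOAL = 5     # hypothetical corridor with states 0..5; state 5 is the goal
GAMMA = 0.9  # discount factor (illustrative value)


def sparse_reward(state: int, next_state: int) -> float:
    """Sparse success signal: 1 only on the transition that reaches the goal."""
    return 1.0 if next_state == GOAL else 0.0


def potential(state: int) -> float:
    """Hand-chosen potential (an assumption): negative distance to the goal."""
    return -float(abs(GOAL - state))


def shaped_reward(state: int, next_state: int, gamma: float = GAMMA) -> float:
    """Sparse reward plus the shaping term gamma * Phi(s') - Phi(s)."""
    shaping = gamma * potential(next_state) - potential(state)
    return sparse_reward(state, next_state) + shaping


def discounted_return(
    trajectory: List[int],
    reward_fn: Callable[[int, int], float],
    gamma: float = GAMMA,
) -> float:
    """Discounted sum of rewards over consecutive state pairs of a trajectory."""
    steps = zip(trajectory, trajectory[1:])
    return sum(gamma**t * reward_fn(s, s_next) for t, (s, s_next) in enumerate(steps))


if __name__ == "__main__":
    direct = [0, 1, 2, 3, 4, 5]
    detour = [0, 1, 0, 1, 2, 3, 4, 5]
    for name, traj in [("direct", direct), ("detour", detour)]:
        sparse = discounted_return(traj, sparse_reward)
        shaped = discounted_return(traj, shaped_reward)
        print(name, "sparse:", sparse, "shaped:", shaped, "difference:", shaped - sparse)
```

In this sketch every step toward the goal receives an immediate positive signal, while a step away is penalized. Because the shaping terms telescope, the shaped and sparse returns of a trajectory differ only by γ^T Φ(s_T) − Φ(s_0), which depends on the start and end states and not on the path taken. A bonus that does not have this form (for example, a fixed reward for any movement) carries no such guarantee and may reward behavior that merely correlates with success.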
