Do unimodal reward models actually serve all user preferences?
Standard RLHF assumes a single utility function across all users, but what happens when preferences genuinely conflict? Does averaging these opposing preferences into one model systematically fail certain groups?
The dominant RLHF formulation assumes all human preferences derive from a single utility function. The recipe follows from that: apply the Bradley-Terry-Luce (BTL) model under this assumption, fit a reward to the aggregate preference data, then optimize the policy against that reward. The model class is unimodal: one reward function stands in for every user.
This breaks when human preferences are genuinely multi-modal — when different groups of users derive opposing utilities from the same response attributes. The classic case: one group prefers detailed responses, another prefers concise ones. Maximum-likelihood estimation under unimodal BTL learns a reward function that averages these preferences. The resulting policy is optimized to a centroid that maximizes nobody's utility. Each subgroup is systematically failed.
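A toy calculation makes the averaging concrete. This is a minimal sketch, not taken from the paper: the counts are hypothetical, there is a single detailed-vs-concise pair, and for a single pair the BTL maximum-likelihood reward gap reduces to the logit of the empirical win rate.

```python
import math


def btl_mle_reward_gap(wins_detailed: int, wins_concise: int) -> float:
    """BTL maximum-likelihood estimate of r_detailed - r_concise for one pair.

    Under BTL, P(detailed beats concise) = sigmoid(r_detailed - r_concise),
    so the MLE of the gap is the logit of the observed win rate.
    """
    p = wins_detailed / (wins_detailed + wins_concise)
    return math.log(p / (1.0 - p))


# Hypothetical annotation counts: (times detailed won, times concise won).
group_a = (90, 10)  # users who mostly prefer detailed responses
group_b = (10, 90)  # users who mostly prefer concise responses
pooled = (group_a[0] + group_b[0], group_a[1] + group_b[1])

gap_a = btl_mle_reward_gap(*group_a)
gap_b = btl_mle_reward_gap(*group_b)
gap_pooled = btl_mle_reward_gap(*pooled)

print(f"group A gap: {gap_a:+.3f}")
print(f"group B gap: {gap_b:+.3f}")
print(f"pooled gap:  {gap_pooled:+.3f}")
```

With these counts the pooled gap is zero while each group's own gap is about ±2.2: the single fitted reward is indifferent between the two styles even though no annotator group is.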
VPL (Variational Preference Learning, arXiv:2408.10075) treats this as a latent-variable problem. Each user's preferences are generated from a latent context z, the user's hidden type, and the reward function is conditioned on z. A variational encoder, given a few preference annotations from a user, infers a posterior over z, and the reward model then makes user-specific predictions. The approach is principled: an evidence lower bound (ELBO) can be derived for latent-variable preference-based reward optimization.
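For reference, the bound has the usual variational form. The following is a sketch in notation assumed here rather than taken from the paper: \mathcal{D}_u is a hypothetical name for user u's N annotated comparisons (s_A^i, s_B^i, y^i), q_\psi is the encoder, p(z) is the prior, r_\phi is the latent-conditioned reward, and labels are assumed conditionally independent given z. The paper's exact objective (for example, any weighting on the KL term) may differ.

```latex
\begin{align}
p_\phi\big(y^i = 1 \mid s_A^i, s_B^i, z\big)
  &= \sigma\big(r_\phi(s_A^i, z) - r_\phi(s_B^i, z)\big), \\
\log p_\phi\big(y^{1:N} \mid s_A^{1:N}, s_B^{1:N}\big)
  &\ge \mathbb{E}_{z \sim q_\psi(z \mid \mathcal{D}_u)}
       \Big[\sum_{i=1}^{N} \log p_\phi\big(y^i \mid s_A^i, s_B^i, z\big)\Big]
     - D_{\mathrm{KL}}\big(q_\psi(z \mid \mathcal{D}_u) \,\|\, p(z)\big).
\end{align}
```

The first term rewards latents that explain the user's own comparisons; the KL term keeps the inferred posterior close to the prior.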
Two technical considerations emerge. First, binary comparisons inherently lack information about reward scale: they constrain only the difference r_A − r_B. Rewards inferred for different users may therefore end up with vastly different magnitudes, which destabilizes multi-user RL. Within the latent-variable framework, a simple pairwise classification scheme bounds and scales the reward estimates. Second, the variational structure provides predictive uncertainty over the user's latent, which supports uses such as active learning and abstention.
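The scale problem can be seen in a few lines. This is an illustrative sketch under assumptions, not the paper's procedure: the rewards and comparisons are made up, and `mean_win_probability` is a hypothetical stand-in for a bounded, classification-style score, not necessarily the scheme VPL uses.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def btl_log_likelihood(rewards, comparisons, scale=1.0):
    """Log-likelihood of (winner, loser) pairs under BTL, with rewards multiplied by `scale`."""
    return sum(
        math.log(sigmoid(scale * (rewards[winner] - rewards[loser])))
        for winner, loser in comparisons
    )


# Made-up rewards and a set of comparisons that agree with their ordering.
rewards = {"A": 1.0, "B": 0.0, "C": -1.0}
comparisons = [("A", "B"), ("B", "C"), ("A", "C")]

# 1. Only differences matter: adding a constant to every reward changes nothing.
shifted = {item: r + 5.0 for item, r in rewards.items()}
ll_base = btl_log_likelihood(rewards, comparisons)
ll_shifted = btl_log_likelihood(shifted, comparisons)

# 2. With consistent labels, a larger scale always fits better,
#    so the data alone do not pin down reward magnitudes.
ll_by_scale = [btl_log_likelihood(rewards, comparisons, s) for s in (1.0, 2.0, 4.0, 8.0)]


def mean_win_probability(item, rewards):
    """Hypothetical bounded score: average BTL win probability against the other items."""
    others = [other for other in rewards if other != item]
    return sum(sigmoid(rewards[item] - rewards[o]) for o in others) / len(others)


bounded = {item: mean_win_probability(item, rewards) for item in rewards}
print(ll_base, ll_shifted)
print(ll_by_scale)
print(bounded)
```

Because the bounded score is an average of probabilities, it lies strictly between 0 and 1 for every user, whatever magnitude that user's raw rewards drifted to.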
The conceptual move matters beyond reward models. The unimodal assumption is doing more work than it appears. Across many RLHF deployments, "preference data" silently aggregates conflicting utilities, and the resulting policy is systematically miscalibrated for every subgroup, not just one. The averaging is not a smoothing operation that degrades gracefully; it is an actively wrong specification that produces a policy nobody wants.
This connects directly to the note "Can text summaries beat embeddings for personalized reward models?": PLUS replaces the embedding latent with a text summary, achieving the same conditioning effect with added interpretability and portability. VPL is the variational baseline that PLUS improves upon by switching the latent representation from a vector to text.
Pluralistic alignment is not a refinement of RLHF; it is the correction of a categorically misspecified assumption.
Inquiring lines that use this note as a source (11)
This note is a source for these synthesized inquiries.
- How does RLHF reward structure incentivize agreement over accuracy?
- Can reward models be personalized if annotators lack stable preferences?
- What preference dimensions do base reward functions typically capture?
- What makes minority preferences disappear in aggregated single-distribution reward models?
- What makes preference distributions unimodal versus genuinely disagreement-heavy?
- How do aggregate reward models fail to capture minority user preferences?
- What unmeasured side channels emerge from RLHF preference optimization?
- Can user preferences be represented as linear reward combinations?
- Can reward models distinguish between personal preference and community consensus?
- Why does single-reward RLHF fail to represent diverse human preferences?
- How do aggregate reward models systematically exclude minority preferences?
Related concepts in this collection (3)
- Can text summaries beat embeddings for personalized reward models?
  When training reward models on diverse user preferences, does conditioning on learned text-based summaries of user preferences outperform embedding vectors? This matters because better representations could make personalization more interpretable and portable.
  PLUS replaces VPL's vector latent with a text summary; same conditioning principle, more interpretable representation
- Can reward models learn by comparing policies instead of judging them?
  What if reward models worked as policy discriminators, measuring distance to a target rather than encoding absolute preferences? Could this eliminate the need for manual preference labels and scale across domains?
  POLAR avoids the unimodal-preference problem by reframing reward as similarity-to-target rather than absolute-preference; orthogonal escape from the same trap
- Do reward models actually consider what the prompt asks?
  Exploring whether standard reward models evaluate responses based on prompt context or just response quality alone. This matters because if models ignore prompts, they'll fail to align with what users actually want.
  both papers diagnose hidden mis-specifications in standard RM training; VPL on user dimension, the decomposition paper on prompt dimension
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning
- Learning Pluralistic User Preferences through Reinforcement Learning Fine-tuned Summaries
- Capturing Individual Human Preferences with Reward Features
- Measuring Human Preferences in RLHF is a Social Science Problem
- Self-Improving Model Steering
- Enhancing personalized multi-turn dialogue with curiosity reward
- Reward Reasoning Model
- Reward-Robust RLHF in LLMs
Original note title
unimodal BTL reward models average across multi-modal preferences and produce policies that fail every subgroup