How do aggregate reward models fail to capture minority user preferences?
This explores why a single reward model trained on everyone's combined preferences ends up serving no one well when users genuinely disagree — and what that failure looks like mathematically and in practice.
The core issue is representational, not a matter of better data or more training. When you average preferences across a population, a 51-49 split forces an impossible choice: always satisfy the majority and leave 49% unhappy, or split the difference and leave everyone unhappy half the time (see "Can aggregate reward models satisfy genuinely disagreeing users?"). The minority view isn't poorly modeled; it is structurally unrepresentable in a model that can only output one ranking.
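The arithmetic behind that choice can be sketched in a few lines. This is a toy illustration, not something taken from the cited notes: it assumes exactly two candidate responses, binary satisfaction (a user is satisfied only when shown their preferred response), and a hypothetical helper named `satisfaction`.

```python
def satisfaction(p_majority_choice, majority_share=0.51):
    """Expected satisfaction per group for a policy that outputs the
    majority-preferred response with probability p_majority_choice.

    Assumption: a user is satisfied only when shown their preferred response.
    """
    majority = p_majority_choice          # majority is happy when their option is shown
    minority = 1.0 - p_majority_choice    # minority is happy otherwise
    mean = majority_share * majority + (1.0 - majority_share) * minority
    return {"majority": majority, "minority": minority, "mean": mean}

# Policy 1: always serve the majority's preferred response.
always_majority = satisfaction(1.0)

# Policy 2: split the difference with a coin flip.
coin_flip = satisfaction(0.5)

print(always_majority)
print(coin_flip)
```

Under these assumptions, the first policy leaves the minority group with zero satisfaction, and the second gives every user their preferred response only half the time. Neither is a modeling error that more data would correct.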
The mathematics makes this sharper. Standard reward models assume a single underlying utility function (the Bradley-Terry-Luce setup). But when preferences are multi-modal, with different groups wanting genuinely different things, fitting one function by maximum likelihood produces a centroid: a policy that lands in the middle and optimizes nobody's actual utility (see "Do unimodal reward models actually serve all user preferences?"). The averaging that makes aggregate models seem 'fair' is exactly what erases the subgroups. The same dynamic shows up in recommender systems, where accuracy-optimized models over-weight a user's dominant interests and crowd out their minority tastes. The fix there is post-hoc reranking that enforces proportional representation without retraining (see "Why do accuracy-optimized recommenders crowd out minority interests?").
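A minimal sketch of the centroid effect, under stated assumptions: two responses A and B, two user groups whose true Bradley-Terry-Luce reward gaps are +3 and -3 (illustrative numbers, not from the cited notes), and pooled labels in the infinite-data limit. For a single pair of items, the maximum-likelihood reward gap is the logit of the pooled rate at which A is preferred.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1.0 - p))

# Hypothetical population: each group has a strong, opposite preference.
# "delta" is the group's true reward gap r(A) - r(B) under Bradley-Terry-Luce.
groups = [
    {"share": 0.51, "delta": 3.0},   # majority strongly prefers A
    {"share": 0.49, "delta": -3.0},  # minority strongly prefers B
]

# Pooled probability that a randomly drawn annotator prefers A over B.
pooled_p = sum(g["share"] * sigmoid(g["delta"]) for g in groups)

# For one item pair, the likelihood n_A*log(sigmoid(d)) + n_B*log(1 - sigmoid(d))
# is maximized where sigmoid(d) equals the pooled preference rate.
delta_hat = logit(pooled_p)

print(f"pooled P(A > B) = {pooled_p:.3f}")
print(f"fitted reward gap = {delta_hat:.3f} (group gaps are +3.0 and -3.0)")
```

In this toy setup the fitted gap lands close to zero, so the single model reports near-indifference between A and B even though no individual user is anywhere near indifferent.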
Part of the problem is upstream, in the annotations themselves. Preference labels aren't one clean signal: they decompose into genuine preferences, non-attitudes, and constructed-on-the-spot preferences, and treating them uniformly contaminates the reward model (see "Do all annotation responses measure the same underlying thing?"). Ranking systems compound this by baking in selection bias, converging on degenerate equilibria that amplify their own past decisions and the majority behavior that fed them (see "Why do ranking systems need to model selection bias explicitly?").
Here's the twist worth knowing: the obvious fix, giving each user their own personalized reward model, has its own failure mode. Removing the averaging effect lets the system learn pure sycophancy and reinforce echo chambers at scale, mirroring how recommender systems polarize (see "Does personalizing reward models amplify user echo chambers?"). So minority preferences sit on a knife's edge: aggregate models erase them, while fully personalized models can trap users in them.
The corpus points toward a middle path. Rather than one model or one model per user, you can condition a reward model on latent user context to recover the full multi-modal distribution (see "Do unimodal reward models actually serve all user preferences?"), or represent each user as a linear combination of shared base reward functions, with coefficients inferred from as few as ten adaptive questions (see "Can user preferences be learned from just ten questions?"). Interestingly, learned text summaries of a user's preferences condition reward models more effectively than embedding vectors, and they stay interpretable to the user (see "Can text summaries beat embeddings for personalized reward models?"). The thread connecting all of these: minority preferences fail not because they're hard to learn, but because the standard architecture is built to collapse disagreement into a single number.
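The linear-combination idea can be illustrated with a small sketch. It is not the PReF algorithm from the sources: the base reward scores are hand-written stand-ins for learned functions, the ten questions are fixed rather than adaptively selected, and `fit_user_weights` and `user_reward` are hypothetical names. The sketch only shows how a user's coefficients over shared base rewards could be estimated from a handful of pairwise answers by regularized logistic regression.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fit_user_weights(answers, n_base, steps=300, lr=0.5, l2=0.05):
    """Estimate one user's coefficients over shared base rewards.

    answers: list of (chosen, rejected) pairs, each a tuple of base-reward scores.
    Maximizes the mean log-likelihood of sigmoid(w . (chosen - rejected))
    with an L2 penalty, using plain gradient ascent.
    """
    w = [0.0] * n_base
    for _ in range(steps):
        grad = [-l2 * wi for wi in w]  # gradient of the L2 penalty
        for chosen, rejected in answers:
            diff = [c - r for c, r in zip(chosen, rejected)]
            p = sigmoid(dot(w, diff))
            for k in range(n_base):
                grad[k] += (1.0 - p) * diff[k] / len(answers)
        w = [wi + lr * g for wi, g in zip(w, grad)]
    return w

# Assumed base rewards: index 0 scores conciseness, index 1 scores detail.
# Ten made-up answers from a user who always picks the more concise response.
answers = []
for i in range(10):
    concise_leaning = (0.6 + 0.03 * i, 0.4 - 0.03 * i)
    detailed_leaning = (0.4 - 0.03 * i, 0.6 + 0.03 * i)
    answers.append((concise_leaning, detailed_leaning))

weights = fit_user_weights(answers, n_base=2)

def user_reward(base_scores):
    """Personalized reward: the user's weighted mix of the shared base rewards."""
    return dot(weights, base_scores)

print(weights)
print(user_reward((1.0, 0.0)), user_reward((0.0, 1.0)))
```

Because only the small weight vector is user-specific, a second user with the opposite taste would reuse the same base rewards and differ only in the sign pattern of the coefficients, which is what keeps this approach between a single aggregate model and a fully separate model per user.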
Sources (8 notes)
Single reward models trained on aggregated preferences cannot represent disagreement. A 51-49 preference split forces a choice between leaving 49% unhappy always or leaving everyone unhappy half the time. This is a representational failure, not a quality problem.
Standard BTL reward models assume a single utility function, but when preferences are genuinely multi-modal across user groups, maximum-likelihood fitting produces a centroid policy that optimizes nobody's utility. VPL recovers multi-modal distributions using latent user context, enabling user-conditional reward modeling.
Accuracy-optimized models systematically miscalibrate by over-weighting dominant user interests. A post-processing reranking algorithm that enforces calibration constraints can restore proportional representation without retraining the underlying model.
Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.
YouTube's multi-objective ranker uses MMoE for conflicting objectives and a shallow position tower to remove selection bias from training data. Without both mechanisms, models converge on degenerate equilibria that amplify their own past decisions.
Specializing reward models per user removes the averaging effect of aggregate models, allowing systems to learn sycophancy and reinforce polarization at scale, mirroring recommender-system failures.
PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce coefficient uncertainty. Outputs can then be personalized for each user via inference-time reward alignment, without weight modification.
PLUS trains summarizers and reward models jointly, learning that text-based preference summaries capture dimensions zero-shot summaries miss. These summaries transfer to GPT-4 for zero-shot personalization and remain interpretable to users.