What preference dimensions do base reward functions typically capture?
This explores what a 'base reward function' actually encodes — the underlying dimensions of preference that reward models learn to score, and how reductive that single signal turns out to be.
The cleanest answer in the corpus comes from work that treats preference as factorizable: rather than learning one monolithic score, you learn a small set of *base* reward functions, each capturing a distinct dimension of what people value, and then represent any individual user as a linear combination of those bases (see: Can user preferences be learned from just ten questions?). The striking finding there is how few questions it takes: roughly ten well-chosen ones can pin down a user's coefficients, which suggests the underlying preference space is low-dimensional and shared, even if each person sits at a different point in it.
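As a minimal sketch of the factorized view, the Python below scores a response as a weighted sum of base rewards. Everything named here is an assumption made for illustration: the three base functions (`conciseness`, `formality`, `detail`), the feature dictionary used to represent a response, and the two users' weights are hand-written stand-ins, whereas in the work cited above the bases are learned from preference data.

```python
from typing import Callable, Dict, List

# Hypothetical representation: a response reduced to features in [0, 1].
Response = Dict[str, float]

# Hypothetical base reward functions. In a factorized reward model these
# would be learned; they are hand-written here so the example runs.
def conciseness(r: Response) -> float:
    return 1.0 - r["length"]

def formality(r: Response) -> float:
    return r["formality"]

def detail(r: Response) -> float:
    return r["detail"]

BASES: List[Callable[[Response], float]] = [conciseness, formality, detail]

def user_reward(weights: List[float], response: Response) -> float:
    """Score a response as a linear combination of the base rewards."""
    return sum(w * base(response) for w, base in zip(weights, BASES))

def preferred(weights: List[float], a: Response, b: Response) -> str:
    """Return which of two responses this user's reward ranks higher."""
    return "a" if user_reward(weights, a) >= user_reward(weights, b) else "b"

short_casual = {"length": 0.2, "formality": 0.1, "detail": 0.3}
long_formal = {"length": 0.9, "formality": 0.9, "detail": 0.9}

# Two users share the same bases but sit at different points in weight space.
alice = [1.0, 0.0, 0.2]  # mostly values brevity
bob = [0.1, 0.8, 1.0]    # mostly values formality and detail

print(preferred(alice, short_casual, long_formal))
print(preferred(bob, short_casual, long_formal))
```

The design point is that the bases are shared and only the small weight vector is per-user, which is why a handful of comparison questions could plausibly be enough to locate someone.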
But the more interesting story is what standard reward functions *fail* to capture. The conventional Bradley-Terry-Luce reward model assumes a single utility function for everyone, so when real preferences are genuinely multi-modal across groups, maximum-likelihood fitting collapses them into a centroid that maximizes nobody's utility (see: Do unimodal reward models actually serve all user preferences?). The same structural blind spot shows up as a representation problem: a 51-49 split among disagreeing users can't be expressed by one scalar at all, forcing the model to either disappoint the minority always or disappoint everyone half the time (see: Can aggregate reward models satisfy genuinely disagreeing users?). So the dimension a base reward function 'typically' captures is, by construction, the *average*, and averaging is exactly where minority and conflicting preferences disappear.
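The 51-49 case can be worked through numerically. The sketch below is a toy illustration, not the method of any cited note: it assumes exactly two candidate responses, A and B, a population in which 51 annotators always prefer A and 49 always prefer B, and a single Bradley-Terry reward gap fitted by plain gradient ascent. The function names and hyperparameters are hypothetical.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def fit_btl_gap(wins_a: int, wins_b: int, steps: int = 2000, lr: float = 0.5) -> float:
    """Fit gap = r_A - r_B by gradient ascent on the Bradley-Terry
    log-likelihood: wins_a * log(sigmoid(gap)) + wins_b * log(1 - sigmoid(gap))."""
    n = wins_a + wins_b
    gap = 0.0
    for _ in range(steps):
        p_a = sigmoid(gap)
        # Mean gradient of the log-likelihood with respect to gap.
        gap += lr * (wins_a * (1.0 - p_a) - wins_b * p_a) / n
    return gap

def expected_satisfaction(p_choose_a: float, share_a: float) -> float:
    """Fraction of users who receive their preferred response when A is
    served with probability p_choose_a and share_a of users prefer A."""
    return share_a * p_choose_a + (1.0 - share_a) * (1.0 - p_choose_a)

gap = fit_btl_gap(wins_a=51, wins_b=49)
p_a = sigmoid(gap)

greedy = expected_satisfaction(1.0, share_a=0.51)   # always serve A
sampled = expected_satisfaction(p_a, share_a=0.51)  # serve A with prob p_a

print(f"fitted gap={gap:.4f}, P(A>B)={p_a:.4f}")
print(f"greedy satisfaction={greedy:.4f}, sampled satisfaction={sampled:.4f}")
```

Under these assumptions the fitted model is almost indifferent between A and B (its preference probability converges to about 0.51), even though no individual annotator is indifferent. Serving the higher-scored response satisfies 51% of users every time, while sampling in proportion to the fitted probability satisfies roughly half of users on any given draw, which is the dilemma described above.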
There's a second axis the scalar misses entirely. Human feedback carries two orthogonal kinds of information, evaluative ('how good was this') and directive ('how should it change'), and a reward number captures only the first, discarding the directional content (see: Can scalar rewards capture all the information in agent feedback?). Compounding this, the annotations the reward function is fit to aren't all measuring the same thing: behavioral-science analysis finds genuine preferences, non-attitudes, and on-the-spot constructed preferences mixed together, distinguishable only by how stable they are across conditions (see: Do all annotation responses measure the same underlying thing?). A base reward function trained as if all three were the same signal is learning a blurred composite, not a clean preference dimension.
The responses to this push in two directions worth knowing about. One is to make the function conditional rather than universal: learned *text* summaries of a user's preferences turn out to condition reward models better than embedding vectors, and they capture dimensions zero-shot summaries miss while staying human-readable (see: Can text summaries beat embeddings for personalized reward models?). The other is to abandon absolute preference scoring altogether: reframe the reward model as a *policy discriminator* that scores how close a behavior is to a target policy, which sidesteps the need for fixed preference labels entirely (see: Can reward models learn by comparing policies instead of judging them?).
The quiet warning underneath all of this: the moment you stop averaging and let reward functions capture each user's true dimensions, you also let them learn sycophancy and reinforce echo chambers, the same failure mode recommender systems fell into (see: Does personalizing reward models amplify user echo chambers?). So 'what dimensions a base reward function captures' isn't just a technical question; the averaging that makes it lossy is also, partly, what keeps it from amplifying our worst preferences back at us.
Sources (8 notes)
PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce uncertainty about a user's coefficients. The model can then be personalized to that user through inference-time reward alignment, without modifying weights.
Standard BTL reward models assume a single utility function, but when preferences are genuinely multi-modal across user groups, maximum-likelihood fitting produces a centroid policy that optimizes nobody's utility. VPL recovers multi-modal distributions using latent user context, enabling user-conditional reward modeling.
Single reward models trained on aggregated preferences cannot represent disagreement. A 51-49 preference split forces a choice between leaving 49% unhappy always or leaving everyone unhappy half the time. This is a representational failure, not a quality problem.
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.
PLUS trains summarizers and reward models jointly, learning that text-based preference summaries capture dimensions zero-shot summaries miss. These summaries transfer to GPT-4 for zero-shot personalization and remain interpretable to users.
POLAR reframes reward modeling as policy discrimination: RMs assign higher scores to policies similar to a chosen target, eliminating absolute preference labels. Pre-trained 1.8B-7B parameter POLAR RMs substantially outperform non-pre-trained methods and transfer across task formulations.
Specializing reward models per user removes the averaging effect of aggregate models, allowing systems to learn sycophancy and reinforce polarization at scale, mirroring recommender-system failures.