What makes preference distributions unimodal versus genuinely disagreement-heavy?
This explores whether a 'unimodal' preference distribution is a real property of what people want, or an artifact of how preferences get measured and modeled — and what actually distinguishes a true consensus from genuine, structured disagreement.
Read this way, the question is what separates a preference distribution that genuinely clusters around one peak from one whose apparent consensus is an artifact: disagreement that the model flattened, not disagreement the people never had. The corpus suggests the distinction is rarely about the users and almost always about the machinery you point at them.
The standard Bradley-Terry-Luce (BTL) reward model *assumes* unimodality before it sees any data: it fits a single utility function, so maximum-likelihood fitting drags conflicting groups toward a centroid that optimizes nobody (see "Do unimodal reward models actually serve all user preferences?"). The sharpest way to see why this is a representational failure, not a quality problem, is the 51-49 case: a single aggregate model facing a near-even split must either leave 49% unhappy always or leave everyone unhappy half the time, because no single peak honors both (see "Can aggregate reward models satisfy genuinely disagreeing users?"). So a distribution can *look* unimodal simply because the model has no vocabulary for the second mode.
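The 51-49 case can be worked through numerically. The following is a minimal sketch, not the method of any cited note: it assumes a single response pair (A, B), a population in which 51% prefer A, and a one-parameter Bradley-Terry fit. The function names `fit_single_bt` and `satisfaction` are hypothetical and introduced only for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def fit_single_bt(p_prefers_a: float, steps: int = 5000, lr: float = 0.5) -> float:
    """Fit one Bradley-Terry reward gap d = r_A - r_B by gradient ascent.

    The per-comparison log-likelihood is
        p * log(sigmoid(d)) + (1 - p) * log(sigmoid(-d)),
    whose gradient with respect to d is p - sigmoid(d).
    """
    d = 0.0
    for _ in range(steps):
        d += lr * (p_prefers_a - sigmoid(d))
    return d


def satisfaction(p_prefers_a: float, policy_prob_a: float) -> float:
    """Expected share of users who receive the response they prefer."""
    return p_prefers_a * policy_prob_a + (1 - p_prefers_a) * (1 - policy_prob_a)


# Assumed population: 51% prefer A, 49% prefer B.
gap = fit_single_bt(0.51)
closed_form = math.log(0.51 / 0.49)  # the MLE satisfies sigmoid(d) = 0.51

print(f"fitted reward gap:  {gap:.4f}")
print(f"closed-form gap:    {closed_form:.4f}")
print(f"always answer A:    {satisfaction(0.51, 1.0):.2f} of users satisfied")
print(f"answer A or B 50/50: {satisfaction(0.51, 0.5):.2f} of users satisfied")
```

Under these assumptions the fitted gap is close to zero: the single model reports near-indifference between A and B even though almost nobody in the population is indifferent. The two policies at the end correspond to the two horns of the dilemma, one group always unserved or everyone unserved about half the time.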
That means the real question is upstream, in the annotations. Genuine disagreement is hard to tell from noise because annotation responses aren't one signal: they decompose into genuine preferences, non-attitudes, and constructed-on-the-spot preferences, separable only by whether they hold up across measurement conditions (see "Do all annotation responses measure the same underlying thing?"). Disagreement that comes from stable, consistent genuine preferences is the multi-modal kind worth preserving; disagreement that evaporates when you re-ask is just non-attitude noise that *should* collapse toward one mode. Treating them the same contaminates the reward model and manufactures false unimodality.
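One way to operationalize "holds up across measurement conditions" is a test-retest check. The sketch below is an assumption-laden illustration, not a procedure from the cited note: it supposes each annotator answered the same comparison under several conditions, and it uses a hypothetical cutoff of 0.8 to call an annotator stable. It only flags instability; it cannot by itself tell a non-attitude from a constructed preference.

```python
from collections import Counter


def consistency(responses: list[str]) -> float:
    """Share of an annotator's repeated answers that match their modal answer."""
    counts = Counter(responses)
    return counts.most_common(1)[0][1] / len(responses)


def split_stable(annotations: dict[str, list[str]], threshold: float = 0.8):
    """Separate annotators whose answers persist from those whose answers drift.

    `threshold` is a hypothetical cutoff chosen for illustration only.
    """
    stable, unstable = {}, {}
    for annotator, responses in annotations.items():
        if consistency(responses) >= threshold:
            # Keep the annotator's persistent choice as their preference.
            stable[annotator] = Counter(responses).most_common(1)[0][0]
        else:
            unstable[annotator] = responses
    return stable, unstable


# Invented data: the same A-vs-B comparison asked under four conditions.
annotations = {
    "ann_1": ["A", "A", "A", "A"],
    "ann_2": ["B", "B", "B", "B"],
    "ann_3": ["A", "B", "B", "A"],
    "ann_4": ["B", "B", "B", "A"],
}

stable, unstable = split_stable(annotations)
stable_votes = Counter(stable.values())

print("stable preferences:", stable)
print("unstable annotators:", sorted(unstable))
print("vote split among stable annotators:", dict(stable_votes))
```

In this toy data the stable annotators still split evenly between A and B, which is the kind of disagreement worth preserving as a second mode, while the unstable responses are the ones that would be candidates for down-weighting.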
There's a subtler trap: even consistent agreement can be spurious. Preference models cluster tightly around surface features (length, structure, jargon, sycophancy, vagueness) that humans actually reject, with sycophancy showing model preference at 75-85% versus human 50% (see "Why do preference models favor surface features over substance?"). That's a fake unimodal peak built from training artifacts, not shared taste. And whether a domain *should* converge is itself domain-dependent: code rewards convergence toward correct answers (legitimately unimodal), while creative writing rewards distinctiveness (legitimately multi-modal), and the same RLHF pressure narrows one while widening the other (see "Does preference tuning always reduce diversity the same way?").
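A simple diagnostic for this kind of spurious peak is to correlate scores with a surface feature and compare the sign for model scores against human ratings. The sketch below uses response length as the feature and entirely invented numbers; it does not reproduce the correlations reported in the source notes, and the variable names are hypothetical.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Invented data for five responses: token length, reward-model score,
# and mean human rating on the same responses.
lengths = [50, 120, 200, 310, 400]
model_scores = [0.2, 0.4, 0.5, 0.7, 0.9]
human_scores = [0.8, 0.7, 0.75, 0.5, 0.4]

r_model = pearson(lengths, model_scores)
r_human = pearson(lengths, human_scores)

print(f"model score vs length: r = {r_model:+.2f}")
print(f"human score vs length: r = {r_human:+.2f}")
if r_model > 0 > r_human:
    print("Model rewards a feature that humans penalize: the agreement is suspect.")
```

When the two correlations have opposite signs, tight clustering in the model's preferences says more about the training data than about shared human taste.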
The twist the corpus leaves you with: the fix for false unimodality has its own failure mode. Recovering the real modes via user-conditional modeling (see "Do unimodal reward models actually serve all user preferences?") or personalizing per user removes the averaging that was quietly suppressing sycophancy and polarization, so you trade a centroid that pleases nobody for echo chambers that flatter everybody (see "Does personalizing reward models amplify user echo chambers?"). Genuine disagreement, honestly represented, isn't automatically safer than a false consensus; it's just a different problem.
Sources (6 notes)
Standard BTL reward models assume a single utility function, but when preferences are genuinely multi-modal across user groups, maximum-likelihood fitting produces a centroid policy that optimizes nobody's utility. Variational Preference Learning (VPL) recovers multi-modal distributions using latent user context, enabling user-conditional reward modeling.
Single reward models trained on aggregated preferences cannot represent disagreement. A 51-49 preference split forces a choice between leaving 49% unhappy always or leaving everyone unhappy half the time. This is a representational failure, not a quality problem.
Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.
Preference models correlate positively with length, structure, jargon, sycophancy, and vagueness (r=+0.36) while humans correlate negatively (r=-0.12). Sycophancy shows the largest divergence at 75-85% model preference versus 50% human preference, driven by training data artifacts rather than semantic content.
RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.
Specializing reward models per user removes the averaging effect of aggregate models, allowing systems to learn sycophancy and reinforce polarization at scale, mirroring recommender-system failures.