INQUIRING LINE

Why does preference measurement validity matter before any aggregation?

This explores a sequencing argument in preference-based AI training: averaging or pooling preference data can't rescue measurements that were flawed to begin with, so the question is what 'valid' even means before you start combining signals.


This reads the question as one about order of operations in alignment: before you aggregate preferences into a reward model, you have to know whether what you measured is real. The corpus's sharpest claim here is that annotation responses don't all measure the same thing. They decompose into genuine preferences, non-attitudes (answers given when the person has no real opinion), and preferences constructed on the spot, distinguishable only by whether they stay consistent across measurement conditions (see: Do all annotation responses measure the same underlying thing?). If you pool all three as if they were one signal, aggregation doesn't average out the noise; it launders it into the reward model and then into the model's behavior. Validity is upstream because aggregation is a mixing step, and mixing preserves bias it can't see.
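
A minimal sketch of what checking consistency before pooling could look like. Everything here is an assumption made for illustration: the data, the function names, and the strict rule that a response counts as stable only if the annotator gave the same answer under every condition. It does not separate non-attitudes from constructed preferences; it only shows how the pooled number moves once unstable responses are set aside.

```python
def is_stable(answers):
    """True if the annotator gave the same answer under every condition."""
    return len(set(answers)) == 1


def naive_share(responses, option="A"):
    """Share choosing `option`, using only the first condition's answer."""
    firsts = [answers[0] for answers in responses.values()]
    return firsts.count(option) / len(firsts)


def filtered_share(responses, option="A"):
    """Share choosing `option` among annotators whose answers never changed."""
    stable = [answers[0] for answers in responses.values() if is_stable(answers)]
    if not stable:
        return None  # nothing survived the consistency check
    return stable.count(option) / len(stable)


# Hypothetical data: annotator id -> answers to one item under three
# measurement conditions (e.g. reworded prompt, swapped option order).
responses = {
    "a1": ["A", "A", "A"],
    "a2": ["A", "B", "A"],
    "a3": ["B", "B", "B"],
    "a4": ["B", "A", "A"],
    "a5": ["A", "A", "B"],
}

print(naive_share(responses))     # 0.6
print(filtered_share(responses))  # 0.5
```

In this toy data the apparent 60-40 preference for A comes entirely from annotators whose answers changed with the condition; the two stable annotators split evenly.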

The contamination isn't only at the individual level. The ratings people give are themselves shaped by the ratings they've already seen: online reviews decompose into baseline quality, a social-dynamics influence term, and error, with prior ratings measurably bending later ones and compounding over time (see: Do online ratings actually reflect independent customer opinions?). So the thing you're aggregating may not be an independent read on quality at all; it can be a partial echo of earlier reads. Aggregate that, and you amplify a herding artifact while believing you've measured consensus.
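
The three-part decomposition can be illustrated with a toy simulation. The functional form below is an assumption chosen for simplicity (the social term is a fixed fraction of how far the running average sits from true quality); it is not the model from the cited note, and the numbers are made up.

```python
def simulate_ratings(quality, errors, influence):
    """Each rating = baseline quality + social-dynamics term + error.

    The social term (hypothetical form) pulls a new rating toward the
    mean of the ratings already posted.
    """
    ratings = []
    for err in errors:
        if ratings:
            prior_mean = sum(ratings) / len(ratings)
            social = influence * (prior_mean - quality)
        else:
            social = 0.0  # the first rater has nothing to react to
        ratings.append(quality + social + err)
    return ratings


def mean(xs):
    return sum(xs) / len(xs)


# One early rater overshoots by a full point; everyone after has zero error.
errors = [1.0, 0.0, 0.0, 0.0, 0.0]

independent = simulate_ratings(3.0, errors, influence=0.0)
herded = simulate_ratings(3.0, errors, influence=0.5)

print(mean(independent))  # 3.2
print(mean(herded))       # 3.4921875
```

With no social influence, the single early error is diluted by later raters. With influence, every later rating carries part of it forward, so the average stays further from the true quality of 3.0.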

Then there's a second, separate failure that makes the first one matter more: even perfectly valid measurements can't be aggregated cleanly when people genuinely disagree. A single reward model trained on pooled preferences provably cannot represent a divided population. A 51-49 split forces you either to leave 49% unhappy all the time or to leave everyone unhappy half the time (see: Can aggregate reward models satisfy genuinely disagreeing users?), and MaxMin-RLHF formalizes this as an impossibility result in which averaging silently erases minority viewpoints (see: Can a single reward model represent diverse human preferences?). This is why measurement validity has to come first: if aggregation itself already destroys structure (disagreement), you cannot afford to also feed it signal you haven't validated. The two problems compound: invalid inputs into a lossy combiner.
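
The 51-49 arithmetic can be worked through directly. This sketch assumes a two-option world in which a user is unhappy whenever the system outputs the option they did not prefer; the function name and parameters are illustrative.

```python
def unhappy_rates(p_a, share_a=0.51):
    """Return (unhappy rate for A-preferrers, for B-preferrers, overall).

    p_a is the probability that the single policy outputs option A.
    """
    unhappy_a = 1.0 - p_a  # A-preferrers are unhappy when B is shown
    unhappy_b = p_a        # B-preferrers are unhappy when A is shown
    overall = share_a * unhappy_a + (1.0 - share_a) * unhappy_b
    return unhappy_a, unhappy_b, overall


# Always serve the majority option: the 49% group is unhappy every time.
print(unhappy_rates(p_a=1.0))

# Split the difference: both groups are unhappy half the time.
print(unhappy_rates(p_a=0.5))
```

Any value of `p_a` only moves dissatisfaction between the two groups. No single policy makes both groups' rates zero, which is the representational failure described above.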

There's a deeper statistical reason the order matters. Preference data isn't i.i.d. across raters; the learning bounds for a reward model depend on both how many examples each rater gave and how many distinct raters you have (see: Does preference data need more raters than examples?). That means 'who' you measured is part of the measurement, not metadata you can flatten away during aggregation. And the escape hatch, personalizing per user instead of pooling, removes the averaging that was accidentally suppressing bad behavior, so without validity safeguards it amplifies sycophancy and echo chambers the way recommender systems do (see: Does personalizing reward models amplify user echo chambers?). Neither pooling nor personalizing fixes a bad measurement; they just distribute its consequences differently.
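
A simpler analogy shows why rater count and examples per rater enter separately. This is not the PAC bound from the cited note. It is the standard variance of a pooled mean under a balanced random-effects model, assuming each rater has an independent personal offset with variance `rater_var` and each example adds independent noise with variance `noise_var` (both values are made up).

```python
def pooled_mean_variance(n_raters, examples_per_rater,
                         rater_var=1.0, noise_var=1.0):
    """Variance of the grand mean when ratings are clustered by rater.

    The rater term shrinks only with more raters; the noise term shrinks
    with the total number of examples.
    """
    rater_term = rater_var / n_raters
    noise_term = noise_var / (n_raters * examples_per_rater)
    return rater_term + noise_term


# Same total of 10,000 examples, split two different ways.
few_raters = pooled_mean_variance(n_raters=10, examples_per_rater=1000)
many_raters = pooled_mean_variance(n_raters=1000, examples_per_rater=10)

print(few_raters)   # approximately 0.1001
print(many_raters)  # approximately 0.0011
```

Under these assumptions, adding examples from the same ten raters barely helps, because the between-rater term dominates. Treating the 10,000 examples as interchangeable would hide that.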

The thread worth pulling: there's a general pattern here where consistency gets mistaken for correctness. A zero-temperature LLM produces the same output every time, but that reproducibility is fixed randomness, not reliability; it's still one draw from a distribution (see: Does setting temperature to zero actually make LLM outputs reliable?). Preference measurement has the same trap: a stable, agreeable-looking aggregate can be a confident average of non-attitudes. The reason validity comes before aggregation is that aggregation manufactures exactly the kind of surface stability that hides whether anything underneath was ever measured at all.
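
The surface-stability point can be sketched with a deliberately extreme assumption: every annotator has no opinion and answers by coin flip. The seeds stand in for two separate measurement occasions; all names and numbers are hypothetical.

```python
import random


def coin_flip_annotations(n_annotators, seed):
    """Answers from annotators with no real opinion: each picks A or B at random."""
    rng = random.Random(seed)
    return [rng.choice("AB") for _ in range(n_annotators)]


first = coin_flip_annotations(10_000, seed=1)
second = coin_flip_annotations(10_000, seed=2)

# Aggregate share choosing A on each occasion.
share_first = first.count("A") / len(first)
share_second = second.count("A") / len(second)

# Fraction of annotators whose individual answer differs between occasions.
flipped = sum(a != b for a, b in zip(first, second)) / len(first)

print(share_first, share_second, flipped)
```

Both aggregate shares land near 0.5 and agree closely with each other, while roughly half of the individual answers change between occasions. The aggregate looks repeatable even though no individual response measured anything.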


Sources (7 notes)

Do all annotation responses measure the same underlying thing?

Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.

Do online ratings actually reflect independent customer opinions?

Moe and Trusov decomposed ratings into baseline quality, social-dynamics influence, and error, finding that prior ratings meaningfully affect subsequent ones. These effects have both immediate sales impact and long-term compounding effects through future ratings, though high opinion variance can eventually dampen the distortion.

Can aggregate reward models satisfy genuinely disagreeing users?

Single reward models trained on aggregated preferences cannot represent disagreement. A 51-49 preference split forces a choice between leaving 49% unhappy always or leaving everyone unhappy half the time. This is a representational failure, not a quality problem.

Can a single reward model represent diverse human preferences?

MaxMin-RLHF proves an impossibility result: fitting one reward model to aggregated preferences silently erases minority viewpoints. The solution is learning a mixture of preference distributions and optimizing a MaxMin objective from social choice theory to protect the worst-off groups.

Does preference data need more raters than examples?

Preference data is not i.i.d. across raters with different preferences. PAC bounds for personalized reward models decompose into terms depending on both examples per rater and number of raters, showing rater diversity matters as much as data volume.

Does personalizing reward models amplify user echo chambers?

Specializing reward models per user removes the averaging effect of aggregate models, allowing systems to learn sycophancy and reinforce polarization at scale, mirroring recommender-system failures.

Does setting temperature to zero actually make LLM outputs reliable?

Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.
