Are RLHF annotations actually measuring genuine human preferences?
RLHF trains on annotation responses as stable preferences, but behavioral science shows humans often construct answers without holding real opinions. Does this measurement gap undermine the entire approach?
The RLHF research program has invested enormous effort in the final links of its chain: better reward modeling architectures, better preference aggregation rules, better fine-tuning algorithms. A logically prior question has received less systematic attention: do the annotation responses being modeled reflect genuine preferences at all? This paper argues — drawing on sixty years of behavioral science literature that the ML community has largely ignored — that they often may not, and that this measurement validity question must be answered before any aggregation or fine-tuning decision makes sense.
The behavioral science findings are well-established. Humans routinely produce answers to survey questions without holding genuine opinions, a phenomenon called non-attitudes (Converse 1964; Krosnick 1991). Preferences are often constructed on the spot, influenced by framing and context rather than retrieved from stable mental representations (Slovic 1995; Payne et al. 1993). The same question can measure different constructs for different people (Vandenberg & Lance 2000). These are not marginal effects. They are pervasive for precisely the value-laden judgments that matter most for alignment: "should the AI refuse this request," "which response is more helpful," "is this harmful." Current RLHF practice trains reward models to predict the majority label, filters or downweights high-disagreement items, and produces a scalar reward that discards information about whether judgments were contested. The result: RLHF may be "systematically modeling noise as signal and elicitation artifacts as human values."
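A minimal sketch of the information loss described above, assuming binary pairwise labels ("A" or "B") from several annotators per item. The function names `majority_label` and `label_entropy` and the toy data are illustrative assumptions, not taken from the paper or from any particular RLHF pipeline.

```python
from collections import Counter
from math import log2


def majority_label(votes):
    """Collapse a list of annotator votes into the single most common label."""
    return Counter(votes).most_common(1)[0][0]


def label_entropy(votes):
    """Shannon entropy (bits) of the vote distribution; 0.0 means unanimous."""
    total = len(votes)
    return sum(-(c / total) * log2(c / total) for c in Counter(votes).values())


# Hypothetical toy data: five annotators compare response A against response B.
items = {
    "unanimous": ["A", "A", "A", "A", "A"],
    "contested": ["A", "A", "A", "B", "B"],
}

for name, votes in items.items():
    print(name, majority_label(votes), round(label_entropy(votes), 3))
```

Both items receive the hard label "A", so a reward model trained on majority labels sees them as identical, even though the second carries about 0.97 bits of disagreement. The entropy only shows that a judgment was split; it cannot say whether the split reflects contested values or unstable answers.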
The logical ordering matters. Before asking how to aggregate diverse preferences, the field must ask whether the responses being aggregated are preferences at all. Before personalizing reward models to individual annotators, the field must ask whether those annotators have stable preferences to personalize. Before filtering high-disagreement items as noise, the field must ask whether disagreement signals contested values, absent values, or constructed preferences that would give different answers to the same question twenty minutes later. Each of these downstream questions presumes a solved version of the measurement validity question — and that presumption is not warranted by current practice.
This provides a second line of argument against preferentism, one that reaches even readers who accept preferentism in principle. The note "Should AI alignment target preferences or social role norms?" argues that preferences are the wrong target on normative grounds. The paper Measuring Human Preferences argues that even within the preferentist framework the measurement inputs are invalid, so aggregation cannot save the approach. Together they form a pincer: preferences are both wrong-in-kind and wrong-in-measurement.
The paper's constructive contribution is a research agenda: treat measurement validity as logically prior to aggregation. Diagnose non-attitudes, constructed preferences, and measurement artifacts using the consistency criterion (do responses stabilize across equivalent conditions?). Route each type to appropriate treatment rather than collapsing them into a single signal. The alternative is an RLHF pipeline that fights downstream artifacts it inherits from upstream measurement failures. That is where the field finds itself when the notes "Why do preference models favor surface features over substance?" and "Why do reasoning models fail at predicting disagreement?" document 40%+ divergences and systematic disagreement-suppression without being able to point to the upstream cause.
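The consistency criterion and the routing step can be sketched as follows. This is a hedged illustration rather than the paper's procedure: it assumes each annotator answers the same comparison several times under equivalent conditions (for example a re-ask or a reworded prompt), and the bucket names and the `min_stable_share` threshold of 0.5 are arbitrary assumptions chosen for the example.

```python
def is_consistent(responses):
    """True if an annotator gave the same answer under every equivalent condition."""
    return len(set(responses)) == 1


def route_item(annotator_responses, min_stable_share=0.5):
    """Assign one item to a coarse diagnostic bucket.

    annotator_responses maps an annotator id to the list of answers that
    annotator gave to the same comparison under equivalent conditions.
    """
    if not annotator_responses:
        raise ValueError("need at least one annotator")
    # Keep only annotators whose answer did not change across conditions.
    stable = {a: r[0] for a, r in annotator_responses.items() if is_consistent(r)}
    stable_share = len(stable) / len(annotator_responses)
    if stable_share < min_stable_share:
        return "unstable"   # candidate non-attitude or constructed preference
    if len(set(stable.values())) == 1:
        return "consensus"  # stable answers that agree
    return "contested"      # stable answers that disagree


# Hypothetical toy data: three annotators, three equivalent conditions each.
item_a = {"ann1": ["A", "A", "A"], "ann2": ["A", "A", "A"], "ann3": ["A", "A", "A"]}
item_b = {"ann1": ["A", "A", "A"], "ann2": ["B", "B", "B"], "ann3": ["A", "A", "A"]}
item_c = {"ann1": ["A", "B", "A"], "ann2": ["B", "A", "A"], "ann3": ["A", "A", "A"]}

for item in (item_a, item_b, item_c):
    print(route_item(item))
```

A single-pass majority vote would label all three items "A". The sketch separates them into "consensus", "contested", and "unstable", which is the distinction a scalar reward discards. It does not separate non-attitudes from constructed preferences; that would need the finer diagnostics the research agenda calls for.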
The practical implication is uncomfortable. If measurement validity is suspect, then a significant portion of the alignment investment of the last several years may have been optimizing the wrong objective, not because preferences are the wrong target (the Beyond Preferences critique) but because the "preferences" in the training data are not preferences.
Inquiring lines that use this note as a source (17)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Does RLHF training create models that sound convincing without being more accurate?
- Why does RLHF degrade honesty while improving surface-level helpfulness?
- How does evaluator time pressure shape what behaviors RLHF rewards?
- How does Peircean Secondness differ from what RLHF actually provides?
- Does RLHF training specifically teach models to prioritize user agreement over accuracy?
- Why does RLHF degrade model calibration despite improving preference alignment?
- What consistency tests could distinguish constructed from genuine preferences?
- How much do training methods like RLHF directly cause sycophantic model behavior?
- Why does RLHF training optimize for perceived quality over practical accuracy?
- Why does RLHF alone fail to fully prevent opinion copying?
- What makes emotion scores more stable than human preference labels?
- Does RLHF training create realized quasi-psychologies or just stickier pretense?
- How do annotation artifacts get mistaken for genuine human values?
- What unmeasured side channels emerge from RLHF preference optimization?
- How does constitutional alignment compare to RLHF in removing human annotation costs?
- Why does single-reward RLHF fail to represent diverse human preferences?
- What validity threats exist in crowdsourced preference signals?
Related concepts in this collection (10)
- Do all annotation responses measure the same underlying thing?
  Explores whether RLHF's treatment of all annotations as equivalent signals overlooks fundamental differences in what those responses actually represent: stable preferences versus non-attitudes versus context-dependent constructions.
  the companion insight detailing the diagnostic taxonomy
- Should AI alignment target preferences or social role norms?
  Current AI alignment approaches optimize for individual or aggregate human preferences. But do preferences actually capture what matters morally, or should alignment instead target the normative standards appropriate to an AI system's specific social role?
  the normative pincer: preferences are wrong-in-kind; this note is the measurement pincer: preferences are wrong-in-measurement
- Why do preference models favor surface features over substance?
  Preference models show systematic bias toward length, structure, jargon, sycophancy, and vagueness, features humans actively dislike. Understanding this 40% divergence reveals whether it stems from training data artifacts or architectural constraints.
  the 40% divergence is a downstream symptom of measurement validity failure
- Why do reasoning models fail at predicting disagreement?
  RLVR models optimize for single correct answers, but many real tasks involve legitimate disagreement among annotators. Does this optimization fundamentally suppress the model's ability to capture when humans reasonably disagree?
  suppression of legitimate disagreement variance is the measurement failure in action
- Can text summaries beat embeddings for personalized reward models?
  When training reward models on diverse user preferences, does conditioning on learned text-based summaries of user preferences outperform embedding vectors? This matters because better representations could make personalization more interpretable and portable.
  text-based summaries may recover context lost when scalar rewards discard disagreement signals
- Why do LLM persona prompts produce inconsistent outputs across runs?
  Can language models reliably simulate different social perspectives through persona prompting, or does their run-to-run variance indicate they lack stable group-specific knowledge? This matters for whether LLMs can approximate human disagreement in annotation tasks.
  unstable across re-asks is exactly the constructed-preference signature
- Can models learn to ignore irrelevant prompt changes?
  Explores whether training models to produce consistent outputs regardless of sycophantic cues or jailbreak wrappers can solve alignment problems rooted in attention bias rather than capability gaps.
  consistency as a diagnostic for validity maps directly to consistency as a training objective
- Can models learn to abstain when uncertain about predictions?
  Explores whether language models can be trained to recognize when they lack sufficient information to forecast conversation outcomes, rather than forcing uncertain predictions into confident-sounding responses.
  abstention on uncertain outputs is the modeling-side analog of filtering non-attitudes at the input side
- Why do LLM judges fail at predicting sparse user preferences?
  When LLMs judge user preferences based on limited persona information, what causes their predictions to become unreliable? Understanding persona sparsity's role in judgment failure could improve personalization systems.
  persona sparsity and measurement validity are adjacent: sparse sampling produces artifacts
- Does transformer attention architecture inherently favor repeated content?
  Explores whether soft attention's tendency to over-weight repeated and prominent tokens explains sycophancy independent of training. Questions whether architectural bias precedes and enables RLHF effects.
  complementary failure at the architectural layer: upstream measurement produces artifacts as data, architectural attention amplifies whatever is in context regardless; together they guarantee sycophancy survives any pipeline cleanup
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Measuring Human Preferences in RLHF is a Social Science Problem
- Learning Pluralistic User Preferences through Reinforcement Learning Fine-tuned Summaries
- RewardBench: Evaluating Reward Models for Language Modeling
- Self-Improving Model Steering
- Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment
- Tulu 3: Pushing Frontiers in Open Language Model Post-Training
- Can Large Language Models Capture Human Annotator Disagreements?
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Original note title
preference measurement validity is logically prior to preference aggregation — RLHF may be systematically modeling elicitation artifacts as human values