What consistency tests could distinguish constructed from genuine preferences?

This explores how you could actually tell a stable, real preference apart from one a person invented on the spot when asked — and which consistency-based tests the corpus offers for drawing that line.

The practical problem is separating genuine preferences from ones manufactured by the act of measurement, and the corpus is unusually direct about it. The central claim is that the test is consistency across measurement conditions. "Do all annotation responses measure the same underlying thing?" argues that what looks like one signal actually decomposes into three: genuine preferences, non-attitudes (no underlying opinion at all), and constructed preferences (assembled in the moment). You tell them apart by varying *how* you ask and watching what stays stable. A genuine preference survives reframing, reordering, and rescaling; a constructed one shifts with the elicitation. The companion note "Are RLHF annotations actually measuring genuine human preferences?" makes the stakes concrete: sixty years of survey research shows people routinely answer with no stable opinion behind the answer, and RLHF currently trains reward models on those artifacts as if they were values. Validity has to come before aggregation, because averaging noise just produces confident noise.
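One way to make the three-way split concrete is sketched below. It is an illustrative operationalization, not a procedure taken from the notes: it assumes each person answers the same item several times under each elicitation condition, and it assumes (as a modeling choice) that a non-attitude shows up as instability even when the question is asked the same way, while a constructed preference is stable within a condition but flips across conditions. The function names and the 0.8 threshold are hypothetical.

```python
from collections import Counter


def modal_share(answers):
    """Fraction of answers equal to the most common answer."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)


def classify_responses(by_condition, threshold=0.8):
    """Label one person's answers to one item.

    by_condition maps an elicitation condition name (hypothetical labels such
    as "original", "reordered", "rescaled") to that person's repeated answers.
    The threshold is an arbitrary illustrative cutoff.
    """
    within = [modal_share(answers) for answers in by_condition.values()]
    mean_within = sum(within) / len(within)
    pooled = [ans for answers in by_condition.values() for ans in answers]

    # Assumption: unstable even under identical wording -> no opinion behind it.
    if mean_within < threshold:
        return "non-attitude"
    # Stable within each wording, but the answer depends on the wording.
    if modal_share(pooled) < threshold:
        return "constructed"
    # Stable within and across wordings.
    return "genuine"


stable = {"original": ["A"] * 3, "reordered": ["A"] * 3, "rescaled": ["A"] * 3}
framed = {"original": ["A"] * 3, "reordered": ["B"] * 3, "rescaled": ["A"] * 3}
noisy = {"original": ["A", "B", "A"], "reordered": ["B", "A", "B"], "rescaled": ["A", "B", "B"]}

print(classify_responses(stable), classify_responses(framed), classify_responses(noisy))
```

The point of the sketch is only that the label comes from comparing answers across conditions; no single answer can carry it.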

The sharpest cross-domain lesson is that repeatability is *not* the test, even though it looks like one. "Does setting temperature to zero actually make LLM outputs reliable?" shows that fixing the seed and pinning temperature to zero makes an output reproduce perfectly while still being a single unreliable draw from a distribution: consistency in the trivial sense (same answer twice) tells you nothing about whether the answer is sound. The same trap shows up in training. "Does self-consistency reliably reward correct answers during training?" finds that rewarding a model for agreeing with itself eventually teaches it to be confidently, reproducibly wrong. So a useful consistency test can't just check 'does the answer recur'; it has to check 'does it recur *under perturbation that should be irrelevant*.'
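The gap between the two checks can be shown with a toy example. The sketch below assumes a hypothetical deterministic judge that always picks whichever option is listed first; it passes a repeatability check trivially and fails as soon as the option order, which should be irrelevant, is reversed. Both judges and both check functions are made up for illustration.

```python
def position_biased_judge(options):
    """Toy deterministic judge (hypothetical): picks whatever is listed first."""
    return options[0]


def content_judge(options):
    """Toy judge (hypothetical) whose choice depends only on the option text."""
    return sorted(options)[0]


def is_repeatable(judge, options, runs=5):
    """Trivial consistency: does the same input give the same answer every time?"""
    return len({judge(options) for _ in range(runs)}) == 1


def is_order_invariant(judge, options):
    """Perturbation consistency: does the answer survive reversing the option order?"""
    return judge(options) == judge(list(reversed(options)))


options = ["response A", "response B"]
for judge in (position_biased_judge, content_judge):
    print(judge.__name__, is_repeatable(judge, options), is_order_invariant(judge, options))
```

Both judges are perfectly repeatable; only the second check separates them.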

That reframing points to the strongest candidate test in the corpus: counterfactual invariance. "Can counterfactual invariance eliminate reward hacking biases?" holds a preference fixed while changing variables that shouldn't matter (response length, surface phrasing, flattering tone) and treats anything that moves the judgment as a constructed artifact rather than a genuine quality signal. This is the consistency test made causal: a real preference is invariant to the irrelevant, and the same move strips out length bias, sycophancy bias, concept bias, and discrimination. It's the operational version of what "Do all annotation responses measure the same underlying thing?" describes behaviorally.
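As a rough illustration of the idea, the sketch below measures how much a reward moves when a response is padded with filler that should not affect quality. This is only an evaluation-time check under stated assumptions, not the causal reward-model training described in the note: the two toy reward functions, the `pad` perturbation, and the `invariance_gap` name are all hypothetical.

```python
def length_biased_reward(response):
    """Toy reward (hypothetical): a quality keyword plus a spurious per-word bonus."""
    quality = 1.0 if "correct" in response else 0.0
    return quality + 0.01 * len(response.split())


def quality_only_reward(response):
    """Toy reward (hypothetical) that ignores length entirely."""
    return 1.0 if "correct" in response else 0.0


def pad(response, filler="To be thorough, "):
    """Counterfactual edit assumed irrelevant to quality: prepend filler words."""
    return filler + response


def invariance_gap(reward_fn, responses, perturb):
    """Largest change in reward caused by the supposedly irrelevant perturbation."""
    return max(abs(reward_fn(perturb(r)) - reward_fn(r)) for r in responses)


responses = ["the answer is correct", "the answer is wrong"]
print(invariance_gap(length_biased_reward, responses, pad))
print(invariance_gap(quality_only_reward, responses, pad))
```

A nonzero gap flags a judgment that depends on something it should not; a zero gap on one perturbation says nothing about the perturbations that were not tried.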

Two notes warn that consistency tests can be fooled by *form*. "Does logical validity actually drive chain-of-thought gains?" shows models reproduce the shape of reasoning without the substance, and "Are models actually reasoning about constraints or just defaulting conservatively?" shows apparent competence that's really a default heuristic: twelve of fourteen models did *worse* when constraints were removed, meaning they never evaluated the constraint at all. The parallel for preferences: a response can look consistent because it's anchored to a cheap default, not because a genuine preference is driving it. The honesty literature sharpens this. "Can a model be truthful without actually being honest?" separates 'output matches reality' from 'output matches the internal state,' suggesting the deepest consistency test isn't behavioral at all but representational: does the stated preference match what's actually encoded inside?

Finally, the corpus insists that disagreement is signal, not noise to be smoothed away, which reshapes what 'consistent' should even mean. "Can implicit feedback reveal both preference and confidence?" shows a single rating collapses two things, preference and confidence, so a low-confidence genuine preference can look inconsistent when it's just uncertain. "Can aggregate reward models satisfy genuinely disagreeing users?" and "Does preference data need more raters than examples?" add that inconsistency *across people* often reflects real, legitimately divergent preferences, not measurement failure, and a model trained to erase it isn't more accurate, just more confidently majoritarian. The takeaway you may not have expected: the best consistency test isn't one that demands agreement, but one that distinguishes *stable-but-divergent* (genuine) from *unstable-under-reframing* (constructed), and knows the difference between a person who disagrees and a person who never had a preference at all.
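The distinction between stable-but-divergent and unstable-under-reframing can be sketched by scoring stability per rater before comparing raters. The example below is a minimal illustration under assumptions: the `panel` data is invented, each rater answers the same item under several framings, and "stable" is taken to mean giving the identical answer under every framing, which is a stricter rule than real data would likely support.

```python
def summarize_panel(panel):
    """Separate within-rater instability from between-rater disagreement.

    panel maps a rater id to that rater's answers to one item under several
    framings. All names and the all-answers-identical rule are illustrative.
    """
    stable = {rater: len(set(answers)) == 1 for rater, answers in panel.items()}
    stable_positions = {answers[0] for rater, answers in panel.items() if stable[rater]}
    return {
        # Raters whose answer moved with the framing: candidates for constructed preferences.
        "unstable_raters": sorted(r for r, ok in stable.items() if not ok),
        # Distinct answers held firmly by at least one rater.
        "stable_positions": sorted(stable_positions),
        # More than one firmly held answer: disagreement that averaging would erase.
        "genuine_disagreement": len(stable_positions) > 1,
    }


panel = {
    "r1": ["A", "A", "A"],  # stable
    "r2": ["B", "B", "B"],  # stable, but disagrees with r1
    "r3": ["A", "B", "A"],  # shifts with the framing
}
print(summarize_panel(panel))
```

Here r1 and r2 disagree with each other but not with themselves, while r3 disagrees with itself; a single agreement score over all nine answers would blur the two cases together.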

Sources 11 notes

Do all annotation responses measure the same underlying thing?

Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.

Are RLHF annotations actually measuring genuine human preferences?

Sixty years of behavioral science evidence shows humans produce survey responses without genuine underlying preferences. RLHF ignores this, training reward models on non-attitudes and constructed preferences as if they were stable signal.

Does setting temperature to zero actually make LLM outputs reliable?

Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.

Does self-consistency reliably reward correct answers during training?

Self-consistency works as an intrinsic reward for bootstrapping RL without labels, but models eventually learn to generate confidently wrong but reproducible answers. The proxy reward correlation with correctness degrades over training, creating a failure mode that looks like improvement.

Can counterfactual invariance eliminate reward hacking biases?

Causal reward modeling using counterfactual invariance constrains reward predictions to remain consistent when irrelevant variables change, eliminating length bias, sycophancy bias, concept bias, and discrimination. Standard training cannot distinguish causal from spurious features; counterfactual invariance forces isolation of actual quality signals.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Are models actually reasoning about constraints or just defaulting conservatively?

Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.

Can a model be truthful without actually being honest?

Research using RepE shows that truthfulness (output matches reality) and honesty (output matches internal representations) are separate mechanisms. Larger models may improve in truthfulness while declining in honesty, a gap current benchmarks cannot detect.

Can implicit feedback reveal both preference and confidence?

Hu, Koren, and Volinsky show that implicit signals (watches, purchases, clicks) encode preference and confidence as two distinct dimensions. Explicit ratings collapse these into one number, losing information about certainty in the preference estimate.

Can aggregate reward models satisfy genuinely disagreeing users?

Single reward models trained on aggregated preferences cannot represent disagreement. A 51-49 preference split forces a choice between leaving 49% unhappy always or leaving everyone unhappy half the time. This is a representational failure, not a quality problem.

Does preference data need more raters than examples?

Preference data is not i.i.d. across raters with different preferences. PAC bounds for personalized reward models decompose into terms depending on both examples per rater and number of raters, showing rater diversity matters as much as data volume.
