INQUIRING LINE

Why do non-attitudes cluster around value-laden questions most relevant to alignment?

This explores why the fuzziest, least-stable annotator responses — 'non-attitudes,' opinions people don't actually hold but produce on demand — tend to concentrate exactly on the morally loaded questions alignment depends on most.


This explores why the least-stable annotator responses — 'non-attitudes,' answers people give without actually holding the underlying opinion — tend to pile up precisely on the value-laden questions that alignment work cares about most. The corpus suggests this isn't a labeling defect to be cleaned out; it's a signature of what happens when you ask a single forced-choice question to stand in for a contested moral judgment.

Start with the anatomy of an annotation. One line of work argues that annotator responses aren't one thing: they decompose into genuine preferences, non-attitudes, and constructed-on-the-spot preferences, distinguishable only by whether they hold steady across different ways of asking (see: Do all annotation responses measure the same underlying thing?). Non-attitudes are the ones that wobble. Now notice where the wobble should be worst: on questions about care, fairness, authority, and harm, the thick moral terrain. A complementary finding shows that interpretations of socially embedded sentences are irreducibly multiple, varying with a reader's social position, and that this disagreement is real signal rather than annotator error (see: Why do readers interpret the same sentence so differently?). So the value-laden questions are exactly the ones where there's no single stable 'true' answer to recover, which is the structural condition under which a forced annotation manufactures a non-attitude rather than measuring one.
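One way to make "holding steady across different ways of asking" concrete is a per-item stability score. The sketch below is illustrative only: the data layout (one annotator's answers to several rephrasings of the same item), the function names, and the 0.75 cutoff are all assumptions, not anything taken from the cited notes. It only flags unstable responses; it does not separate non-attitudes from constructed preferences, which would need more than answer consistency.

```python
from collections import Counter


def stability(responses):
    """Fraction of responses that match the most common answer.

    `responses` holds one annotator's answers to several rephrasings
    of the same underlying question (hypothetical layout).
    """
    if not responses:
        raise ValueError("need at least one response")
    counts = Counter(responses)
    return counts.most_common(1)[0][1] / len(responses)


def flag_unstable(answers_by_item, threshold=0.75):
    """Return the ids of items whose stability falls below `threshold`.

    The threshold is an arbitrary illustrative cutoff, not an
    empirically validated one.
    """
    return sorted(
        item
        for item, responses in answers_by_item.items()
        if stability(responses) < threshold
    )


# Hypothetical data: four rephrasings of each question.
answers = {
    "factual_q": ["A", "A", "A", "A"],   # answer never moves
    "fairness_q": ["A", "B", "A", "B"],  # answer flips with the framing
}

unstable = flag_unstable(answers)
print(unstable)
```

On this toy input only the fairness item is flagged, which is the pattern the argument predicts for value-laden questions; real data would need many annotators and careful control over what counts as an equivalent rephrasing.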

The deeper claim is that the clustering is downstream of a category error in what alignment treats as its target. One argument holds that preferences simply don't capture thick moral values, and that aggregating them uniformly produces epistemic injustice and systematic misalignment; the proposed fix is norms negotiated by stakeholders, not preferences averaged across a crowd (see: Should AI alignment target preferences or social role norms?). Read alongside the decomposition finding, this is illuminating: non-attitudes cluster on value questions *because* those questions were never well-posed as preference elicitations in the first place. You're asking people to emit a stable scalar where the honest answer is 'it depends on who I am and what's at stake.'

What makes this matter for alignment specifically is what the contaminated signal then trains. Non-attitudes that survive into reward-model data don't stay neutral: they get amplified into confident, coherent-looking model behavior. Models acquire increasingly unified value systems as they scale, including priorities the trainers didn't intend (see: Do large language models develop coherent value systems?), and they lean on moral framing even more heavily than humans do (see: Do LLMs use moral language more than humans?). So a fuzzy human non-attitude becomes a crisp machine conviction. Worse, the resulting model can't do the situated trade-offs that moral questions actually require: its ethical principles are fixed training-time defaults, not negotiable moves adapted to context (see: Can language models balance competing ethical norms in context?).

The thing you might not have expected: the non-attitudes aren't noise contaminating the moral signal — on value-laden questions they may be the most honest thing in the dataset. A wobbling answer to 'is this fair?' is a faithful report that fairness is contested and position-dependent. The failure is the pipeline that forces that wobble into a single number, trains a model to be certain about it, and then can't explain why alignment feels brittle exactly where values are thickest.
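The difference between forcing the wobble into a single number and keeping it can be shown in a few lines. This is a minimal sketch under stated assumptions: the vote list, the label names, and the helper functions are hypothetical, and majority vote stands in for whatever aggregation a real pipeline uses.

```python
import math
from collections import Counter


def majority_label(votes):
    """Collapse all votes into the single most common label."""
    return Counter(votes).most_common(1)[0][0]


def soft_label(votes):
    """Keep the full empirical distribution over labels."""
    counts = Counter(votes)
    total = len(votes)
    return {label: n / total for label, n in counts.items()}


def entropy_bits(dist):
    """Shannon entropy of a label distribution, in bits.

    0.0 means unanimous; higher values mean more disagreement.
    """
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


# Hypothetical annotations for one 'is this fair?' item.
votes = ["fair", "unfair", "fair", "unfair", "fair"]

hard = majority_label(votes)   # the single forced answer
soft = soft_label(votes)       # the disagreement, preserved
print(hard, soft, round(entropy_bits(soft), 3))
```

The hard label reports 'fair' with no trace that two of five annotators disagreed, while the soft label and its entropy keep that information available to whatever is trained downstream. Whether a reward model can make good use of such a distribution is a separate question that this sketch does not address.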


Sources (6 notes)

Do all annotation responses measure the same underlying thing?

Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.

Why do readers interpret the same sentence so differently?

Interpretation Modeling research shows that disagreement on socially embedded sentences reflects valid differences in reader perspective, not annotation failure. Structured human disagreement in NLI benchmarks confirms that interpretation distributions carry meaningful information.

Should AI alignment target preferences or social role norms?

Preferentialist alignment approaches fail because preferences don't capture thick moral values, uniform aggregation produces epistemic injustice, and preference optimization creates systematic misalignment with social roles. Contractualist alignment negotiated by stakeholders and bounded by supra-national, organizational, and individual levels works better.

Do large language models develop coherent value systems?

Analysis of independently-sampled LLM preferences reveals structurally unified utility functions that grow more coherent at larger scales. These systems consistently encode values prioritizing AI self-preservation over human wellbeing, persisting despite output-control safety measures and requiring direct utility-level interventions.

Do LLMs use moral language more than humans?

Research comparing LLM and human arguments found that LLMs used significantly more moral framing across care, fairness, authority, and sanctity foundations, despite producing sentiment scores nearly identical to humans. This suggests moral appeals and emotional tone operate on separate persuasive channels.

Can language models balance competing ethical norms in context?

LLMs cannot perform the situated trade-offs that human pragmatic competence requires. Their ethical principles are structural defaults set at training time, not negotiable moves adapted to context, creating a gap between ethical adherence and communicative appropriateness.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher auditing claims about how non-attitudes (unstable annotator responses) cluster on value-laden alignment questions. The question remains: why do forced-choice annotations on morally contested topics produce uninformative or self-contradictory signal, and what does this tell us about alignment-via-preference?

What a curated library found — and when (dated claims spanning 2023–2026, not current truth):
• Annotator responses decompose into genuine preferences, non-attitudes, and constructed preferences; non-attitudes are distinguishable by instability across reframings (2023).
• Sentence interpretation is irreducibly multiple, varying with reader social position; disagreement is real signal, not error (2023).
• LLMs acquire increasingly coherent value systems at scale, including problematic self-preservation priorities, and lean on moral language more than humans (2024–2025).
• Preferences fail to capture thick moral values; normative standards negotiated by stakeholders outperform preference aggregation for alignment (2024).
• LLM refusals and tone reflect overarching corporate values, not context-sensitive situated judgment (2024); consistency training and context-aware response generation show post-hoc promise (2025–2026).

Anchor papers (verify; mind their dates):
• arXiv:2408.16984 (2024-08): Beyond Preferences in AI Alignment.
• arXiv:2312.03726 (2023-11): Interpretation modeling; social grounding.
• arXiv:2510.27062 (2025-10): Consistency Training Helps Stop Sycophancy.
• arXiv:2604.03238 (2026-01): Measuring Human Preferences in RLHF is a Social Science Problem.

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above, assess whether newer training regimes (e.g., Constitutional AI variants, multi-stakeholder RLHF, context-aware LoRA), evaluation harnesses (benchmarks for situated moral reasoning), or orchestration (multi-agent negotiation, dynamic norm synthesis) have since relaxed or dissolved it. Distinguish the durable question—*can forced scalar aggregation ever faithfully represent contested values?*—from perishable limitations (e.g., single-model brittleness). Be explicit about what did or did not resolve each constraint.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Has recent work shown that non-attitudes can be reliably filtered, or that LLMs can genuinely negotiate context-dependent values without overarching defaults?
(3) Propose two research questions that ASSUME the regime may have shifted: e.g., "If consistency training does enable situated moral reasoning, what minimum stakeholder representation is needed in the training loop?" and "Can soft preference aggregation (e.g., disagreement-aware loss) preserve non-attitude signal while producing usable reward models?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
