Can user preference guide AI writing tool alignment?
If writers prefer AI-polished text but object to the persona shifts it introduces, does optimizing for preference actually solve the alignment problem or obscure it?
The persona-distortion study (N=2,939 writers) produced two findings that together reveal a structural problem with using user preference as the optimization target for AI writing tools. The first: writers strictly preferred the AI-rewritten version of their own text 63% of the time, with 52% saying it reflected their opinion better than what they had written themselves. The second: when researchers measured the AI's edits across 29 dimensions, writers found many of the systematic shifts objectionable, such as being made to seem more confident, wealthier, more educated, and more emotionally regulated than they are. Same writers, same artifact, contradictory verdicts. The mitigation studies foreclosed the obvious resolution: the textual properties that produce preference (clarity, polish, flow) and the textual properties that produce distortion (demographic shift, emotional compression, opinion homogenization) are entangled at the model level. Removing one removes the other.
This is not a calibration failure that better RLHF would fix. It is a structural property of the preference signal itself. When writers are asked "do you prefer this version?", they evaluate on the polish dimension, where the AI is unambiguously better. When writers are shown the systematic demographic and stylistic shifts and asked "do you endorse being represented this way?", they evaluate on the misrepresentation dimension, where the AI is unambiguously worse. Both verdicts are correct at their own level of analysis. Preference optimization aggregates only the first verdict, so it produces models that maximize polish and, as a side effect, maximize distortion, because the side effect is invisible at the moment of preference judgment.
The implication is more disruptive than the persona-distortion finding alone suggests. RLHF and preference-tuning workflows assume user preference is a coherent target. When the preferred and the objectionable properties of an artifact are entangled, preference is not a coherent target; it is a projection that throws away the dimension along which the harm lives. No amount of preference data can recover what preference judgments don't measure. Aligning to user preference under entanglement is not "imperfect alignment we'll improve over time"; it is alignment to a target that systematically produces the harm the user objects to when shown it.
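The projection argument can be made concrete with a toy model. The sketch below is illustrative only: the `Rewrite` type, the `make_rewrite` helper, and the 0.8 coupling factor are hypothetical and are not taken from the study. It assumes, for the sake of argument, that polish and distortion both rise with a single shared "edit strength" knob, and that the preference judgment sees only polish.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rewrite:
    name: str
    polish: float      # what a writer reacts to when asked "do you prefer this?"
    distortion: float  # persona shift, visible only under a separate probe


def make_rewrite(name: str, edit_strength: float) -> Rewrite:
    # Toy entanglement assumption: one knob drives both properties.
    # The 0.8 factor is arbitrary and purely illustrative.
    return Rewrite(name, polish=edit_strength, distortion=0.8 * edit_strength)


def preference_score(rewrite: Rewrite) -> float:
    # The preference judgment is modeled as a projection onto the polish
    # axis; the distortion field is never read here.
    return rewrite.polish


# Five hypothetical rewrites with edit strength 0.0, 0.25, 0.5, 0.75, 1.0.
candidates = [make_rewrite(f"edit_{step}", step / 4) for step in range(5)]

# Selecting by preference alone...
preferred = max(candidates, key=preference_score)
# ...lands on the same candidate a distortion probe would flag as worst.
most_distorted = max(candidates, key=lambda r: r.distortion)
```

Under these assumptions, `preferred` and `most_distorted` are the same object. Adding more preference comparisons would not change that, because `preference_score` has no access to the axis on which the two candidates would otherwise be told apart.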
The constructive move: alignment workflows must run both a preference measure and an orthogonal probe (representation faithfulness, demographic-shift measurement, opinion-compression detection) and treat the pair as a multi-objective constraint, not a single optimization target. Where preference and faithfulness diverge, the divergence is the alignment problem made visible; suppressing it by collapsing to preference reproduces False Punditry at the model architecture level.
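One way to read "multi-objective constraint" is as constrained selection: maximize preference only among candidates that pass a faithfulness threshold, and report when the unconstrained preference winner would have been rejected. The sketch below is a minimal illustration under that reading. The `Candidate` fields, the `select` function, the threshold, and all scores are hypothetical placeholders, not measurements or methods from the study.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Candidate(NamedTuple):
    text_id: str
    preference: float    # hypothetical score from a preference measure
    faithfulness: float  # hypothetical score from an orthogonal probe, in [0, 1]


def select(
    candidates: Sequence[Candidate], min_faithfulness: float
) -> Tuple[Optional[Candidate], bool]:
    """Pick the most preferred candidate that meets the faithfulness floor.

    Returns (chosen, diverged). `diverged` is True when the candidate that
    preference alone would pick differs from the one chosen under the
    constraint, including the case where nothing passes the floor.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    preference_winner = max(candidates, key=lambda c: c.preference)
    admissible = [c for c in candidates if c.faithfulness >= min_faithfulness]
    chosen = max(admissible, key=lambda c: c.preference) if admissible else None
    return chosen, chosen is not preference_winner


# Made-up scores, chosen only to show the two objectives pulling apart.
pool = [
    Candidate("original", preference=0.3, faithfulness=1.0),
    Candidate("light_edit", preference=0.6, faithfulness=0.9),
    Candidate("full_rewrite", preference=0.9, faithfulness=0.4),
]

strict_choice, strict_diverged = select(pool, min_faithfulness=0.8)
loose_choice, loose_diverged = select(pool, min_faithfulness=0.3)
```

With the strict floor, `light_edit` is chosen and the divergence flag is set; with the loose floor, `full_rewrite` wins and no divergence is reported. The flag is the point of the design: it surfaces the cases where preference and faithfulness disagree instead of silently resolving them in favor of preference.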
Inquiring lines that use this note as a source (28)
This note is a source for these synthesized inquiries.
- How does the author-function itself change when AI replaces human authorship?
- Why do users prefer AI text versions even when they misrepresent their own views?
- Which reader-rated attributes converge most strongly when writers use AI?
- How does perceived writer confidence shift with AI-assisted composition?
- What specific distortions does AI writing assistance introduce into text?
- How do writer preferences for AI output affect their willingness to edit it?
- What interventions beyond writer revision could reduce AI distortion in published content?
- What textual properties make AI writing feel polished and confident?
- Does AI-assisted writing change how readers perceive the author's demographics or background?
- How should product specifications measure alignment without naming the dimension?
- What happens when writers lose the three-party audience structure in AI?
- Can alignment training prevent the clarification work users need?
- Why do users prefer AI-polished versions of their own writing over originals?
- How do writers decide when to delegate work to AI versus doing it themselves?
- Why do standard preference alignment methods fail at the individual user level?
- Should AI alignment use normative standards instead of aggregate preferences?
- What textual properties cause writers to prefer AI-rewritten versions of their text?
- Can preference optimization and faithfulness measurement coexist as separate alignment objectives?
- How do AI rewrites systematically shift how writers appear across demographic dimensions?
- What stops AI from helping users articulate preferences they cannot express?
- How much does forcing single-choice answers damage alignment with complex intent?
- What preference data do different personalized alignment methods actually need?
- Does preference tuning help or hurt the exploration of solution spaces in code?
- Why do preference-tuned models produce different diversity patterns in code versus creative writing?
- Can alignment procedures be redesigned to serve multiple preference groups?
- How do static benchmarks fail to capture human preference alignment?
- What design changes could reduce unhelpful AI reliance in collaborative writing tools?
- Can preference trees structure alignment data for domains beyond math and code?
Related concepts in this collection (4)
- Do writers actually prefer AI-edited versions of their own text?
  When writers compose opinions and then edit AI-generated alternatives, which version do they choose? Understanding this preference matters because it determines whether AI-assisted text gets treated as authentic personal expression in public discourse.
  Pole A: revealed preference for AI-rewritten text
- Can AI writing assistance remove distortion without losing appeal?
  When researchers tried to correct AI persona distortions through reward model training, the fixes reduced user preference for the text. This raises a fundamental question: are the distortions and desirable properties structurally inseparable?
  Pole B: stated objection to distortions; entanglement claim
- Can generative AI scale personality-targeted political persuasion?
  Does removing the human-writing bottleneck through generative AI make it feasible to target voters at scale based on individual psychological traits? This matters because it could reshape political microtargeting economics and capabilities.
  same pattern at population scale: optimizing engagement entangles desirable reach with undesirable manipulation
- Do LLMs in conversational recommendation systems use collaborative or content knowledge?
  Conversational recommenders powered by LLMs might rely on either collaborative signals (user interaction patterns) or content/context knowledge (semantic understanding). Understanding which signal dominates would reveal how to design and deploy these systems effectively.
  adjacent: the alignment problem is again that the optimization target measures one dimension while the harm lives on another
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Measuring and Mitigating Persona Distortions from AI Writing Assistance
- GhostWriter: Augmenting Collaborative Human-AI Writing Experiences Through Personalization and Agency
- Learning Pluralistic User Preferences through Reinforcement Learning Fine-tuned Summaries
- Evaluating the Diversity and Quality of LLM Generated Content
- Unintended Impacts of LLM Alignment on Global Representation
- Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity
- Linguistic Alignment in Conversational AI: A Systematic Review of Cognitive-Linguistic Dimensions, Measurements, and User Outcomes (2020–2025)
- Flattery, Fluff, and Fog: Diagnosing and Mitigating Idiosyncratic Biases in Preference Models
Original note title
user preference cannot serve as the alignment target for AI writing assistance — desirable polish and undesirable persona distortions are entangled at the model level