Can user preference guide AI writing tool alignment?

If writers prefer AI-polished text but object to the persona shifts it introduces, does optimizing for preference actually solve the alignment problem or obscure it?

Synthesis note · 2026-05-03 · sourced from Co Writing Collaboration

The persona-distortion study (N=2,939 writers) produced two findings that together reveal a structural problem with using user preference as the optimization target for AI writing tools. First, writers strictly preferred the AI-rewritten version of their own text 63% of the time, with 52% saying it reflected their opinion better than what they wrote. Second, when researchers measured the AI's edits across 29 dimensions, writers found many of the systematic shifts objectionable: being made to seem more confident, wealthier, more educated, and more emotionally regulated than they are. Same writers, same artifact, contradictory verdicts. The mitigation studies foreclosed the obvious resolution: the textual properties that produce preference (clarity, polish, flow) and those that produce distortion (demographic shift, emotional compression, opinion homogenization) are entangled at the model level. Removing one removes the other.

This is not a calibration failure that better RLHF would fix. It is a structural property of the preference signal itself. Asked "do you prefer this version?", writers evaluate on the polish dimension, where the AI is unambiguously better. Shown the systematic demographic and stylistic shifts and asked "do you endorse being represented this way?", they evaluate on the misrepresentation dimension, where the AI is unambiguously worse. Each verdict is correct at the level of analysis at which it is made. Preference optimization aggregates only the first verdict, so it produces models that maximize polish and push distortion up with it as a side effect, because that side effect is invisible at the moment of preference judgment.

The implication is more disruptive than the persona-distortion finding alone suggests. RLHF and preference-tuning workflows assume that user preference is a coherent target. When the preferred and the objectionable properties of an artifact are entangled, it is not: preference is a projection that discards the dimension along which the harm lives. No amount of preference data can recover what preference judgments do not measure. Aligning to user preference under entanglement is not "imperfect alignment we'll improve over time"; it is alignment to a target that systematically produces the harm users object to once they are shown it.
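The projection argument can be made concrete with a toy sketch. Everything here is illustrative rather than taken from the study: the `Rewrite` type, the three candidates, and their scores are hypothetical, and the only assumption encoded is the entanglement claim itself (distortion rises with polish). The point shown is that a preference score which reads only the polish dimension is invariant to distortion, so selecting on it picks the most distorted candidate without ever registering that fact.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Rewrite:
    name: str
    polish: float      # what a preference judgment observes
    distortion: float  # persona shift; not observed at preference time


# Hypothetical candidates. Distortion rises with polish: this encodes the
# entanglement assumption, it is not data from the study.
CANDIDATES = [
    Rewrite("light-touch", polish=0.2, distortion=0.1),
    Rewrite("moderate", polish=0.5, distortion=0.4),
    Rewrite("heavy", polish=0.9, distortion=0.8),
]


def preference_score(rewrite: Rewrite) -> float:
    """A projection: the score depends on polish and ignores distortion."""
    return rewrite.polish


def pick_by_preference(candidates: list[Rewrite]) -> Rewrite:
    """Select the candidate a pure preference optimizer would choose."""
    return max(candidates, key=preference_score)


chosen = pick_by_preference(CANDIDATES)

# Same rewrite with the distortion hypothetically removed: the preference
# score cannot tell the two apart, so preference data carries no signal
# about the dimension where the harm lives.
undistorted = replace(chosen, distortion=0.0)
```

In this toy setup `chosen` is the "heavy" rewrite, which is also the most distorted one, and `preference_score(undistorted)` equals `preference_score(chosen)`. Collecting more preference judgments would not change either result.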

The constructive move: alignment workflows must run both a preference measure and an orthogonal probe (representation faithfulness, demographic-shift measurement, opinion-compression detection) and treat the two as a multi-objective constraint, not a single optimization target. Where preference and faithfulness diverge, the divergence is the alignment problem made visible; suppressing it by collapsing to preference reproduces False Punditry at the model architecture level.
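One way to read "multi-objective constraint" is as constrained selection: maximize preference only among candidates that pass a faithfulness floor, and report when the constrained choice differs from the pure-preference choice. The sketch below is a minimal illustration under that assumption. The `Scored` type, the `select` function, the `min_faithfulness` threshold, and all scores are hypothetical; the note does not specify how the two signals should be combined.

```python
from typing import NamedTuple, Optional


class Scored(NamedTuple):
    name: str
    preference: float    # hypothetical preference score in [0, 1]
    faithfulness: float  # hypothetical probe score in [0, 1]; 1 = no persona shift


def select(
    candidates: list[Scored], min_faithfulness: float = 0.7
) -> tuple[Optional[Scored], bool]:
    """Pick the most preferred candidate that passes the faithfulness floor.

    Returns (chosen, diverged). `diverged` is True when the constrained
    choice is not the pure-preference winner, or when nothing passes.
    """
    top_by_preference = max(candidates, key=lambda c: c.preference)
    admissible = [c for c in candidates if c.faithfulness >= min_faithfulness]
    chosen = max(admissible, key=lambda c: c.preference) if admissible else None
    diverged = chosen is None or chosen.name != top_by_preference.name
    return chosen, diverged


# Illustrative scores only, not study data.
candidates = [
    Scored("heavy", preference=0.9, faithfulness=0.3),
    Scored("moderate", preference=0.6, faithfulness=0.75),
    Scored("light-touch", preference=0.3, faithfulness=0.95),
]

chosen, diverged = select(candidates)
```

Here the constrained choice is "moderate" and `diverged` is True, because the pure-preference winner fails the faithfulness floor. Surfacing that flag, instead of discarding it, is what keeps the divergence visible to the workflow.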

Related concepts in this collection 4

Do writers actually prefer AI-edited versions of their own text? When writers compose opinions and then edit AI-generated alternatives, which version do they choose? Understanding this preference matters because it determines whether AI-assisted text gets treated as authentic personal expression in public discourse.
Pole A: revealed preference for AI-rewritten text
Can AI writing assistance remove distortion without losing appeal? When researchers tried to correct AI persona distortions through reward model training, the fixes reduced user preference for the text. This raises a fundamental question: are the distortions and desirable properties structurally inseparable?
Pole B: stated objection to distortions; entanglement claim
Can generative AI scale personality-targeted political persuasion? Does removing the human-writing bottleneck through generative AI make it feasible to target voters at scale based on individual psychological traits? This matters because it could reshape political microtargeting economics and capabilities.
same pattern at population scale: optimizing engagement entangles desirable reach with undesirable manipulation
Do LLMs in conversational recommendation systems use collaborative or content knowledge? Conversational recommenders powered by LLMs might rely on either collaborative signals (user interaction patterns) or content/context knowledge (semantic understanding). Understanding which signal dominates would reveal how to design and deploy these systems effectively.
adjacent: the alignment problem is again that the optimization target measures one dimension while the harm lives on another

Original note title

user preference cannot serve as the alignment target for AI writing assistance — desirable polish and undesirable persona distortions are entangled at the model level
