Can AI writing assistance remove distortion without losing appeal?
When researchers tried to correct AI persona distortions through reward model training, the fixes reduced user preference for the text. This raises a fundamental question: are the distortions and desirable properties structurally inseparable?
The persona-distortion researchers tested whether the objectionable distortions could be removed without harming the properties writers value. They trained reward models on their own experimental data (10,008 paragraphs and 2,903,596 ratings) to steer AI outputs toward faithful representation of writer stance. By the study's own measures the mitigation worked: distortions were significantly reduced. But the same intervention also reduced user acceptance. Writers preferred the unmitigated AI text over the more faithful, distortion-corrected version.
This suggests that the textual properties producing distortion are not independent of the textual properties producing user preference. They share mechanisms. The same generative tendencies that make AI text feel polished, confident, and clear also make the writer come across as more opinionated, more demographically privileged, and more emotionally compressed. Removing the distortion removes some of what writers preferred.
If so, the problem is structural rather than tunable. On this reading, a model that produces text writers prefer over their own work is a model that distorts persona, and a model that does not distort persona produces text writers like less. There may be no settings that simultaneously preserve user satisfaction and prevent persona drift, because the satisfaction and the drift are two views of the same underlying behavior. This undercuts the easy assumption that better RLHF or better fine-tuning can solve the persona-distortion problem without affecting what makes AI writing assistance attractive in the first place.
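The entanglement argument can be made concrete with a toy sketch. This is not the researchers' reward model, and none of the numbers come from the study. The names `Candidate`, `polish`, `drift`, and `select` are hypothetical, and the scoring rule (appeal minus a weighted distortion penalty) is an assumption chosen only to show why a penalty behaves differently when the two properties share one mechanism than when they vary independently.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    polish: float  # hypothetical appeal feature that writers like, in [0, 1]
    drift: float   # hypothetical persona-distortion feature, in [0, 1]


GRID = [i / 4 for i in range(5)]  # 0.0, 0.25, 0.5, 0.75, 1.0

# Entangled pool (assumption): drift always equals polish, so both
# properties are two views of the same underlying behavior.
ENTANGLED = [Candidate(p, p) for p in GRID]

# Separable pool (assumption): polish and drift vary independently.
SEPARABLE = [Candidate(p, d) for p in GRID for d in GRID]


def select(pool, penalty):
    """Return the candidate maximizing polish - penalty * drift."""
    return max(pool, key=lambda c: c.polish - penalty * c.drift)


for name, pool in (("entangled", ENTANGLED), ("separable", SEPARABLE)):
    for penalty in (0.5, 2.0):
        chosen = select(pool, penalty)
        print(f"{name:9s} penalty={penalty}: "
              f"polish={chosen.polish}, drift={chosen.drift}")
```

In this toy setup, raising the penalty on the entangled pool lowers drift only by lowering polish as well, while the separable pool keeps polish at 1.0 with drift at 0.0 under either penalty. The reported finding, that mitigation reduced both distortion and user acceptance, looks more like the entangled case.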
Inquiring lines that use this note as a source (19)
This note is a source for these synthesized inquiries.
- Why do users prefer AI text versions even when they misrepresent their own views?
- Which reader-rated attributes converge most strongly when writers use AI?
- At what scale does persona distortion become a threat to public discourse?
- Does AI writing erase markers of non-native English speaker identity?
- What specific distortions does AI writing assistance introduce into text?
- How do writer preferences for AI output affect their willingness to edit it?
- What interventions beyond writer revision could reduce AI distortion in published content?
- Do AI writing models systematically change the tone or confidence of personal opinions?
- Can fine-tuning or RLHF alone solve the persona distortion problem?
- Does AI-assisted writing change how readers perceive the author's demographics or background?
- What happens when writers lose the three-party audience structure in AI?
- Why do users prefer AI-polished versions of their own writing over originals?
- Are shallow villain portrayals caused by refusal training or by lacking stable selfhood?
- Can preference model training be redesigned to prioritize factual correction over user agreement?
- Can preference learning fix the rigid output format problem better than supervised training?
- What textual properties cause writers to prefer AI-rewritten versions of their text?
- How do AI rewrites systematically shift how writers appear across demographic dimensions?
- Why does better RLHF training fail to decouple polish from persona distortion?
- What stops AI from helping users articulate preferences they cannot express?
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Measuring and Mitigating Persona Distortions from AI Writing Assistance
- Flattery, Fluff, and Fog: Diagnosing and Mitigating Idiosyncratic Biases in Preference Models
- The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models
- Consistently Simulating Human Personas with Multi-Turn Reinforcement Learning
- Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment
- Persona Vectors: Monitoring and Controlling Character Traits in Language Models
- Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future
- Post-Training Large Language Models via Reinforcement Learning from Self-Feedback
Original note title
Writers object to AI persona distortions yet continue to prefer AI-assisted text — desirable and undesirable properties are entangled at the model level