SYNTHESIS NOTE
Conversational AI and Personalization · Psychology, Society, and Alignment

Can training user simulators reduce persona drift in dialogue?

Explores whether inverting typical RL setups—training the simulated user for consistency rather than the task agent—can measurably reduce persona drift and improve experimental reliability in dialogue research.

Synthesis note · 2026-02-22 · sourced from Conversation Agents

Prior work on persona-consistent dialogue treats user simulators as fixed environments against which task agents are trained. This paper inverts the setup: fix the task agent, and train the user simulator for consistency. The shift matters because unreliable user simulation distorts experimental results, introduces noise into policy learning, and misrepresents the humans being simulated.

Three complementary metrics capture distinct types of persona drift: local drift (within a turn), global drift (across the conversation), and factual drift (contradiction of established facts). Computing these metrics with an LLM-as-a-Judge and using them as reward signals for multi-turn RL reduces inconsistency by over 55%.
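As a rough illustration of how three drift measurements could be folded into one reward, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Judge` signature, the way each score is operationalized (per-turn checks against the persona, pairwise checks between turns, per-fact checks), the equal weighting, and `toy_judge`, which is a keyword stand-in for a real LLM call. It is not the paper's implementation.

```python
from itertools import combinations
from typing import Callable, Dict, Optional, Sequence

# Hypothetical judge interface: True when `utterance` is consistent with `reference`.
# In practice this would wrap an LLM-as-a-Judge prompt.
Judge = Callable[[str, str], bool]


def fraction_consistent(checks: Sequence[bool]) -> float:
    """Share of checks that passed; an empty list counts as fully consistent."""
    return sum(checks) / len(checks) if checks else 1.0


def consistency_scores(
    persona: str, facts: Sequence[str], user_turns: Sequence[str], judge: Judge
) -> Dict[str, float]:
    """Three illustrative scores in [0, 1]; higher means less drift."""
    # Local (assumed): each simulated-user turn checked on its own against the persona.
    local = fraction_consistent([judge(persona, t) for t in user_turns])
    # Global (assumed): every pair of user turns checked against each other.
    global_ = fraction_consistent(
        [judge(a, b) for a, b in combinations(user_turns, 2)]
    )
    # Factual (assumed): a fact counts as kept only if no turn contradicts it.
    factual = fraction_consistent(
        [all(judge(f, t) for t in user_turns) for f in facts]
    )
    return {"local": local, "global": global_, "factual": factual}


def reward(scores: Dict[str, float], weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted sum of the scores; equal weights unless told otherwise."""
    if weights is None:
        weights = {name: 1.0 / len(scores) for name in scores}
    return sum(weights[name] * scores[name] for name in scores)


def toy_judge(reference: str, utterance: str) -> bool:
    """Keyword stand-in for an LLM judge: flags sudden cheerfulness in a depressed persona."""
    return not ("depressed" in reference.lower() and "feel great" in utterance.lower())


persona = "You are a depressed patient who has felt low for months."
facts = ["The patient is depressed.", "The patient lives alone."]
turns = [
    "I have not slept well in weeks.",
    "Nothing seems worth doing.",
    "Actually I feel great now!",
]
scores = consistency_scores(persona, facts, turns, toy_judge)
print(scores, reward(scores))
```

Keeping the three scores separate until the final weighted sum means any one of them can be logged, reweighted, or dropped without touching the others.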

The persona drift problem is specific and well-documented: an LLM simulating a depressed patient may be "instantly cured" after a single conversational turn, or a simulated high-school student may suddenly demonstrate postgraduate-level reasoning. These are not edge cases — they are systematic consequences of RLHF training that "pushes LLMs to be helpful and harmless, thus adopting overly cheerful personas" that conflict with simulating depressed, disagreeable, or confused users.

Building on the related note "Why does supervised learning fail to enforce persona consistency?", this paper extends the argument from offline RL to online multi-turn RL. The key advance: instead of relying on human-annotated contradiction labels, it uses LLM-as-a-Judge for scalable automatic evaluation that can serve as a continuous training signal.
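One way to picture a judge acting as a continuous training signal in a multi-turn setting: it scores every simulated-user turn given the conversation so far, and those per-turn scores are turned into discounted returns that a policy-gradient method could consume. The Python sketch below shows only that bookkeeping. The `judge_score` interface, the toy scorer, and the discount factor `gamma` are hypothetical choices for illustration, and the note does not say which RL algorithm the paper uses.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: scores one user turn in [0, 1] given the earlier user turns.
JudgeScore = Callable[[Sequence[str], str], float]


def per_turn_rewards(user_turns: Sequence[str], judge_score: JudgeScore) -> List[float]:
    """Ask the judge about each turn, showing it only the turns that came before."""
    return [judge_score(user_turns[:i], turn) for i, turn in enumerate(user_turns)]


def discounted_returns(rewards: Sequence[float], gamma: float = 0.95) -> List[float]:
    """Return-to-go for every turn: G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running
        returns[i] = running
    return returns


def toy_judge_score(history: Sequence[str], turn: str) -> float:
    """Keyword stand-in for an LLM judge: penalizes a sudden 'cure'."""
    return 0.0 if "feel great" in turn.lower() else 1.0


turns = [
    "I have not slept well in weeks.",
    "Nothing seems worth doing.",
    "Actually I feel great now!",
]
rewards = per_turn_rewards(turns, toy_judge_score)
print(rewards)
print(discounted_returns(rewards, gamma=0.5))
```

Because returns are computed backwards from the end of the conversation, a drift late in the dialogue lowers the return credited to earlier turns as well, which is the sense in which the signal is multi-turn rather than per-utterance.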

The three-metric decomposition also refines the understanding of drift. It is not a single phenomenon but at least three distinct failure types that can be measured and corrected independently.


Original note title

multi-turn rl for persona consistency reduces drift by 55 percent by treating simulated users as trainable agents rather than fixed environments