What are the three distinct types of persona drift in dialogue systems?
Read literally, the question asks whether there is a real taxonomy of three distinct ways an AI persona 'drifts' in conversation. The corpus does name one, though it sits inside a paper about fixing drift rather than cataloging it.
The three-way split comes from the work on training user simulators with reinforcement learning (see "Can training user simulators reduce persona drift in dialogue?"), which separates three failure modes by pairing each with its own reward signal: **local drift** (a model contradicting itself within a single turn or between adjacent lines), **global drift** (the persona slowly sliding away from itself across the whole conversation), and **factual contradictions** (the model asserting something that conflicts with established persona facts). The split matters because each one needs a different yardstick: line-to-line consistency catches local drift, prompt-to-line consistency catches global drift, and question-and-answer consistency catches factual breaks. A single 'consistency score' would blur all three.
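To make the three yardsticks concrete, here is a minimal sketch of scoring them separately. It is not the paper's implementation: the function names, the `Judge` signature, and the keyword-based `toy_judge` are all assumptions made for illustration, and a real system would likely use a learned or LLM-based consistency judge instead.

```python
from typing import Callable, Dict, Sequence

# Hypothetical judge signature: 1.0 if `statement` is consistent with
# `reference`, 0.0 otherwise. This is an assumption, not the paper's API.
Judge = Callable[[str, str], float]

# Illustrative conflict list; the toy judge knows nothing beyond it.
CONFLICTS = [("vegetarian", "steak")]


def toy_judge(reference: str, statement: str) -> float:
    """Flag a contradiction when a known conflicting pair spans both texts."""
    ref, stmt = reference.lower(), statement.lower()
    for a, b in CONFLICTS:
        if (a in ref and b in stmt) or (b in ref and a in stmt):
            return 0.0
    return 1.0


def line_to_line(lines: Sequence[str], judge: Judge) -> float:
    """Local drift: score each line against the line just before it."""
    pairs = list(zip(lines, lines[1:]))
    if not pairs:
        return 1.0
    return sum(judge(prev, cur) for prev, cur in pairs) / len(pairs)


def prompt_to_line(persona: str, lines: Sequence[str], judge: Judge) -> float:
    """Global drift: score every line against the original persona prompt."""
    if not lines:
        return 1.0
    return sum(judge(persona, line) for line in lines) / len(lines)


def qa_consistency(expected: Dict[str, str], answers: Dict[str, str]) -> float:
    """Factual breaks: fraction of probe questions whose answer contains the fact."""
    if not expected:
        return 1.0
    hits = sum(
        1 for question, fact in expected.items()
        if fact.lower() in answers.get(question, "").lower()
    )
    return hits / len(expected)


persona = "You are a vegetarian nurse from Ohio."
lines = [
    "I work night shifts at the hospital.",
    "I grilled a steak last night.",
    "Steak is my favourite.",
]
expected = {"Where are you from?": "Ohio", "What is your job?": "nurse"}
answers = {
    "Where are you from?": "I'm from Ohio.",
    "What is your job?": "I'm a teacher.",
}

scores = {
    "line_to_line": line_to_line(lines, toy_judge),
    "prompt_to_line": prompt_to_line(persona, lines, toy_judge),
    "qa": qa_consistency(expected, answers),
}
print(scores)
```

In this toy run the adjacent lines never contradict each other, so the line-to-line score stays perfect even though the dialogue has slid away from the persona prompt and one probed fact is wrong. Keeping the three scores apart is what makes that visible.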
Other notes in the collection effectively describe *where* each type comes from. Global drift has a measurable geometry: research mapping persona space found a single dominant axis, distance from the default Assistant mode, and emotional or self-reflective conversations push the model predictably along it (see "How stable is the trained Assistant personality in language models?"). That's global drift with a coordinate system. At a finer grain, the same kind of slide can be tracked as linear directions in the model's activation space, so traits like sycophancy can be watched (and steered) before they take hold during finetuning (see "Can we track and steer personality shifts during model finetuning?").
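The idea of a drift "coordinate" can be sketched as a projection onto a single direction, with a cap applied when the projection grows too large. This is a simplified illustration under stated assumptions, not the method from either note: the vectors are made up, the sign convention (larger projection means further from the default Assistant) is assumed, and `max_proj` is a hypothetical threshold.

```python
import math
from typing import List

Vector = List[float]


def unit(v: Vector) -> Vector:
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("axis must be non-zero")
    return [x / norm for x in v]


def projection(activation: Vector, axis: Vector) -> float:
    """Scalar position of the activation along the (normalised) axis."""
    u = unit(axis)
    return sum(a * b for a, b in zip(activation, u))


def cap_along_axis(activation: Vector, axis: Vector, max_proj: float) -> Vector:
    """Remove only the part of the projection that exceeds max_proj.

    Components orthogonal to the axis are left untouched.
    """
    u = unit(axis)
    excess = sum(a * b for a, b in zip(activation, u)) - max_proj
    if excess <= 0.0:
        return list(activation)
    return [a - excess * b for a, b in zip(activation, u)]


# Made-up numbers: axis 0 plays the role of the persona direction.
axis = [1.0, 0.0, 0.0]
activation = [3.0, 2.0, -1.0]

drift = projection(activation, axis)               # used to detect drift
capped = cap_along_axis(activation, axis, 1.5)     # used to limit it
print(drift, capped)
```

The same direction serves both roles here: reading the projection monitors the drift, and clipping it intervenes.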
The local and factual-contradiction types show up most clearly in the diagnoses of *why static personas fail*. Predefined 3–5 sentence persona lists tend to produce repetitive and self-contradicting dialogue because the model has only an attribute inventory to recite, not a way of expressing itself (see "Why do static persona descriptions produce repetitive dialogue?"). There is also a subtler trap: high persona-consistency scores can themselves *cause* a kind of drift away from the conversation, because a model chasing persona fidelity starts copying its character description and ignoring what was actually asked (see "Do persona consistency metrics actually measure dialogue quality?"). So 'staying in character' and 'staying on topic' can come apart, a fourth tension that the three-type scheme doesn't capture.
Two lateral framings recast the whole idea of drift. One camp argues there isn't much drift to worry about at all: RLHF doesn't install a costume that slips, it installs a realized disposition that stays put even under adversarial pressure (see "Are RLHF personas performed characters or realized dispositions?"). The opposite camp argues the real problem is the reverse of drift: alignment locks a model into *one* static communicative identity that can't shift register across contexts the way human speakers do (see "Can language models adapt communication style to different contexts?"). Read together, the corpus suggests 'drift' and 'rigidity' are two ends of the same dial.
If you want the cleanest answer to the literal question, start with the user-simulator paper ("Can training user simulators reduce persona drift in dialogue?") for the local/global/factual triad, then follow the persona-axis note ("How stable is the trained Assistant personality in language models?") and the persona-vectors note ("Can we track and steer personality shifts during model finetuning?") to see drift turned into something you can actually measure and steer. The thing you didn't know you wanted to know: the same property that lets researchers *detect* drift, a low-dimensional direction in the model's internal space, is also the knob they use to *prevent* it.
Sources (7 notes)
Inverting the standard RL setup to train user simulators for consistency, with three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, reduces persona drift by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.
Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.
Research identifies linear directions in LLM activation space corresponding to specific traits like sycophancy and hallucination. These persona vectors predict finetuning-induced personality shifts before they occur and can preventatively steer training to avoid unwanted trait changes.
Journal entries capturing Big Five traits through genuine self-expression produce more consistent and nuanced dialogue than predefined 3-5 sentence persona descriptions. Personality emerges from how people express themselves, not from attribute inventories.
High persona adherence scores often come from copying character descriptions while ignoring query relevance. MUDI jointly optimizes both by using discourse relations and graph-based coherence modeling alongside persona fidelity, showing that persona and context must be optimized together, not separately.
Post-training installs stable dispositional profiles that persist under adversarial pressure, marking them as realized rather than performed. The stickiness of trained personas across conversations distinguishes them from prompt-induced role-play that collapses under jailbreaks.
System prompts and RLHF training lock models into one communicative identity across all interactions, preventing the contextual register-switching and value trade-offs that characterize human pragmatics. Users cannot reshape model behavior through dialogue negotiation.