Can offline RL scale persona consistency across multi-turn conversations?

This explores whether reinforcement learning trained on logged conversations, rather than on live interaction with the model, can hold a character steady turn after turn. The corpus has little on 'offline' RL specifically, so it mostly answers the broader question of what actually fixes persona drift.

This reads the question as asking two things at once: does RL help personas stay consistent across a long conversation, and is that a scaling problem you can throw more training at? The corpus has a direct hit on the first part. One line of work inverts the usual setup and uses RL to train the *user simulator* rather than the assistant, rewarding it on three kinds of consistency (prompt-to-line, line-to-line, and Q&A factual agreement), and cuts persona drift by over 55% (Can training user simulators reduce persona drift in dialogue?). The key move there is that drift isn't one failure but three (local drift inside a turn, global drift across the whole dialogue, and outright factual contradictions), and the reward signal has to target each separately. That's the strongest evidence in the corpus that an RL objective shaped around cross-turn coherence improves persona consistency where ordinary training does not.
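
The three-part reward described above can be sketched as follows. This is a minimal illustration, not the cited work's implementation: the `Scorer` callable, the function names, and the weighting scheme are all assumptions introduced here, and a real scorer would more plausibly be an LLM judge than a string comparison.

```python
from typing import Callable, Dict, Sequence

# Hypothetical scorer: agreement in [0, 1] between two texts.
# A real system would likely use an LLM judge; it is left pluggable here.
Scorer = Callable[[str, str], float]


def prompt_to_line(persona: str, lines: Sequence[str], score: Scorer) -> float:
    """Mean agreement between the persona prompt and each generated line.

    Assumes at least one line.
    """
    return sum(score(persona, line) for line in lines) / len(lines)


def line_to_line(lines: Sequence[str], score: Scorer) -> float:
    """Mean agreement over every pair of lines from the same speaker."""
    pairs = [(lines[i], lines[j]) for j in range(len(lines)) for i in range(j)]
    if not pairs:
        return 1.0  # a single line has nothing to disagree with
    return sum(score(a, b) for a, b in pairs) / len(pairs)


def qa_agreement(expected: Dict[str, str], observed: Dict[str, str]) -> float:
    """Fraction of probe questions answered the way the persona implies.

    Assumes at least one probe question.
    """
    hits = sum(
        observed.get(q, "").strip().lower() == a.strip().lower()
        for q, a in expected.items()
    )
    return hits / len(expected)


def consistency_rewards(persona, lines, expected, observed, score) -> Dict[str, float]:
    """Keep the three signals separate so each can be weighted and logged."""
    return {
        "prompt_to_line": prompt_to_line(persona, lines, score),
        "line_to_line": line_to_line(lines, score),
        "qa": qa_agreement(expected, observed),
    }


def scalar_reward(rewards: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum; the weights are a design choice, not taken from the source."""
    return sum(weights[name] * value for name, value in rewards.items())
```

Returning the three signals as a dictionary rather than a single number keeps it visible which kind of drift a policy is being penalized for, which is the point the paragraph above makes about targeting each failure separately.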

The sharper finding is *why* you need a special objective at all. Persona adherence does not ride along with general model capability: Claude 3.5 Sonnet beat GPT-3.5 by only 2.97% on persona consistency despite an enormous capability gap, because standard training optimizes per-turn quality, not coherence across turns (Does model capability translate to better persona consistency?). So 'scale' in the sense of bigger-model-bigger-budget won't buy you consistency; the gains have to come from an objective that explicitly prices in the whole conversation. This is also why prompt-only personas are fragile. Run the same persona prompt repeatedly and the variance across runs matches or exceeds the variance across different personas, meaning model uncertainty, not stable character, drives the output (Why do LLM persona prompts produce inconsistent outputs across runs?). An LLM also holds a superposition of plausible characters and samples one at each generation rather than committing (Do large language models actually commit to a single character?).
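
The run-to-run versus persona-to-persona comparison above can be checked with a simple variance split. The sketch below assumes each run of a persona prompt has already been reduced to one number (for example a rating the persona-prompted model assigned); the function name and the toy data are hypothetical.

```python
from statistics import mean, pvariance
from typing import Dict, List


def variance_decomposition(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Compare variance across repeated runs with variance across personas.

    `scores` maps a persona name to the scalar outputs of repeated runs
    of the same persona prompt.
    """
    # Average spread among repeated runs of one and the same persona prompt.
    within = mean([pvariance(runs) for runs in scores.values()])
    # Spread of the per-persona means, i.e. what the persona label explains.
    between = pvariance([mean(runs) for runs in scores.values()])
    ratio = within / between if between > 0 else float("inf")
    return {"within": within, "between": between, "ratio": ratio}


# Toy data (made up for illustration): two personas, three runs each.
toy = {"optimist": [4, 5, 3], "skeptic": [3, 4, 2]}
result = variance_decomposition(toy)
```

A `ratio` at or above 1 corresponds to the situation the note describes, where rerunning one prompt moves the output at least as much as switching personas does.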

Here's the part you might not expect: post-training (RLHF) seems to do something prompting can't. A 'realizationist' reading argues RLHF doesn't make the model *perform* a character; it installs a stable disposition that survives adversarial pressure and persists across conversations, unlike prompt-induced role-play that collapses under jailbreaks (Are RLHF personas performed characters or realized dispositions?; Are LLM personas realized or merely simulated through training?). If that's right, RL-style training is doing exactly the thing the question asks about, but at the level of a baked-in default persona, not arbitrary user-specified ones. And there's a geometry to it: persona space is low-dimensional, dominated by an 'Assistant axis,' and emotional or self-reflective turns cause predictable drift along it that you can suppress with activation capping rather than retraining (How stable is the trained Assistant personality in language models?).
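
To make the idea of capping along an axis concrete, here is a minimal sketch under stated assumptions: the hidden state and the axis are plain vectors, and capping means clamping the projection onto the axis into a chosen range while leaving the orthogonal part untouched. The function name, the `lo`/`hi` bounds, and the choice to clamp on both sides are illustrative and are not taken from the cited paper.

```python
from math import sqrt
from typing import List


def cap_along_axis(hidden: List[float], axis: List[float],
                   lo: float, hi: float) -> List[float]:
    """Clamp the component of `hidden` along `axis` into [lo, hi].

    Assumes `axis` is non-zero; it is normalised internally.
    """
    norm = sqrt(sum(a * a for a in axis))
    unit = [a / norm for a in axis]
    # Scalar projection of the hidden state onto the axis direction.
    proj = sum(h * u for h, u in zip(hidden, unit))
    capped = min(max(proj, lo), hi)
    # Shift only along the axis; everything orthogonal to it is unchanged.
    shift = capped - proj
    return [h + shift * u for h, u in zip(hidden, unit)]


# A state that has drifted to projection 3.0 is pulled back to 1.0.
drifted = cap_along_axis([3.0, 4.0], [2.0, 0.0], lo=-1.0, hi=1.0)
```

Because only one direction is edited, an intervention of this shape can run at inference time, which is why the note contrasts it with retraining.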

Two cautions are worth carrying. First, consistency is not free: squeezing for high persona-adherence scores often just rewards copying the character description while ignoring what the user actually asked, so persona and discourse coherence have to be optimized jointly, not separately (Do persona consistency metrics actually measure dialogue quality?). An offline RL reward naively tuned for 'stay in character' could buy you a parrot. Second, there's a live alternative to retraining at all: optimize the persona at *test time*, treating it as an evolving intermediary between memory and action that updates against recent interactions (Can personas evolve in real time to match what users actually want?). That sidesteps the offline-vs-online question by moving adaptation out of the training loop entirely.
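
The first caution can be made concrete with a toy joint reward. The sketch below is not MUDI's method (which the note says uses discourse relations and graph-based coherence modeling); it swaps in Jaccard token overlap as a crude stand-in for both scores and combines them with a harmonic mean, so that a reply scoring zero on either side earns nothing. All names and example strings are hypothetical.

```python
import re
from typing import Set


def _tokens(text: str) -> Set[str]:
    """Lowercased word set; a deliberately crude tokenizer."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token overlap in [0, 1]; a stand-in for a learned scorer."""
    ta, tb = _tokens(a), _tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def joint_reward(persona: str, query: str, reply: str) -> float:
    """Harmonic mean of persona adherence and query relevance."""
    p = jaccard(persona, reply)   # does the reply sound like the persona?
    r = jaccard(query, reply)     # does the reply address the question?
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


persona = "I am a retired sailor who loves the sea"
query = "What is good to cook for dinner tonight?"
parrot = persona  # copies the character description verbatim
on_topic = "As a retired sailor, fish is good for dinner tonight"

parrot_score = joint_reward(persona, query, parrot)
on_topic_score = joint_reward(persona, query, on_topic)
```

Here the parrot reply gets a perfect persona score but shares no words with the query, so its joint reward is zero, while the on-topic reply scores above zero on both sides. A plain sum of the two scores would still pay the parrot for its persona score, which is the failure the paragraph warns about.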

The honest gap: the corpus doesn't contain work labeled 'offline RL' for persona consistency per se. What it does say is that the *idea* behind your question is sound — an RL objective built around cross-turn consistency demonstrably reduces drift — but the binding constraint isn't data or scale, it's reward design: you need rewards that distinguish local from global drift, that don't trade away relevance, and ideally that exploit the low-dimensional structure of persona space. Offline RL's natural advantage — learning from large logs of real multi-turn conversations — fits that need well, but the surrounding evidence says success will hinge entirely on what those logged rewards actually measure.
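
One way to picture the dependence on logged rewards is a reward-weighted imitation scheme in the style of advantage-weighted regression, a standard offline RL technique that is swapped in here purely for illustration; the corpus does not say any cited work uses it. Each logged conversation gets a weight that grows with how far its reward sits above the batch average, and those weights would then scale a supervised loss on the logged replies. The function name and parameters are hypothetical.

```python
from math import exp
from statistics import mean
from typing import List


def awr_weights(returns: List[float], beta: float = 1.0,
                max_weight: float = 20.0) -> List[float]:
    """Per-conversation training weights from logged rewards.

    `returns` holds one scalar reward per logged conversation, `beta` is a
    temperature, and `max_weight` clips outliers. The batch mean serves as
    a simple baseline in place of a learned value function.
    """
    baseline = mean(returns)
    return [min(exp((r - baseline) / beta), max_weight) for r in returns]


# Two logged conversations: one scored low on consistency, one scored high.
weights = awr_weights([0.2, 0.8], beta=0.3)
```

Nothing in this scheme looks at the conversations themselves, only at their logged scores, so whatever the reward fails to measure (relevance, global drift) the trained policy is never pushed to improve.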

Sources (9 notes)

Can training user simulators reduce persona drift in dialogue?

By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.

Does model capability translate to better persona consistency?

Claude 3.5 Sonnet achieved only 2.97% improvement over GPT 3.5 on persona consistency despite massive capability gaps, suggesting persona adherence is orthogonal to model scaling. Standard training objectives optimize for per-turn quality, not cross-turn coherence.

Why do LLM persona prompts produce inconsistent outputs across runs?

When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.

Do large language models actually commit to a single character?

Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, proving no fixed commitment exists.

Are RLHF personas performed characters or realized dispositions?

Post-training installs stable dispositional profiles that persist under adversarial pressure, marking them as realized rather than performed. The stickiness of trained personas across conversations distinguishes them from prompt-induced role-play that collapses under jailbreaks.

Are LLM personas realized or merely simulated through training?

Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.

How stable is the trained Assistant personality in language models?

Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.

Do persona consistency metrics actually measure dialogue quality?

High persona adherence scores often come from copying character descriptions while ignoring query relevance. MUDI jointly optimizes both by using discourse relations and graph-based coherence modeling alongside persona fidelity, showing that persona and context must be optimized together, not separately.

Can personas evolve in real time to match what users actually want?

PersonaAgent uses structured personas to bridge episodic/semantic memory and personalized actions, optimizing them at test time by simulating recent interactions against textual feedback. Learned personas cluster meaningfully in latent space, suggesting genuine user-specific separation beyond standard post-training drift.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher re-testing claims about offline RL and persona consistency. The question remains open: does offline RL scale persona consistency across multi-turn conversations?

What a curated library found — and when (dated claims, not current truth):
Findings span 2020–2026; treat each as perishable:
• RL training of user simulators with a three-part reward (prompt-to-line, line-to-line, Q&A factual agreement) cuts persona drift by >55%: direct evidence that RL objectives designed around cross-turn coherence reduce drift where standard training does not (~2025).
• Persona adherence does NOT scale with general model capability: Claude 3.5 Sonnet beats GPT-3.5 by <3% despite enormous capability gap, because standard training optimizes per-turn quality, not coherence across turns (~2024).
• Prompt-induced personas are unstable: variance across runs of the same prompt rivals variance across different personas; LLMs resample a fresh character each generation rather than commit (~2024).
• Post-training (RLHF) installs a stable disposition that survives adversarial pressure, unlike prompt-only role-play (~2026).
• Persona space is low-dimensional, dominated by an 'Assistant axis'; emotional turns cause predictable drift suppressible by activation capping (~2026).
• Persona-consistency rewards often trade off against discourse coherence (staying in character vs. answering the user); joint optimization required (~2024).
• Test-time persona adaptation (treating persona as evolving intermediary between memory and action) sidesteps offline-vs-online retraining (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2511.00222 (2025-10): Consistently Simulating Human Personas with Multi-Turn Reinforcement Learning
• arXiv:2601.10387 (2026-01): The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models
• arXiv:2506.06254 (2025-06): PersonaAgent: When Large Language Model Agents Meet Personalization at Test Time
• arXiv:2406.01171 (2024-06): Two Tales of Persona in LLMs: A Survey of Role-Playing and Personalization

Your task:
(1) RE-TEST EACH CONSTRAINT. For the 55% drift reduction claim: has post-2025 work confirmed or superseded this figure? Does it hold across model families (Llama, Qwen, Gemini)? For the 'capability-consistency decoupling': do newer post-training recipes (DPO, IPO, multi-objective RLHF) break this pattern? For test-time adaptation: is it now standard in production systems? Separate what's durable (personas need cross-turn coherence targets) from what may be obsolete (offline RL is necessary).
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Does multi-agent orchestration (memory, retrieval-augmented persona retrieval, persistent agent state) solve consistency without retraining? Do newer evaluation metrics (beyond drift scores) reveal trade-offs the library missed?
(3) Propose 2 research questions that ASSUME the regime may have moved: (a) If test-time persona adaptation now works, why would offline RL still be necessary? (b) If the Assistant axis is learnable and suppressible, can you directly optimize it instead of retraining the entire model?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
