How can training methods enforce persona consistency when supervised learning never penalizes contradictions?

This explores why standard supervised fine-tuning can't lock in a consistent persona — because it rewards good answers but never punishes self-contradiction — and what training (and inference) alternatives the corpus offers instead.

The core diagnosis is structural: supervised learning only ever rewards a correct response, so it has no signal for the thing that breaks personas: a model saying something now that flatly contradicts what it said two turns ago (see "Why does supervised learning fail to enforce persona consistency?"). The objective optimizes per-turn quality, not cross-turn coherence, which is also why bigger, more capable models barely improve on consistency; adherence turns out to be roughly orthogonal to raw capability (see "Does model capability translate to better persona consistency?").

The most direct fix the corpus offers is to add the missing penalty through reinforcement learning. Offline RL is the cheap version: train on data you already have, but attach explicit contradiction rewards from human-annotated labels so the model is finally punished for breaking character (see "Why does supervised learning fail to enforce persona consistency?"). A multi-turn RL approach pushes further by inverting the usual setup to train the user simulator, scoring three kinds of consistency at once (within a turn, across the whole conversation, and factual agreement), and cuts persona drift by over 55% (see "Can training user simulators reduce persona drift in dialogue?"). The lesson across both: consistency is a relational property between utterances, so the reward has to compare utterances, something a single-response loss can't do.
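To make the contrast with a single-response loss concrete, here is a minimal sketch of a reward-weighted (REINFORCE-style) objective computed over logged responses. It is an illustration under stated assumptions, not the cited method: the label names, the reward values in `LABEL_REWARD`, and the helper names are all hypothetical, and a real offline RL setup would likely add details such as importance weighting or a baseline.

```python
from typing import Sequence

# Hypothetical label-to-reward mapping; the numeric values are illustrative assumptions.
LABEL_REWARD = {"entail": 1.0, "neutral": 0.0, "contradict": -1.0}


def utterance_reward(labels: Sequence[str]) -> float:
    """Average the annotated relation between one response and each
    earlier utterance or persona fact it was compared against."""
    if not labels:
        return 0.0
    return sum(LABEL_REWARD[label] for label in labels) / len(labels)


def sft_loss(log_probs: Sequence[float]) -> float:
    """Supervised baseline: every logged response is a target,
    regardless of whether it contradicts anything."""
    return -sum(log_probs) / len(log_probs)


def offline_pg_loss(log_probs: Sequence[float], rewards: Sequence[float]) -> float:
    """Reward-weighted negative log-likelihood over logged responses.
    A negative reward pushes the response's log-probability down."""
    return -sum(r * lp for r, lp in zip(rewards, log_probs)) / len(log_probs)


# Two logged responses with the same log-probability under the model.
log_probs = [-0.5, -0.5]
rewards = [
    utterance_reward(["entail"]),                 # consistent with the persona
    utterance_reward(["contradict", "neutral"]),  # contradicts an earlier turn
]
print("SFT loss:", sft_loss(log_probs))
print("Offline PG loss:", offline_pg_loss(log_probs, rewards))
```

The supervised loss treats both logged responses identically, while the reward-weighted loss separates them, because the reward is computed from a comparison between utterances rather than from the response alone.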

A second family sidesteps human labels entirely by using the model against itself. Consistency training treats the model's own clean responses as targets and teaches it to answer identically whether or not a prompt is wrapped in distracting framing: invariance learned from self-generated supervision rather than annotated contradictions (see "Can models learn to ignore irrelevant prompt changes?"). At the far end, you can get consistency with no extra training at all: giving a dialogue agent an 'imaginary listener' lets it check at inference time whether each utterance actually distinguishes its persona from a decoy, suppressing generic or contradictory lines without NLI labels or fine-tuning (see "Can imaginary listeners reduce dialogue agent contradictions?").
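The imaginary-listener idea can be sketched as a Rational Speech Acts style rerank over candidate utterances. This is a toy version under stated assumptions: the function name `pragmatic_rerank`, the `alpha` parameter, the uniform prior over personas, and the probabilities in the example are all hypothetical, and the base speaker probabilities would in practice come from a dialogue model rather than a hand-written table.

```python
from typing import Dict


def pragmatic_rerank(
    base_speaker: Dict[str, Dict[str, float]],
    target: str,
    alpha: float = 1.0,
) -> Dict[str, float]:
    """Rerank candidates so that utterances a listener would attribute
    to the target persona are preferred.

    base_speaker[persona][utterance] is the base model's probability of
    the utterance when conditioned on that persona.
    """
    scores = {}
    for utterance, p_target in base_speaker[target].items():
        # Imaginary listener: posterior over personas given the utterance,
        # assuming a uniform prior over the personas in the table.
        total = sum(base_speaker[p][utterance] for p in base_speaker)
        listener = p_target / total if total > 0 else 0.0
        # Pragmatic speaker: base fluency times listener informativeness.
        scores[utterance] = (listener ** alpha) * p_target
    norm = sum(scores.values())
    return {u: s / norm for u, s in scores.items()}


# Hypothetical probabilities for a hiking-enthusiast persona and one decoy.
base = {
    "target": {
        "I love hiking.": 0.30,
        "That sounds nice.": 0.50,
        "I hate the outdoors.": 0.20,
    },
    "decoy": {
        "I love hiking.": 0.05,
        "That sounds nice.": 0.50,
        "I hate the outdoors.": 0.45,
    },
}
reranked = pragmatic_rerank(base, "target")
print(max(base["target"], key=base["target"].get))  # base model's top choice
print(max(reranked, key=reranked.get))              # top choice after reranking
```

In this toy table the base model's favorite is the generic reply, which the decoy persona would say just as readily. The rerank moves the persona-revealing line to the top and pushes the contradictory line further down, with no training step involved.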

Here's the catch worth knowing about before you optimize hard for consistency: chasing it naively backfires. Models can rack up high persona-adherence scores simply by parroting their character description while ignoring what the user actually asked: consistency bought at the cost of coherence. The MUDI work shows persona fidelity and discourse relevance have to be optimized jointly, not as separate objectives, or you get a model that's faithfully on-character and uselessly off-topic (see "Do persona consistency metrics actually measure dialogue quality?").
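A toy scoring example shows why a persona-only metric rewards parroting. Everything here is an assumption made for illustration: token overlap is a crude stand-in for both persona fidelity and relevance, taking the minimum is just one simple way to couple the two, and none of it reflects MUDI's discourse-relation or graph-based modeling.

```python
import re


def tokens(text: str) -> set:
    """Lowercased word set; a deliberately crude text representation."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two strings, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def persona_score(response: str, persona: str) -> float:
    return jaccard(response, persona)


def relevance_score(response: str, query: str) -> float:
    return jaccard(response, query)


def joint_score(response: str, persona: str, query: str) -> float:
    """Couple the two objectives: a response is only as good as its
    weaker dimension, so parroting the persona cannot win on its own."""
    return min(persona_score(response, persona), relevance_score(response, query))


# Hypothetical persona, user query, and two candidate responses.
persona = "I am a retired sailor who loves the sea"
query = "What is the best way to cook rice"
parrot = "I am a retired sailor who loves the sea"
balanced = "As a retired sailor I cook rice the way we did at sea"

for name, response in [("parrot", parrot), ("balanced", balanced)]:
    print(name, persona_score(response, persona), joint_score(response, persona, query))
```

The parroting response wins on the persona-only score and loses on the joint score, which is the failure pattern the paragraph above describes.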

Step back and there's a deeper reframe in the corpus. One line of thinking argues post-training doesn't merely teach a model to perform a persona; it installs a 'realized' disposition that persists under adversarial pressure, with a dominant 'Assistant axis' running through persona space that you can even steer by capping activations rather than retraining (see "Are RLHF personas performed characters or realized dispositions?" and "How stable is the trained Assistant personality in language models?"). If that's right, persona consistency isn't only a loss-function problem; it's partly a property of the representational geometry training carves out, which opens a third lever entirely: edit the activations, not just the objective.
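As a rough picture of what capping activations along an axis could mean, the sketch below clamps the component of a hidden-state vector along a given direction and leaves the orthogonal part untouched. It is a guess at the general mechanism, not the cited work's procedure: the function name, the two-sided `low`/`high` bounds, and the example vectors are hypothetical, and a real implementation would operate on model activations inside a forward pass.

```python
import math
from typing import List


def cap_along_axis(hidden: List[float], axis: List[float],
                   low: float, high: float) -> List[float]:
    """Clamp the projection of `hidden` onto `axis` to [low, high],
    leaving every component orthogonal to the axis unchanged."""
    norm = math.sqrt(sum(a * a for a in axis))
    if norm == 0:
        raise ValueError("axis must be non-zero")
    unit = [a / norm for a in axis]
    projection = sum(h * u for h, u in zip(hidden, unit))
    capped = max(low, min(high, projection))
    # Shift the vector along the axis by exactly the clamped difference.
    return [h + (capped - projection) * u for h, u in zip(hidden, unit)]


# Hypothetical 2-D example: the first coordinate plays the role of the axis.
axis = [1.0, 0.0]
drifted = cap_along_axis([3.0, 4.0], axis, low=-1.0, high=1.0)
in_range = cap_along_axis([0.5, 4.0], axis, low=-1.0, high=1.0)
print(drifted)   # axis component pulled back to the bound
print(in_range)  # already inside the range, so unchanged
```

Only the component along the axis is altered, which fits the claim that capping can limit drift without retraining, though whether capabilities are preserved is an empirical question this sketch cannot answer.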

Sources 8 notes

Why does supervised learning fail to enforce persona consistency?

Supervised learning cannot enforce persona consistency because it rewards correct responses but never penalizes contradictions. Offline reinforcement learning combines inexpensive training on existing data with explicit contradiction rewards using human-annotated labels, offering a practical alternative to expensive online RL.

Does model capability translate to better persona consistency?

Claude 3.5 Sonnet achieved only a 2.97% improvement over GPT-3.5 on persona consistency despite massive capability gaps, suggesting persona adherence is orthogonal to model scaling. Standard training objectives optimize for per-turn quality, not cross-turn coherence.

Can training user simulators reduce persona drift in dialogue?

By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.

Can models learn to ignore irrelevant prompt changes?

Two methods—BCT (output-level) and ACT (activation-level)—train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, eliminating specification and capability staleness inherent in standard SFT.

Can imaginary listeners reduce dialogue agent contradictions?

Endowing dialogue agents with an imaginary listener via Rational Speech Acts reduces persona contradiction at inference time without NLI labels or extra training. The agent simulates whether utterances would distinguish its persona from a distractor, suppressing generic or contradictory responses.

Do persona consistency metrics actually measure dialogue quality?

High persona adherence scores often come from copying character descriptions while ignoring query relevance. MUDI jointly optimizes both by using discourse relations and graph-based coherence modeling alongside persona fidelity, showing that persona and context must be optimized together, not separately.

Are RLHF personas performed characters or realized dispositions?

Post-training installs stable dispositional profiles that persist under adversarial pressure, marking them as realized rather than performed. The stickiness of trained personas across conversations distinguishes them from prompt-induced role-play that collapses under jailbreaks.

How stable is the trained Assistant personality in language models?

Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.
