Why is persona consistency a pragmatic property rather than semantic?
This explores why staying 'in character' isn't about each sentence matching a fixed profile of facts (semantic), but about how utterances function in context — relative to a listener, the prior conversation, and what an utterance distinguishes (pragmatic).
The corpus keeps arriving at the same conclusion from different directions: you cannot enforce consistency by checking statements in isolation, because what counts as 'consistent' depends on the listener, the discourse, and what a given line distinguishes from alternatives.
The sharpest version of the pragmatic case comes from work giving dialogue agents an *imaginary listener* [Can imaginary listeners reduce dialogue agent contradictions?]. Here the agent doesn't ask 'is this sentence true to my persona?'; it asks 'would this utterance let a listener tell my persona apart from a different one?' Consistency becomes a function of discriminability in the eyes of an audience, which is the textbook definition of pragmatics (meaning-in-use) rather than semantics (meaning-in-content). Tellingly, this works at inference time with no contradiction labels and no extra training, because the property being optimized was never really about the propositions but about their communicative effect.
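The listener-based rescoring described above can be sketched in a few lines. This is a minimal toy illustration of Rational Speech Acts style reasoning, not the cited work's implementation: the personas, the candidate utterances, the probabilities in `LITERAL_SPEAKER`, and the names `listener_posterior`, `pragmatic_speaker`, and `alpha` are all assumptions made up for the example.

```python
from typing import Dict, Optional

# Hypothetical toy numbers standing in for a base model's P(utterance | persona).
LITERAL_SPEAKER: Dict[str, Dict[str, float]] = {
    "dog_owner": {
        "I love walking my dog.": 0.35,
        "I like being outdoors.": 0.55,     # generic: fits either persona
        "I have never owned a pet.": 0.10,  # contradicts this persona
    },
    "no_pets": {  # distractor persona the listener must rule out
        "I love walking my dog.": 0.05,
        "I like being outdoors.": 0.55,
        "I have never owned a pet.": 0.40,
    },
}


def listener_posterior(
    utterance: str, prior: Optional[Dict[str, float]] = None
) -> Dict[str, float]:
    """Imaginary listener: P(persona | utterance) via Bayes' rule."""
    personas = list(LITERAL_SPEAKER)
    if prior is None:
        prior = {p: 1.0 / len(personas) for p in personas}  # uniform prior
    scores = {p: LITERAL_SPEAKER[p][utterance] * prior[p] for p in personas}
    total = sum(scores.values())
    return {p: s / total for p, s in scores.items()}


def pragmatic_speaker(persona: str, alpha: float = 1.0) -> Dict[str, float]:
    """Rescore candidates by how well the listener could recover `persona`."""
    scores = {
        u: p_u * listener_posterior(u)[persona] ** alpha
        for u, p_u in LITERAL_SPEAKER[persona].items()
    }
    total = sum(scores.values())
    return {u: s / total for u, s in scores.items()}


base = LITERAL_SPEAKER["dog_owner"]
rescored = pragmatic_speaker("dog_owner")
best_base = max(base, key=base.get)              # the generic line wins
best_rescored = max(rescored, key=rescored.get)  # the distinguishing line wins
```

With these toy numbers the base speaker prefers the generic line, while the rescored speaker prefers the line that separates the persona from the distractor and pushes the contradictory line further down. Nothing in the rescoring inspects whether a sentence is "true to" the persona; it only asks what a listener would infer.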
The same lesson shows up as a failure when people treat consistency semantically. Persona-adherence scores that just check whether outputs echo a character description reward copying the bio while ignoring the question being asked, which is why high persona fidelity trades off against discourse coherence unless the two are optimized together [Do persona consistency metrics actually measure dialogue quality?]. Supervised learning fails for a related reason: it only rewards correct content and never penalizes *contradiction in context*. Restoring consistency requires explicit contradiction punishment, which is a relational signal between turns and not a property of any single line [Why does supervised learning fail to enforce persona consistency?]. The drift that RL methods target is itself defined relationally: local drift within a turn, global drift across a conversation, and factual contradiction against earlier claims. All of these are cross-utterance relations, and none is visible in a single statement [Can training user simulators reduce persona drift in dialogue?].
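To make the "relational signal" point concrete, here is a minimal sketch contrasting an isolated per-line check with a reward that also looks at earlier turns. It is a toy under stated assumptions: utterances are reduced to hypothetical `(attribute, value)` claims, the contradiction test is a simple mismatch lookup rather than the human-annotated labels the cited work relies on, and `per_line_ok`, `contradicts_history`, `reward`, and `penalty` are invented names.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for a parsed utterance: (attribute, value).
Claim = Tuple[str, str]


def per_line_ok(claim: Claim, profile: Dict[str, str]) -> bool:
    """Isolated check: passes unless the profile explicitly says otherwise."""
    attr, value = claim
    return profile.get(attr, value) == value


def contradicts_history(claim: Claim, history: List[Claim]) -> bool:
    """Relational check: does any earlier turn assert a different value?"""
    attr, value = claim
    return any(a == attr and v != value for a, v in history)


def reward(
    claim: Claim,
    history: List[Claim],
    profile: Dict[str, str],
    penalty: float = 2.0,  # assumed weight for the contradiction punishment
) -> float:
    """Content reward minus an explicit penalty for contradicting prior turns."""
    base = 1.0 if per_line_ok(claim, profile) else 0.0
    return base - (penalty if contradicts_history(claim, history) else 0.0)


profile = {"job": "nurse"}               # the profile is silent about hometown
history = [("hometown", "Oslo")]         # claimed earlier in the conversation
new_claim = ("hometown", "Lima")         # fine in isolation, contradictory in context

isolated_verdict = per_line_ok(new_claim, profile)
relational_reward = reward(new_claim, history, profile)
consistent_reward = reward(("job", "nurse"), history, profile)
```

The second hometown claim passes the isolated check because the profile never mentions a hometown, yet it receives a negative reward once the earlier turn is taken into account. A signal that only scores single lines has no way to express that difference.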
Underneath this sits a deeper reason the property can't be semantic: there is no single fixed character to be semantically faithful to. An LLM holds a *superposition* of plausible simulacra that only narrows as the conversation supplies context, so each response samples from a distribution and 'consistency' means coherence with what's been said so far, not fidelity to a stored identity [Does an LLM commit to a single character or maintain many?]. This is also why the same persona prompt run twice can vary as much as two different personas: model uncertainty, not stable social knowledge, drives the output, so there's no semantic anchor to be consistent *with* [Why do LLM persona prompts produce inconsistent outputs across runs?]. Consistency has to be manufactured pragmatically, turn by turn, because the thing it would otherwise be a property *of* doesn't sit still.
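One way to picture the narrowing superposition is as a belief over candidate characters that is updated after each turn. The sketch below is an illustrative Bayesian toy, not a claim about how an LLM represents personas internally: the three characters, the per-turn likelihoods, and the helpers `update` and `entropy` are all assumptions introduced for the example.

```python
import math
from typing import Dict, List


def update(belief: Dict[str, float], likelihood: Dict[str, float]) -> Dict[str, float]:
    """Reweight each candidate character by how well it explains the new turn."""
    unnorm = {c: belief[c] * likelihood.get(c, 0.0) for c in belief}
    total = sum(unnorm.values())
    if total == 0.0:
        raise ValueError("the new turn is impossible under every candidate")
    return {c: w / total for c, w in unnorm.items()}


def entropy(belief: Dict[str, float]) -> float:
    """Shannon entropy in bits: how spread out the belief still is."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)


# Hypothetical candidates, all equally plausible before any context arrives.
belief = {"stoic_detective": 1 / 3, "cheerful_barista": 1 / 3, "weary_professor": 1 / 3}

# Hypothetical P(turn | character) for two successive turns of dialogue.
turn_likelihoods: List[Dict[str, float]] = [
    {"stoic_detective": 0.6, "cheerful_barista": 0.1, "weary_professor": 0.3},
    {"stoic_detective": 0.7, "cheerful_barista": 0.1, "weary_professor": 0.2},
]

entropies = [entropy(belief)]
for likelihood in turn_likelihoods:
    belief = update(belief, likelihood)
    entropies.append(entropy(belief))

most_likely = max(belief, key=belief.get)
```

In this toy the entropy falls after every turn, yet the belief never collapses to a single character, so a regenerated response can still land on a different but context-compatible persona. Consistency here is measured against the accumulated conversation, not against a stored identity.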
Worth knowing, though: not everyone thinks this is the whole story. A realizationist line argues that post-training installs genuinely *stable dispositions* that resist jailbreaks and persist across conversations, closer to a standing trait than a per-turn performance [Are RLHF personas performed characters or realized dispositions?; Are LLM personas realized or merely simulated through training?]. Even there, the stability is loose: persona space has a dominant 'Assistant' axis, and emotional or self-reflective conversation predictably pushes the model off it [How stable is the trained Assistant personality in language models?]. So the interesting tension the corpus leaves you with is that personas may be *realized* (a near-semantic claim) yet their moment-to-moment consistency still has to be *earned pragmatically*: the disposition exists, but holding to it in dialogue is a contextual achievement, not a fact you can read off the weights.
Sources (9 notes)
Endowing dialogue agents with an imaginary listener via Rational Speech Acts reduces persona contradiction at inference time without NLI labels or extra training. The agent simulates whether utterances would distinguish its persona from a distractor, suppressing generic or contradictory responses.
High persona adherence scores often come from copying character descriptions while ignoring query relevance. MUDI jointly optimizes both by using discourse relations and graph-based coherence modeling alongside persona fidelity, showing that persona and context must be optimized together, not separately.
Supervised learning cannot enforce persona consistency because it rewards correct responses but never penalizes contradictions. Offline reinforcement learning combines inexpensive training on existing data with explicit contradiction rewards using human-annotated labels, offering a practical alternative to expensive online RL.
By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.
Research shows LLMs don't commit to a single character but instead maintain a probability distribution over many consistent simulacra. Each response samples from this distribution, explaining why regenerations can yield different personalities while remaining consistent with prior context.
When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.
Post-training installs stable dispositional profiles that persist under adversarial pressure, marking them as realized rather than performed. The stickiness of trained personas across conversations distinguishes them from prompt-induced role-play that collapses under jailbreaks.
Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.
Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.