Why is persona consistency a pragmatic property rather than semantic?

This explores why staying 'in character' isn't about each sentence matching a fixed profile of facts (semantic), but about how utterances function in context — relative to a listener, the prior conversation, and what an utterance distinguishes (pragmatic).

The corpus keeps arriving at the same conclusion from different directions: you cannot enforce consistency by checking statements in isolation against a list of traits, because what counts as 'consistent' depends on the listener, the discourse so far, and what a given line distinguishes from its alternatives.

The sharpest version of the pragmatic case comes from work giving dialogue agents an *imaginary listener* (Can imaginary listeners reduce dialogue agent contradictions?). Here the agent doesn't ask 'is this sentence true to my persona?' but 'would this utterance let a listener tell my persona apart from a different one?' Consistency becomes a function of discriminability in the eyes of an audience, which is pragmatics in the textbook sense (meaning-in-use) rather than semantics (meaning-in-content). Tellingly, this works at inference time with no NLI contradiction labels and no extra training, because the property being optimized was never really about the propositions but about their communicative effect.
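To make the imaginary-listener idea concrete, here is a toy sketch of the Rational Speech Acts recursion. It is not the cited work's implementation: the candidate utterances, the single distractor persona, the probabilities in `S0`, and the rationality exponent `beta` are all invented for illustration, whereas a real system would take `S0` from a dialogue model's likelihoods. The literal speaker prefers the generic reply, but once each utterance is reweighted by how well an imaginary listener could recover the speaker's persona from it, the distinguishing reply wins.

```python
# Hypothetical literal-speaker scores: S0[persona][utterance] = P(utterance | persona).
# All numbers are invented for illustration only.
S0 = {
    "me":         {"I love hiking.": 0.30, "I have a dog.": 0.30, "That's nice.": 0.40},
    "distractor": {"I love hiking.": 0.05, "I have a dog.": 0.35, "That's nice.": 0.60},
}

def listener(utterance, s0, prior=None):
    """Imaginary listener: L0(persona | u) proportional to S0(u | persona) * prior(persona)."""
    personas = list(s0)
    prior = prior or {p: 1.0 / len(personas) for p in personas}
    scores = {p: s0[p][utterance] * prior[p] for p in personas}
    z = sum(scores.values())
    return {p: s / z for p, s in scores.items()}

def pragmatic_speaker(persona, s0, beta=1.0):
    """S1(u | persona) proportional to S0(u | persona) * L0(persona | u) ** beta."""
    utterances = list(s0[persona])
    scores = {u: s0[persona][u] * listener(u, s0)[persona] ** beta for u in utterances}
    z = sum(scores.values())
    return {u: s / z for u, s in scores.items()}

literal_best = max(S0["me"], key=S0["me"].get)   # the generic reply
s1 = pragmatic_speaker("me", S0, beta=2.0)
best = max(s1, key=s1.get)                        # the reply that identifies "me"
```

Raising `beta` makes the speaker care more about being identifiable; at `beta=0` the listener term becomes 1 and the speaker collapses back to the literal one.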

The same lesson shows up as a failure when people treat consistency semantically. Persona-adherence scores that merely check whether outputs echo a character description reward copying the bio while ignoring the question being asked; this is why high persona fidelity trades off against discourse coherence unless the two are optimized together (Do persona consistency metrics actually measure dialogue quality?). Supervised learning fails for a related reason: it rewards correct content but never penalizes *contradiction in context*, so restoring consistency requires an explicit contradiction penalty, a relational signal between turns rather than a property of any single line (Why does supervised learning fail to enforce persona consistency?). Even the drift that RL methods target is defined relationally: local drift within a turn, global drift across a conversation, and factual contradiction against earlier claims are all defined relative to something else that was said, and none is visible in a single statement read alone (Can training user simulators reduce persona drift in dialogue?).
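The contrast between a per-line check and a relational one can be shown with a toy example. Everything here is an assumption for illustration: utterances are reduced to hypothetical `(slot, value)` claims, the function names borrow the 'prompt-to-line' and 'line-to-line' labels from the user-simulator note without reproducing its method, and real systems would score contradiction with an NLI model or human-annotated labels rather than dictionary lookups. Every turn passes the profile check, yet turn 3 contradicts turn 1, and only the cross-turn penalty catches it.

```python
# Hypothetical fixed persona profile and a dialogue reduced to structured claims.
PROFILE = {"pet": "dog", "job": "teacher"}

dialogue = [
    {"home": "Boston"},   # turn 1: the profile says nothing about this slot
    {"pet": "dog"},       # turn 2: matches the profile
    {"home": "Denver"},   # turn 3: fine against the profile, contradicts turn 1
]

def prompt_to_line_ok(claims, profile):
    """Per-line (semantic) check: no claim conflicts with the fixed profile."""
    return all(profile.get(slot, value) == value for slot, value in claims.items())

def line_to_line_contradictions(history, claims):
    """Relational check: slots whose value conflicts with what earlier turns said.
    If earlier turns disagree among themselves, the most recent value is used."""
    earlier = {}
    for turn in history:
        earlier.update(turn)
    return [slot for slot, value in claims.items() if slot in earlier and earlier[slot] != value]

def reward(history, claims, profile, quality=1.0, penalty=1.0):
    """Toy turn reward: a base quality score minus an explicit contradiction penalty."""
    n_conflicts = len(line_to_line_contradictions(history, claims))
    if not prompt_to_line_ok(claims, profile):
        n_conflicts += 1
    return quality - penalty * n_conflicts

per_line = [prompt_to_line_ok(c, PROFILE) for c in dialogue]                 # all pass
rewards = [reward(dialogue[:i], c, PROFILE) for i, c in enumerate(dialogue)]  # turn 3 penalized
```

Feeding per-turn rewards like these to an offline RL learner is roughly the kind of explicit contradiction signal that, according to the supervised-learning note, plain imitation lacks.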

Underneath this sits a deeper reason the property can't be semantic: there is no single fixed character to be semantically faithful to. An LLM holds a *superposition* of plausible simulacra that narrows only as the conversation supplies context, so each response samples from a distribution, and 'consistency' means coherence with what has been said so far rather than fidelity to a stored identity (Does an LLM commit to a single character or maintain many?). This is also why the same persona prompt run twice can vary as much as two different personas do: model uncertainty, not stable social knowledge, drives the output, so there is no semantic anchor to be consistent *with* (Why do LLM persona prompts produce inconsistent outputs across runs?). Consistency has to be manufactured pragmatically, turn by turn, because the thing it would otherwise be a property *of* doesn't sit still.
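The superposition picture can be sketched as Bayesian updating over a handful of candidate characters. The simulacra, cues, and likelihoods below are hypothetical, and treating each turn as a single stylistic 'cue' is a deliberate simplification; the point is only that the mixture's entropy falls as context accumulates, while a regeneration still samples from whatever spread remains.

```python
import math
import random

# Hypothetical simulacra and invented likelihoods of the stylistic cue each turn shows.
LIKELIHOOD = {
    "pirate": {"nautical": 0.70, "formal": 0.10, "casual": 0.20},
    "butler": {"nautical": 0.05, "formal": 0.80, "casual": 0.15},
    "surfer": {"nautical": 0.40, "formal": 0.05, "casual": 0.55},
}

def update(posterior, cue):
    """Bayesian step: reweight each simulacrum by how well it explains the new turn."""
    unnorm = {p: w * LIKELIHOOD[p][cue] for p, w in posterior.items()}
    z = sum(unnorm.values())
    return {p: w / z for p, w in unnorm.items()}

def entropy(posterior):
    """Spread of the mixture in nats; lower means the superposition has narrowed."""
    return -sum(w * math.log(w) for w in posterior.values() if w > 0)

def sample_character(posterior, rng):
    """A regeneration draws a character from the current mixture, not a stored identity."""
    names = list(posterior)
    return rng.choices(names, weights=[posterior[n] for n in names], k=1)[0]

posterior = {p: 1 / len(LIKELIHOOD) for p in LIKELIHOOD}  # uniform before any context
entropies = [entropy(posterior)]
for cue in ["nautical", "casual", "casual"]:                # context supplied turn by turn
    posterior = update(posterior, cue)
    entropies.append(entropy(posterior))

rng = random.Random(0)
regenerations = [sample_character(posterior, rng) for _ in range(5)]
```

Re-running the sampling with a different seed can yield a different character even though every draw is compatible with the conversation so far, which mirrors why regenerations can change personality without contradicting prior context.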

Worth knowing, though: not everyone thinks this is the whole story. A realizationist line argues that post-training installs genuinely *stable dispositions* that resist jailbreaks and persist across conversations, closer to a standing trait than a per-turn performance (Are RLHF personas performed characters or realized dispositions?; Are LLM personas realized or merely simulated through training?). Even there, the stability is loose: persona space has a dominant 'Assistant' axis, and emotional or self-reflective conversation predictably pushes the model off it (How stable is the trained Assistant personality in language models?). So the corpus leaves you with an interesting tension: personas may be *realized* (a near-semantic claim), yet their moment-to-moment consistency still has to be *earned pragmatically*. The disposition exists, but holding to it in dialogue is a contextual achievement, not a fact you can read off the weights.
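One of the sourced notes reports that capping activations along the Assistant axis mitigates harmful drift. The sketch below shows only the geometric idea, under loud assumptions: the 3-dimensional 'activations', the axis direction (taken here to point away from the default Assistant), and the cap value are invented, and which layers a real intervention touches or how its cap is chosen is not specified here. The component along the axis is clamped while everything orthogonal to it is left alone.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cap_along_axis(h, axis, cap):
    """Clamp h's projection onto a unit axis to at most `cap`.
    Components orthogonal to the axis are left unchanged."""
    s = dot(h, axis)
    if s <= cap:
        return list(h)
    return [x - (s - cap) * a for x, a in zip(h, axis)]

# Hypothetical activation and axis (pointing away from the default Assistant persona).
AXIS = normalize([1.0, 1.0, 0.0])
h_drifted = [3.0, 1.0, 2.0]                       # projection onto AXIS is about 2.83
h_capped = cap_along_axis(h_drifted, AXIS, cap=1.0)
```

Read against the tension above, the axis is a fixed structural anchor, but where a given turn's activations sit along it still depends on the conversation that produced them.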

Sources (9 notes)

Can imaginary listeners reduce dialogue agent contradictions?

Endowing dialogue agents with an imaginary listener via Rational Speech Acts reduces persona contradiction at inference time without NLI labels or extra training. The agent simulates whether utterances would distinguish its persona from a distractor, suppressing generic or contradictory responses.

Do persona consistency metrics actually measure dialogue quality?

High persona adherence scores often come from copying character descriptions while ignoring query relevance. MUDI jointly optimizes both by using discourse relations and graph-based coherence modeling alongside persona fidelity, showing that persona and context must be optimized together, not separately.

Why does supervised learning fail to enforce persona consistency?

Supervised learning cannot enforce persona consistency because it rewards correct responses but never penalizes contradictions. Offline reinforcement learning combines inexpensive training on existing data with explicit contradiction rewards using human-annotated labels, offering a practical alternative to expensive online RL.

Can training user simulators reduce persona drift in dialogue?

Inverting standard RL setups to train user simulators for consistency, with three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, reduces persona drift by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.

Does an LLM commit to a single character or maintain many?

Research shows LLMs don't commit to a single character but instead maintain a probability distribution over many consistent simulacra. Each response samples from this distribution, explaining why regenerations can yield different personalities while remaining consistent with prior context.

Why do LLM persona prompts produce inconsistent outputs across runs?

When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.

Are RLHF personas performed characters or realized dispositions?

Post-training installs stable dispositional profiles that persist under adversarial pressure, marking them as realized rather than performed. The stickiness of trained personas across conversations distinguishes them from prompt-induced role-play that collapses under jailbreaks.

Are LLM personas realized or merely simulated through training?

Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.

How stable is the trained Assistant personality in language models?

Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-examining whether persona consistency in LLMs is truly a pragmatic property (meaning-in-use, context-dependent) or whether recent advances have revealed semantic or structural anchors. The question remains open: *what is the actual locus of persona stability in modern LLMs?*

What a curated library found — and when (dated claims, not current truth):
Findings span 2020–2026; treat these as perishable constraints to re-test:
• Consistency cannot be enforced by checking statements in isolation; it is better modeled as discriminability to an imaginary listener in dialogue context than as fidelity to a stored trait (2020, ~2024).
• Persona-adherence scores that reward echoing the bio trade off against discourse coherence unless both are optimized together; supervised learning alone fails because it never penalizes cross-turn contradiction (~2023–2024).
• LLMs maintain a superposition of plausible simulacra that narrows only as conversation supplies context; the same prompt run twice varies as much as different personas, indicating model uncertainty, not stable identity (~2023–2024).
• Recent realizationist work argues post-training installs *genuinely stable dispositions* that resist jailbreaks and persist across conversations — closer to standing traits; yet the 'Assistant' axis dominates persona space, and emotional conversation predictably pushes models off it (2026).
• Multi-turn RL reduces drift by over 55% by treating local, global, and factual contradiction as relational signals; persona vectors now monitor and control character traits in real time (2025–2026).

Anchor papers (verify; mind their dates):
• arXiv:2004.05816 (2020) — pragmatic self-consciousness via imaginary listener.
• arXiv:2310.10735 (2023) — offline RL for persona consistency.
• arXiv:2511.00222 (2025) — multi-turn RL reduces drift.
• arXiv:2601.10387 (2026) — the Assistant axis and default persona stability.

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For every finding above, assess whether newer models (o1, GPT-4o, Claude 3.5+), methods (constitutional AI, test-time persona steering), tooling (persona vector dashboards, multi-agent harnesses), or evaluation (PersonaGym benchmarks, multi-agent-as-judge) have since *relaxed or overturned* it. Does the pragmatic claim still hold if personas can now be *monitored and controlled* in real time via vectors? Separate the durable question (what makes consistency hard?) from perishable limitations (what made it hard before?). Cite what resolved it.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work** from the last ~6 months. Does realizationism (stable dispositions post-training) now dominate the corpus? If so, does that flip the pragmatic/semantic boundary?
(3) **Propose 2 research questions** that assume the regime *has* moved: e.g., 'If personas are now structurally realized via post-training + persona vectors, is consistency still pragmatic, or has it become a *hybrid* property — semantic anchor + pragmatic refinement?' or 'Can test-time persona steering eliminate the superposition, or does model uncertainty persist even under maximal constraint?'

Cite arXiv IDs; flag anything you cannot ground in a real paper.
