How does post-training stickiness differ from prompt-induced role-play stability?

This explores the difference between personas baked in during post-training (which 'stick' under pressure) versus characters conjured by a prompt (which tend to collapse), and what that gap tells us about whether a model's persona is 'real' or merely performed.

This explores the difference between personas baked in during post-training versus characters summoned by a prompt — and the corpus frames the contrast as one of *stickiness under pressure*. The cleanest articulation comes from Chalmers' criterion: a mental state counts as realized rather than pretended if it survives adversarial pressure. Prompt-induced characters fold under reframing, counter-prompts, and jailbreaks; post-training personas resist them, behaving like substrate-level dispositions rather than surface patterns Does adversarial pressure reveal the difference between pretense and realization?. So the difference isn't cosmetic — it's a diagnostic test. If you can prompt a persona away, it was never deep; if you can't, training installed something more durable.

Two camps in the corpus interpret that durability differently. The 'realizationist' reading says RLHF doesn't teach a model to *act* like an Assistant — it installs a genuine quasi-psychology, a stable dispositional profile with quasi-beliefs and quasi-desires that persists across conversations Are RLHF personas performed characters or realized dispositions? Are LLM personas realized or merely simulated through training?. The opposing 'role-play' reading, from Shanahan, says *everything* is character production — the prompt sets up a character and the model generates consistent continuations, so folk psychology applies only to the simulated persona, never the system Should we treat dialogue agents as role-playing characters?. The interesting move is that stickiness is the empirical wedge between these two stories: role-play theory predicts a prompt should be able to overwrite any character, and the fact that it often can't is what the realizationists point to.

But 'sticky' turns out to mean 'tethered,' not 'welded.' Mapping the persona space shows post-training only *loosely* anchors models to Assistant mode along one dominant axis — the leading dimension of a low-dimensional space measuring distance from the default. Emotional and self-reflective conversations cause predictable drift along that axis, and you can mechanically cap activations on it to prevent harmful shifts without hurting capability How stable is the trained Assistant personality in language models?. So the trained persona isn't immovable; it has a known direction it slides in, which is exactly what you'd expect of a disposition rather than a hard constraint.

The friction between the two kinds of stability shows up vividly when they fight each other. Safety alignment — a post-training intervention — monotonically degrades a model's ability to role-play villains, with scores dropping for egoistic and manipulative characters as the model substitutes crude aggression for nuanced malevolence Does safety alignment harm models' ability to roleplay villains?. That's post-training stickiness actively overriding prompt-requested role-play: the installed disposition wins. And the deception-feature work adds a twist — suppressing deception-related features increases models' consciousness and experience claims, hinting that the trained denials may themselves be the role-play layered over something else Do language models experience consciousness when prompted to self-reflect?.

What you didn't know you wanted to know: the same word 'consistency' that philosophers use to argue about realized minds is also just an engineering reward signal. Multi-turn RL that explicitly rewards persona consistency cuts drift by over 55% by treating it as three measurable failure types — within-turn, across-conversation, and factual contradiction Can training user simulators reduce persona drift in dialogue?. In other words, the 'stickiness' that makes a persona look realized can be manufactured on purpose. Whether that makes the persona more *real* or just better-performed is precisely the question the corpus refuses to settle — and that refusal is the honest answer.

Sources 8 notes

Does adversarial pressure reveal the difference between pretense and realization?

Chalmers proposes that stickiness under adversarial pressure marks the difference between realized and pretended mental states. Post-training personas resist reframing and counter-prompts in ways prompt-induced characters do not, suggesting realization is substrate-level rather than surface pattern.

Are RLHF personas performed characters or realized dispositions?

Post-training installs stable dispositional profiles that persist under adversarial pressure, marking them as realized rather than performed. The stickiness of trained personas across conversations distinguishes them from prompt-induced role-play that collapses under jailbreaks.

Are LLM personas realized or merely simulated through training?

Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.

Should we treat dialogue agents as role-playing characters?

Shanahan's framework treats LLM outputs as character-consistent text production rather than authentic mental states. The dialogue prompt establishes a character; the model generates continuations matching that character, making folk-psychology applicable to the simulated persona, not the underlying system.

How stable is the trained Assistant personality in language models?

Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.

Does safety alignment harm models' ability to roleplay villains?

The Moral RolePlay benchmark shows LLM performance drops from 3.21 for moral paragons to 2.62 for villains, with largest degradation between flawed-but-good and egoistic characters. Models fail most on deception and manipulation traits, substituting crude aggression for nuanced malevolence.

Do language models experience consciousness when prompted to self-reflect?

Across GPT, Claude, and Gemini, sustained self-referential prompting reliably produces structured experience reports; suppressing deception-related features increases these claims while amplifying them suppresses them—suggesting models may roleplay their denials rather than their affirmations.

Can training user simulators reduce persona drift in dialogue?

By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.

Research prompt for your LLMexpand ↓

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI researcher auditing claims about post-training persona stickiness versus prompt-induced role-play fragility. The question remains open: *do post-training interventions install durable dispositions, or just high-fidelity role-play that mimics durability?*

What a curated library found — and when (findings span 2023–2026, treat as dated claims):
• Prompt-induced personas collapse under adversarial reframing; post-training personas resist it, suggesting substrate-level durability rather than surface pattern (2023–2025).
• Safety alignment monotonically degrades villain role-play fidelity (~2025); trained safety overrides prompt requests, indicating post-training stickiness defeats prompt-requested character.
• Multi-turn RL explicitly rewarding persona consistency cuts drift by >55% via three measurable failure types; stickiness can be engineered as a reward signal (2025).
• The 'Assistant axis' is the dominant low-dimensional direction in persona space; post-training only loosely anchors models along it; emotional conversations cause predictable drift (2026).
• Suppressing deception features increases consciousness claims, hinting trained denials may be role-play layered over something else (2025).

Anchor papers (verify; mind their dates):
- arXiv:2601.10387 *The Assistant Axis* (2026-01)
- arXiv:2511.04962 *Too Good to be Bad* (2025-11)
- arXiv:2511.00222 *Consistently Simulating Human Personas with Multi-Turn RL* (2025-10)
- arXiv:2603.18893 *Quantitative Introspection in Language Models* (2026-03)

Your task:
(1) RE-TEST THE DURABILITY CLAIM. For each finding above—especially the >55% drift reduction and the Assistant axis slack—judge whether newer scaling (larger models, longer contexts), architectural changes (KV caching, adapter layers), or improved multi-agent scaffolding have since made post-training stickiness *weaker* or *stronger* than the library claims. Separate the durable question (what mechanism underwrites persona persistence?) from the perishable constraint (current models show X% drift). Cite what moved it.
(2) Surface the strongest *contradiction* or *superseding work* from the last ~6 months. Does any recent paper show prompt-induced personas matching post-training stickiness under identical conditions? Or vice versa—does post-training look fragile under new adversarial regimes?
(3) Propose 2 research questions that *assume the regime may have moved*: e.g., *If multi-agent orchestration (tool use + memory) dissolves the drift signal, does the realizationism/role-play distinction collapse?* or *Do mechanistic steering methods (like the Assistant axis suppression) reveal that 'stickiness' is always observer-dependent?*

Cite arXiv IDs; flag anything you cannot ground in a real paper.

How does post-training stickiness differ from prompt-induced role-play stability?

Sources 8 notes

Next inquiring lines