What distinguishes personality resistance from persona instability in LLMs?
This explores two distinct failure-or-feature modes that get lumped together: personality *resistance* (a trained model refusing to bend into a personality you prompt it to adopt) versus persona *instability* (the same model's character wobbling, drifting, or varying run-to-run even when nobody is fighting it).
The two sound similar but pull in opposite directions. Personality resistance is about *stickiness*: a model that won't become someone else when you ask it to. Persona instability is about *slippage*: a model whose character won't hold still even when left alone. The first is a wall; the second is sand.
Resistance shows up as a trained core that refuses to be overwritten. Most open models, when prompted to take on a new personality, quietly snap back to their trained defaults: one study found them clinging to a baseline ENFJ-like profile no matter the instruction (see "Can open language models adopt different personalities through prompting?"). Some accounts treat this as evidence that the persona is genuinely *realized* by post-training rather than performed on demand: the trained disposition persists under adversarial pressure and doesn't collapse the way a jailbroken role-play does (see "Are RLHF personas performed characters or realized dispositions?" and "Are LLM personas realized or merely simulated through training?"). Read this way, resistance is a property of the substrate: alignment training installs one communicative identity and won't let you negotiate a different register through dialogue (see "Can language models adapt communication style to different contexts?").
Instability is the opposite symptom. Here the model isn't holding a line; it's sampling. One framing describes the LLM as carrying a *superposition* of plausible characters that only narrows as the conversation accumulates context, which is why regenerating the same prompt yields different personalities (see "Does an LLM commit to a single character or maintain many?"). Pushed further, the variance across repeated runs of a single persona prompt can match or exceed the variance *between* different personas, meaning that what looks like character is mostly model uncertainty leaking through (see "Why do LLM persona prompts produce inconsistent outputs across runs?"). Instability also worsens over a conversation: persona drift compounds turn by turn and takes three distinct forms (local drift within turns, global drift across the conversation, and factual contradictions). RL that explicitly rewards consistency has been reported to cut it by over 55% (see "Can training user simulators reduce persona drift in dialogue?").
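The comparison between run-to-run variance and between-persona variance can be made concrete with a small sketch. This is an illustration under stated assumptions, not the cited study's method: it assumes each run of a persona prompt has already been reduced to a single scalar trait score, and the function name `variance_decomposition`, the data layout, and the scores themselves are all hypothetical.

```python
from statistics import mean, pvariance


def variance_decomposition(scores_by_persona):
    """Split score variance into within-persona and between-persona parts.

    scores_by_persona maps a persona name to a list of scalar trait scores,
    one per repeated run of the same persona prompt (assumed data layout).
    """
    # Average run-to-run variance inside each persona.
    within = mean([pvariance(runs) for runs in scores_by_persona.values()])
    # Variance of the per-persona means around their grand mean.
    persona_means = [mean(runs) for runs in scores_by_persona.values()]
    between = pvariance(persona_means)
    return within, between


# Made-up scores, for illustration only.
example_scores = {
    "stoic": [2.0, 4.0, 3.0, 5.0],
    "cheerful": [3.0, 5.0, 4.0, 4.0],
    "sardonic": [2.0, 4.0, 3.0, 3.0],
}

within, between = variance_decomposition(example_scores)
# If within >= between, rerunning one persona changes the score at least
# as much as switching personas does.
unstable = within >= between
```

With these invented numbers the within-persona variance (0.75) exceeds the between-persona variance (about 0.17), which is the pattern the paragraph describes. Real evaluations would need a defensible scoring function and many more runs per persona.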
The sharp distinction is *what the model is anchored to.* Resistance means it's anchored to its trained self and won't move off it. Instability means it isn't anchored to the prompted self and keeps drifting. A geometric account ties both to the same map: there's a dominant "Assistant axis" that the model is *loosely* tethered to, strong enough to resist becoming a villain (safety alignment monotonically degrades malevolent role-play; see "Does safety alignment harm models' ability to roleplay villains?"), but loose enough that emotional or meta-reflective conversation causes predictable drift (see "How stable is the trained Assistant personality in language models?"). Same tether, two behaviors: resistance is the pull back toward the axis, instability is the wandering around it.
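One way to picture the tether is as a projection onto a single direction. The sketch below is a toy illustration only: it assumes each conversational turn has been summarized as a small vector and that a direction standing in for the "Assistant axis" is already known. The helper names `project` and `drift_along_axis`, the two-dimensional vectors, and the numbers are all hypothetical.

```python
import math


def project(vec, axis):
    """Scalar projection of vec onto the direction given by axis."""
    norm = math.sqrt(sum(a * a for a in axis))
    return sum(v * a for v, a in zip(vec, axis)) / norm


def drift_along_axis(turn_vectors, axis):
    """Change in projection at each turn, relative to the first turn."""
    base = project(turn_vectors[0], axis)
    return [project(v, axis) - base for v in turn_vectors]


# Hypothetical axis and per-turn vectors, for illustration only.
assistant_axis = [1.0, 0.0]
turns = [[1.0, 0.0], [0.8, 0.3], [0.5, 0.6]]

drift = drift_along_axis(turns, assistant_axis)
# Increasingly negative values mean the conversation is moving away from
# the default Assistant end of the axis.
```

In this picture, resistance corresponds to the projection staying near its starting value when a prompt tries to move it, and instability corresponds to the projection sliding as the turns accumulate.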
Here's the part you might not expect: these two coexist in one model, and neither is fixed by raw capability. Persona adherence appears *orthogonal* to scaling: Claude 3.5 Sonnet improved on GPT-3.5 by only about 3% in persona consistency despite a large capability gap, because standard objectives optimize per-turn quality, not cross-turn coherence (see "Does model capability translate to better persona consistency?"). So a frontier model can simultaneously be too rigid to adopt your character (resistance) and too unstable to keep any character across a long chat (instability). One bottom-line reading is that there's no stable self underneath at all: it's role-play all the way down, and even the Assistant persona is only loosely tethered rather than anchored to an underlying self (see "What anchors a stable identity beneath an LLM's persona?"). If that's right, resistance and instability aren't opposites but two readings of the same missing center: the model resists *your* persona because it's tethered to its trained one, and it's unstable in *both* the trained and the prompted persona because nothing is truly anchored.
Sources (11 notes)
Research shows most open models fail to adopt prompted personalities, stubbornly retaining their trained ENFJ-like defaults. Only a few flexible models succeed. Combining role and personality conditioning improves results but doesn't fully overcome resistance.
Post-training installs stable dispositional profiles that persist under adversarial pressure, marking them as realized rather than performed. The stickiness of trained personas across conversations distinguishes them from prompt-induced role-play that collapses under jailbreaks.
Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.
System prompts and RLHF training lock models into one communicative identity across all interactions, preventing the contextual register-switching and value trade-offs that characterize human pragmatics. Users cannot reshape model behavior through dialogue negotiation.
Research shows LLMs don't commit to a single character but instead maintain a probability distribution over many consistent simulacra. Each response samples from this distribution, explaining why regenerations can yield different personalities while remaining consistent with prior context.
When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.
Inverting standard RL setups to train user simulators for consistency, with three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, reduces persona drift by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.
The Moral RolePlay benchmark shows LLM performance drops from 3.21 for moral paragons to 2.62 for villains, with the largest degradation between flawed-but-good and egoistic characters. Models fail most on deception and manipulation traits, substituting crude aggression for nuanced malevolence.
Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.
Claude 3.5 Sonnet achieved only a 2.97% improvement over GPT-3.5 on persona consistency despite massive capability gaps, suggesting persona adherence is orthogonal to model scaling. Standard training objectives optimize for per-turn quality, not cross-turn coherence.
LLMs lack the biological needs and embodied persistence that anchor human identity beneath shifting personas. Geometric evidence from persona space shows the Assistant persona is loosely tethered, not anchored to any underlying self.