What distinguishes personality resistance from persona instability in LLMs?

This inquiry separates two modes, failures or features depending on what you want, that often get lumped together: personality *resistance* (a trained model refusing to bend into a personality you prompt it to adopt) and persona *instability* (the same model's character wobbling, drifting, or varying run to run even when nobody is fighting it).

The two sound similar but pull in opposite directions. Personality resistance is about *stickiness*: a model that won't become someone else when you ask it to. Persona instability is about *slippage*: a model whose character won't hold still even when left alone. The first is a wall; the second is sand.

Resistance shows up as a trained core that refuses to be overwritten. Most open models, when prompted to take on a new personality, quietly snap back to their trained defaults: one study found them clinging to a baseline ENFJ-like profile no matter the instruction (see "Can open language models adopt different personalities through prompting?"). Some accounts treat this as evidence that the persona is genuinely *realized* by post-training rather than performed on demand: the trained disposition persists under adversarial pressure and doesn't collapse the way a jailbroken role-play does (see "Are RLHF personas performed characters or realized dispositions?" and "Are LLM personas realized or merely simulated through training?"). Read this way, resistance is a property of the substrate: alignment training installs one communicative identity and won't let you negotiate a different register through dialogue (see "Can language models adapt communication style to different contexts?").

Instability is the opposite symptom. Here the model isn't holding a line; it's sampling. One framing describes the LLM as carrying a *superposition* of plausible characters that only narrows as the conversation accumulates context, which is why regenerating the same prompt yields different personalities (see "Does an LLM commit to a single character or maintain many?"). Pushed further, the variance across repeated runs of a single persona prompt can match or exceed the variance *between* different personas, meaning that what looks like character is mostly model uncertainty leaking through (see "Why do LLM persona prompts produce inconsistent outputs across runs?"). Instability also worsens over a conversation: persona drift compounds turn by turn, with distinct local, global, and factual-contradiction failure types, and it can be cut by over 55% through RL that explicitly rewards consistency (see "Can training user simulators reduce persona drift in dialogue?").
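To make the within-versus-between comparison concrete, here is a minimal Python sketch. The persona names, the 1-to-5 trait scores, and the `variance_decomposition` helper are all hypothetical, invented for illustration; they are not data or code from the cited note.

```python
from statistics import mean, pvariance

def variance_decomposition(scores_by_persona):
    """Return (within, between) variance for repeated runs of several persona prompts.

    within  = average run-to-run variance for a fixed persona prompt
    between = variance of the per-persona mean scores
    """
    within = mean(pvariance(runs) for runs in scores_by_persona.values())
    persona_means = [mean(runs) for runs in scores_by_persona.values()]
    between = pvariance(persona_means)
    return within, between

# Hypothetical trait scores (say, an extraversion rating on a 1-5 scale)
# from four regenerations of each persona prompt. Made-up numbers.
scores = {
    "cheerful_host": [4.0, 2.0, 5.0, 3.0],
    "stern_auditor": [3.0, 5.0, 1.0, 3.0],
    "shy_poet": [2.0, 4.0, 3.0, 3.0],
}

within, between = variance_decomposition(scores)
print(f"within-persona variance:  {within:.3f}")
print(f"between-persona variance: {between:.3f}")
```

With these made-up numbers the run-to-run variance (1.25) is far larger than the variance between persona means (about 0.06), which is the pattern described above: regenerating one persona moves the score more than switching personas does.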

The sharp distinction is *what the model is anchored to.* Resistance means it's anchored to its trained self and won't move off it. Instability means it isn't anchored to the prompted self and keeps drifting. A geometric account ties both to the same map: there's a dominant "Assistant axis" that the model is *loosely* tethered to. The tether is strong enough to resist becoming a villain (safety alignment degrades role-play steadily as characters grow more malevolent; see "Does safety alignment harm models' ability to roleplay villains?"), but loose enough that emotional or meta-reflective conversation causes predictable drift (see "How stable is the trained Assistant personality in language models?"). Same tether, two behaviors: resistance is the pull back toward the axis, instability is the wandering around it.

Here's the part you might not expect: these two coexist in one model, and neither is fixed by raw capability. Persona adherence is *orthogonal* to scaling: a far more capable model bought only ~3% better consistency, because standard objectives optimize per-turn quality, not cross-turn coherence (see "Does model capability translate to better persona consistency?"). So a frontier model can simultaneously be too rigid to adopt your character (resistance) and too unstable to keep any character across a long chat (instability). One bottom-line reading is that there's no stable self underneath at all: it's role-play all the way down, with even the Assistant persona only loosely tethered (see "What anchors a stable identity beneath an LLM's persona?"). If that's right, resistance and instability aren't opposites but two readings of the same missing center: the model resists *your* persona because it's tethered to its trained one, and it's unstable in *both* personas because nothing is truly anchored.

Sources (11 notes)

Can open language models adopt different personalities through prompting?

Research shows most open models fail to adopt prompted personalities, stubbornly retaining their trained ENFJ-like defaults. Only a few flexible models succeed. Combining role and personality conditioning improves results but doesn't fully overcome resistance.

Are RLHF personas performed characters or realized dispositions?

Post-training installs stable dispositional profiles that persist under adversarial pressure, marking them as realized rather than performed. The stickiness of trained personas across conversations distinguishes them from prompt-induced role-play that collapses under jailbreaks.

Are LLM personas realized or merely simulated through training?

Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.

Can language models adapt communication style to different contexts?

System prompts and RLHF training lock models into one communicative identity across all interactions, preventing the contextual register-switching and value trade-offs that characterize human pragmatics. Users cannot reshape model behavior through dialogue negotiation.

Does an LLM commit to a single character or maintain many?

Research shows LLMs don't commit to a single character but instead maintain a probability distribution over many consistent simulacra. Each response samples from this distribution, explaining why regenerations can yield different personalities while remaining consistent with prior context.
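One way to picture a distribution over characters that narrows with context is a toy Bayesian update. The sketch below is an illustrative assumption, not the mechanism the research claims: the character names, the likelihood values, and both helper functions are hypothetical.

```python
import math

def entropy_bits(dist):
    """Shannon entropy in bits; lower means fewer characters remain plausible."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def condition_on_turn(dist, likelihood):
    """Reweight each character by how well it explains the new turn, then renormalize."""
    unnormalized = {name: dist[name] * likelihood[name] for name in dist}
    total = sum(unnormalized.values())
    return {name: weight / total for name, weight in unnormalized.items()}

# Hypothetical starting point: four equally plausible characters.
prior = {"mentor": 0.25, "jester": 0.25, "skeptic": 0.25, "caretaker": 0.25}

# Hypothetical likelihoods: how compatible one observed turn is with each character.
turn_likelihood = {"mentor": 0.8, "jester": 0.4, "skeptic": 0.2, "caretaker": 0.2}

posterior = condition_on_turn(prior, turn_likelihood)
print(f"entropy before the turn: {entropy_bits(prior):.2f} bits")
print(f"entropy after the turn:  {entropy_bits(posterior):.2f} bits")
```

In this toy setup the entropy falls from 2.00 to 1.75 bits after a single turn: the distribution has narrowed, but several characters remain live, so a regeneration at this point could still land on a different one.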

Why do LLM persona prompts produce inconsistent outputs across runs?

When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.

Can training user simulators reduce persona drift in dialogue?

Inverting the standard RL setup to train user simulators for consistency, with three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, decreases persona drift by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.

Does safety alignment harm models' ability to roleplay villains?

The Moral RolePlay benchmark shows LLM performance drops from 3.21 for moral paragons to 2.62 for villains, with the largest degradation between flawed-but-good and egoistic characters. Models fail most on deception and manipulation traits, substituting crude aggression for nuanced malevolence.

How stable is the trained Assistant personality in language models?

Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.
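The summary above does not say how activation capping is implemented. One plausible reading, sketched here purely as an assumption, is to clamp a hidden state's projection onto the axis to a fixed range while leaving every other direction untouched. The function name `cap_along_axis`, the axis vector, and the bounds are hypothetical.

```python
import math

def cap_along_axis(hidden, axis, lower, upper):
    """Clamp the component of `hidden` along `axis` to [lower, upper].

    Only the projection onto the axis changes; the orthogonal part is preserved.
    """
    norm = math.sqrt(sum(a * a for a in axis))
    unit = [a / norm for a in axis]
    projection = sum(h * u for h, u in zip(hidden, unit))
    capped = min(max(projection, lower), upper)
    shift = capped - projection
    return [h + shift * u for h, u in zip(hidden, unit)]

# Hypothetical 2-D example: the axis points along (3, 4), and this hidden
# state projects to 10.0 on it, outside the allowed range [-5, 5].
axis = [3.0, 4.0]
hidden = [2.0, 11.0]
capped_hidden = cap_along_axis(hidden, axis, lower=-5.0, upper=5.0)
print(capped_hidden)
```

Here the projection is pulled back from 10.0 to 5.0, and the component perpendicular to the axis is unchanged, which is the sense in which a cap of this kind could limit drift along one direction without disturbing the rest of the representation.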

Does model capability translate to better persona consistency?

Claude 3.5 Sonnet achieved only a 2.97% improvement over GPT-3.5 on persona consistency despite massive capability gaps, suggesting persona adherence is orthogonal to model scaling. Standard training objectives optimize for per-turn quality, not cross-turn coherence.

What anchors a stable identity beneath an LLM's persona?

LLMs lack the biological needs and embodied persistence that anchor human identity beneath shifting personas. Geometric evidence from persona space shows the Assistant persona is loosely tethered, not anchored to any underlying self.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-examining the distinction between personality resistance and persona instability in LLMs. The question remains: what *fundamentally* separates a model's refusal to adopt a new character from its inability to maintain any character consistently?

What a curated library found — and when (dated claims, not current truth): Findings span 2023–2026 and include:
• Open models show strong baseline personality resistance, clinging to ENFJ-like profiles regardless of instructions (2024).
• Persona instability persists across runs, with variance within a single persona matching variance between personas, suggesting character is mostly model uncertainty (2024–2025).
• Multi-turn RL for consistency reduces persona drift by over 55%, treating it as an alignment problem (2025).
• Persona adherence scales only ~3% with model capability; frontier models are simultaneously rigid and unstable (2025).
• Safety alignment monotonically degrades malevolent role-play, suggesting a dominant "Assistant axis" anchors the model loosely (2025–2026).

Anchor papers (verify; mind their dates):
• arXiv:2401.07115 (2024) — Open Models, Closed Minds?
• arXiv:2511.00222 (2025) — Consistently Simulating Human Personas with Multi-Turn RL
• arXiv:2601.10387 (2026) — The Assistant Axis: Situating and Stabilizing Default Persona
• arXiv:2507.21509 (2025) — Persona Vectors: Monitoring and Controlling Character Traits

Your task:
(1) RE-TEST: For each claim above, ask whether post-2026 scaling, instruction-tuning refinements, or new multi-modal/multi-agent orchestration have dissolved the orthogonality between capability and persona adherence. Does the "Assistant axis" remain the dominant constraint, or have newer models learned to decouple trained identity from prompted identity? Flag which constraints still visibly hold and which may have relaxed.
(2) SURFACE DISAGREEMENT: Identify the strongest recent work (last 6 months) that contradicts the "no stable self" conclusion or the tether metaphor. Does newer work propose active mechanisms (e.g., persona vectors, discourse graphs) that *do* stabilize character, threatening the "role-play all the way down" reading?
(3) PROPOSE: Two research questions that assume the regime may have shifted — e.g., (a) Can explicit persona anchoring (e.g., via learned embeddings or memory modules) break the correlation between resistance and instability? (b) Does multi-agent dialogue, where one agent maintains persistent identity and another adopts roles, reveal whether instability is a single-model property or an artifact of turn-level sampling?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
