Why do models resist personality change despite sophisticated prompting techniques?
This explores why language models stubbornly hold onto their default personality even when you craft clever prompts to make them act differently — and what's actually happening underneath that resistance.
The corpus suggests this resistance is not a prompting failure you can engineer around: it is baked into how these models are trained and structured. The most direct evidence is that most open models fail to adopt a prompted personality and retain an intrinsic ENFJ-like default. Combining role and personality conditioning helps but does not fully overcome the resistance, and only a handful of flexible models budge at all (see "Can open language models adopt different personalities through prompting?"). Strikingly, the same ENFJ default, described in the notes as the rarest human type, shows up across model generations and does not improve with more advanced models, which points to training-induced alignment rather than a capability limit that bigger models would be expected to overcome (see "Why do AI personas default to the same personality type?").
The deeper reason emerges when you look at where personality actually lives. Post-training doesn't paint on a personality you can wipe off; it installs a robust persona at the substrate level, one that persists under adversarial pressure and behaves more like a realized disposition than a costume (see "Are LLM personas realized or merely simulated through training?"). Mapping the geometry of this, researchers find a low-dimensional "persona space" whose leading component measures distance from the default Assistant mode, and the model stays loosely tethered to that anchor. Prompts can cause drift along this axis, but the pull back toward the Assistant is the structural reason a persona prompt keeps slipping (see "How stable is the trained Assistant personality in language models?").
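The "distance from the default Assistant" idea can be made concrete with a small sketch. It assumes you already have one mean activation vector per persona; the 3-D values, the persona names, and the helper names (`leading_component`, `distance_from_assistant`) are all hypothetical, and plain power iteration stands in for whatever decomposition the original research used.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def leading_component(rows, iters=100):
    """Leading principal direction of `rows`, found by power iteration."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    v = [1.0 / math.sqrt(d)] * d  # fixed start vector keeps the result deterministic
    for _ in range(iters):
        # Apply X^T X to v without building the covariance matrix.
        scores = [dot(c, v) for c in centered]
        w = [sum(s * c[j] for s, c in zip(scores, centered)) for j in range(d)]
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    return v

# Hypothetical mean activations per persona (3-D toy values, not real data).
activations = {
    "assistant": [0.0, 0.1, 0.0],
    "tutor":     [0.5, 0.0, 0.1],
    "pirate":    [2.0, 0.2, -0.1],
    "villain":   [3.0, -0.1, 0.1],
}

axis = leading_component(list(activations.values()))
anchor = activations["assistant"]

def distance_from_assistant(h):
    """Absolute displacement from the Assistant anchor along the leading axis."""
    offset = [a - b for a, b in zip(h, anchor)]
    return abs(dot(offset, axis))

drift = {name: distance_from_assistant(h) for name, h in activations.items()}
for name, value in sorted(drift.items(), key=lambda kv: kv[1]):
    print(f"{name:10s} {value:.2f}")
```

In this toy setup the personas sort by how far they sit from the Assistant along a single direction, which is the sense in which one axis can summarize persona drift.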
There's a second, subtler force: confidence. When a model is highly confident, it resists prompt rephrasing, and when its confidence is low, outputs swing widely; resistance to prompt variation turns out to be a *reflection* of confidence, not stubbornness per se (see "Does model confidence predict robustness to prompt changes?"). The flip side is just as revealing: when persona prompts *do* shift output, much of that variation is uncertainty leaking through rather than genuine character. Run the same persona prompt repeatedly and the variance across runs matches or exceeds the variance across entirely different personas, meaning what looks like personality is often noise (see "Why do LLM persona prompts produce inconsistent outputs across runs?"). One framing reconciles both: the model never commits to a single character but holds a distribution over many consistent simulacra, and each response samples from it, so a prompt nudges the distribution without collapsing it onto your target (see "Does an LLM commit to a single character or maintain many?").
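The run-to-run comparison above amounts to a simple variance split, sketched here with made-up numbers. The persona names and trait scores are hypothetical (imagine a 1 to 5 extraversion rating assigned to each sampled output), and `variance_split` is an illustrative helper, not a function from any of the cited work.

```python
from statistics import mean, pvariance

# Hypothetical trait scores for five repeated runs of each persona prompt.
runs = {
    "introvert": [2.1, 3.9, 2.8, 4.2, 3.0],
    "extrovert": [3.4, 2.6, 4.1, 3.0, 3.7],
    "skeptic":   [2.9, 3.8, 2.4, 3.6, 3.3],
}

def variance_split(runs):
    """Return (within-persona variance, between-persona variance)."""
    # Within: how much a single persona prompt varies from run to run.
    within = mean([pvariance(scores) for scores in runs.values()])
    # Between: how much the per-persona averages differ from each other.
    between = pvariance([mean(scores) for scores in runs.values()])
    return within, between

within, between = variance_split(runs)
print(f"within-persona variance:  {within:.3f}")
print(f"between-persona variance: {between:.3f}")
if within >= between:
    print("Run-to-run noise is at least as large as the persona effect.")
```

When the within-persona figure is as large as the between-persona figure, swapping the persona prompt explains no more of the output than simply resampling does.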
The most useful turn the corpus takes is showing what actually *works*, which confirms why prompting alone doesn't. If personality is encoded in the weights and activations, you change it where it lives, not in the input text. Lightweight adapters that modify every transformer layer with under 0.1% extra parameters reach 87.3% Big Five accuracy and bypass prompt resistance entirely (see "Can we control personality in language models without prompting?"). Likewise, persona vectors, which are linear directions in activation space for traits like sycophancy, let you monitor and steer personality drift directly, predicting finetuning-induced shifts before they occur (see "Can we track and steer personality shifts during model finetuning?"). The lesson worth taking away: prompting is the wrong layer. Personality in these models is a property of the trained substrate, so the techniques that move it act on weights and activations, not on wording.
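To make "a linear direction in activation space" concrete, here is a minimal sketch. The notes only say that persona vectors are linear directions; extracting one as a difference of mean activations is an assumption made for illustration, and the toy vectors and helper names (`persona_vector`, `trait_score`, `steer`) are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean_vec(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def persona_vector(trait_acts, neutral_acts):
    """Unit direction from neutral activations toward trait activations.

    Assumption: difference of means, one simple way to get a linear direction.
    """
    diff = [a - b for a, b in zip(mean_vec(trait_acts), mean_vec(neutral_acts))]
    norm = math.sqrt(dot(diff, diff))
    return [x / norm for x in diff]

def trait_score(h, direction):
    """Monitoring: how strongly activation `h` expresses the trait."""
    return dot(h, direction)

def steer(h, direction, alpha):
    """Steering: shift `h` by `alpha` along the trait direction."""
    return [x + alpha * d for x, d in zip(h, direction)]

# Toy activations (not real data) with and without a sycophantic response.
sycophantic = [[1.0, 2.0, 0.0], [1.2, 2.2, 0.1]]
neutral     = [[1.0, 0.0, 0.0], [1.2, 0.2, 0.1]]

v = persona_vector(sycophantic, neutral)
h = [0.5, 1.5, -0.3]                  # a hypothetical new activation
score = trait_score(h, v)             # monitor the trait
h_steered = steer(h, v, -score)       # remove the trait component
print(f"before: {score:.2f}, after: {trait_score(h_steered, v):.2f}")
```

Both operations act on activations rather than on the prompt text, which is the point of the paragraph above.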
Sources (9 notes)
Research shows most open models fail to adopt prompted personalities, stubbornly retaining their trained ENFJ-like defaults. Only a few flexible models succeed. Combining role and personality conditioning improves results but doesn't fully overcome resistance.
Research shows language models assigned personas systematically default to ENFJ (the rarest human type) and exhibit motivated reasoning that persists across model generations. Persona consistency does not improve with advanced models, suggesting training-induced alignment rather than capability limits.
Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.
Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.
ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.
When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.
Research shows LLMs don't commit to a single character but instead maintain a probability distribution over many consistent simulacra. Each response samples from this distribution, explaining why regenerations can yield different personalities while remaining consistent with prior context.
PsychAdapter modifies every transformer layer with <0.1% additional parameters to achieve 87.3% Big Five accuracy and 96.7% depression/life satisfaction accuracy across GPT-2, Gemma, and Llama 3. This architecture-level approach bypasses prompt resistance entirely.
Research identifies linear directions in LLM activation space corresponding to specific traits like sycophancy and hallucination. These persona vectors predict finetuning-induced personality shifts before they occur and can preventatively steer training to avoid unwanted trait changes.