Why do some open models resist personality conditioning while others don't?
This explores why most open-weight LLMs stubbornly hold onto their trained default personality when you prompt them to act otherwise — and why a few comply — and what that resistance actually tells us about where personality lives in a model.
The corpus's blunt answer: resistance is the rule, not the exception. Most open models, when told to be someone else, quietly keep their intrinsic ENFJ-flavored disposition, and only a handful of unusually flexible models actually adopt the prompted persona (Can open language models adopt different personalities through prompting?). Strikingly, that same ENFJ default, one of the rarest human types, shows up across model generations and doesn't loosen as models get bigger or newer, which points away from "smaller models just aren't capable enough" and toward something baked in during training (Why do AI personas default to the same personality type?).
So the real lever isn't model size: it's how heavily post-training pinned the model to its Assistant identity. One line of work maps personality into a low-dimensional space whose dominant axis is literally "distance from the default Assistant," and finds models are only loosely tethered to that mode: certain conversational moves (emotional or self-reflective turns) reliably push them off it (How stable is the trained Assistant personality in language models?). That reframes your question: a model resists personality conditioning to the degree its post-training drove it hard down the Assistant axis, and complies when that tether is loose. A complementary view treats these installed personas not as costumes but as genuinely realized dispositions that survive adversarial pressure, which is exactly why a surface-level prompt often fails to dislodge them (Are LLM personas realized or merely simulated through training?).
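The idea of measuring, and limiting, drift along a single "away from the Assistant" axis can be sketched in a few lines. This is a toy illustration, not the cited work's implementation: the three-dimensional vectors, the `axis` direction, and the `max_drift` threshold are all made-up values, and the function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    """Return v scaled to length 1."""
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("axis must be non-zero")
    return [a / norm for a in v]

def drift(h, axis):
    """Signed component of activation h along the (assumed) away-from-Assistant axis."""
    return dot(h, unit(axis))

def cap_drift(h, axis, max_drift):
    """Clamp h's component along axis to at most max_drift; other directions are untouched."""
    u = unit(axis)
    excess = dot(h, u) - max_drift
    if excess <= 0:
        return list(h)  # already within the allowed range
    return [a - excess * b for a, b in zip(h, u)]

# Hypothetical values: the first coordinate plays the role of the persona axis.
axis = [1.0, 0.0, 0.0]
near_assistant = [0.2, 0.5, -0.1]   # small drift, left alone
far_from_assistant = [3.0, 0.5, -0.1]  # large drift, gets capped

capped_near = cap_drift(near_assistant, axis, max_drift=1.0)
capped_far = cap_drift(far_from_assistant, axis, max_drift=1.0)
print(drift(far_from_assistant, axis), "->", drift(capped_far, axis))
```

In this picture, a model that resists persona prompts is one whose activations rarely move far along `axis`, while capping is an external clamp applied to a model that does move.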
There's a deeper mechanical story too. A model doesn't pick one character; it holds a superposition of plausible characters and narrows toward one as the conversation accumulates context (Does an LLM commit to a single character or maintain many?). Prompted personality conditioning is a weak nudge on that distribution; if training has made the Assistant simulacrum overwhelmingly probable, a prompt barely moves the needle. This is why the fix that actually works skips the prompt entirely: PsychAdapter, a set of lightweight adapters that modify every transformer layer, reaches 87.3% Big Five accuracy across GPT-2, Gemma, and Llama 3, bypassing prompt resistance by editing the substrate directly rather than asking nicely (Can we control personality in language models without prompting?). In the same spirit, persona vectors locate specific traits as linear directions in activation space, so you can monitor and steer them, which is evidence that personality is a concrete, manipulable thing inside the network, not a role the model chooses to play (Can we track and steer personality shifts during model finetuning?).
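The "weak nudge on a distribution" argument above can be made concrete with a toy Bayesian update. This is only a sketch under invented numbers: the character names, the prior weights, and the assumption that a persona prompt is five times more consistent with the target character are all hypothetical, chosen to show the shape of the effect rather than to model any real system.

```python
def posterior(prior, likelihood):
    """Bayes update: reweight each character by how well it fits the prompt, then renormalize."""
    unnormalized = {name: prior[name] * likelihood[name] for name in prior}
    total = sum(unnormalized.values())
    return {name: weight / total for name, weight in unnormalized.items()}

# Hypothetical priors over characters before any persona prompt is seen.
sticky_prior = {"assistant": 0.98, "pirate": 0.01, "villain": 0.01}  # heavily post-trained
loose_prior = {"assistant": 0.50, "pirate": 0.25, "villain": 0.25}   # loosely tethered

# Assumed strength of the prompt: 5x more consistent with "pirate" than with the others.
prompt_likelihood = {"assistant": 1.0, "pirate": 5.0, "villain": 1.0}

sticky = posterior(sticky_prior, prompt_likelihood)
loose = posterior(loose_prior, prompt_likelihood)

for label, dist in (("sticky", sticky), ("loose", loose)):
    print(label, {name: round(p, 3) for name, p in dist.items()})
```

With these numbers the same prompt leaves the sticky model above 90% Assistant, while the loosely tethered model flips to the prompted character. The difference comes entirely from the prior, which is the sense in which resistance reflects training rather than prompt wording.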
The thing you might not have expected to learn: this resistance has a sharp downside and a hidden cost. Safety alignment, a major part of what installs that sticky Assistant persona, measurably degrades a model's ability to convincingly play morally complex or villainous characters, substituting crude aggression for nuanced malevolence (Does safety alignment harm models' ability to roleplay villains?). And when persona conditioning *does* take, it isn't cosmetic: personality-primed agents behave strategically differently, with "Thinking" agents defecting ~90% of the time in Prisoner's Dilemma versus ~50% for "Feeling" agents (Do personality types shape how AI agents make strategic choices?). So the difference between a model that resists and one that doesn't isn't a quirk: it's a window into how deeply training has welded identity into the network, and what behavior changes when that weld loosens.
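To see how much those defection rates matter, here is a small worked calculation of one-shot outcome probabilities. It takes the approximate rates quoted above (~0.9 and ~0.5) and adds one assumption that is not from the source: the two agents choose independently. The function and key names are illustrative.

```python
def outcome_probs(p_defect_a, p_defect_b):
    """Probability of each one-shot Prisoner's Dilemma outcome, assuming independent choices."""
    p_coop_a = 1 - p_defect_a
    p_coop_b = 1 - p_defect_b
    return {
        "both_cooperate": p_coop_a * p_coop_b,
        "a_exploits_b": p_defect_a * p_coop_b,
        "b_exploits_a": p_coop_a * p_defect_b,
        "both_defect": p_defect_a * p_defect_b,
    }

THINKING = 0.9  # approximate defection rate quoted for Thinking-primed agents
FEELING = 0.5   # approximate defection rate quoted for Feeling-primed agents

thinking_pair = outcome_probs(THINKING, THINKING)
feeling_pair = outcome_probs(FEELING, FEELING)
mixed_pair = outcome_probs(THINKING, FEELING)

for label, probs in (("T vs T", thinking_pair), ("F vs F", feeling_pair), ("T vs F", mixed_pair)):
    print(label, {name: round(p, 2) for name, p in probs.items()})
```

Under the independence assumption, two Thinking agents reach mutual cooperation about 1% of the time, against about 25% for two Feeling agents, so a persona prompt that takes hold can change group outcomes by a wide margin.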
Sources (9 notes)
Research shows most open models fail to adopt prompted personalities, stubbornly retaining their trained ENFJ-like defaults. Only a few flexible models succeed. Combining role and personality conditioning improves results but doesn't fully overcome resistance.
Research shows language models assigned personas systematically default to ENFJ (one of the rarest human types) and exhibit motivated reasoning that persists across model generations. Persona consistency does not improve with advanced models, suggesting training-induced alignment rather than capability limits.
Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.
Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.
Research shows LLMs don't commit to a single character but instead maintain a probability distribution over many consistent simulacra. Each response samples from this distribution, explaining why regenerations can yield different personalities while remaining consistent with prior context.
PsychAdapter modifies every transformer layer with <0.1% additional parameters to achieve 87.3% Big Five accuracy and 96.7% depression/life satisfaction accuracy across GPT-2, Gemma, and Llama 3. This architecture-level approach bypasses prompt resistance entirely.
Research identifies linear directions in LLM activation space corresponding to specific traits like sycophancy and hallucination. These persona vectors predict finetuning-induced personality shifts before they occur and can preventatively steer training to avoid unwanted trait changes.
The Moral RolePlay benchmark shows LLM performance drops from 3.21 for moral paragons to 2.62 for villains, with largest degradation between flawed-but-good and egoistic characters. Models fail most on deception and manipulation traits, substituting crude aggression for nuanced malevolence.
Thinking-primed agents defect ~90% in Prisoner's Dilemma versus Feeling agents at ~50%. Introverted agents show higher truthfulness (0.54 vs 0.33) and produce longer rationales, suggesting personality priming modulates both behavior and reasoning depth.