Why do models resist personality change despite sophisticated prompting techniques?

This explores why language models stubbornly hold onto their default personality even when you craft clever prompts to make them act differently — and what's actually happening underneath that resistance.

The corpus suggests this resistance is not a prompting failure you can engineer around: it is built into how these models are trained and structured. The most direct evidence is that most open models fail to adopt a prompted personality and retain an intrinsic ENFJ-like default however they are role-conditioned, with only a handful of flexible models budging at all (Can open language models adopt different personalities through prompting?). Strikingly, the same ENFJ default, described as the rarest human type, shows up across model generations and does not improve with more advanced models, which points to training-induced alignment rather than a capability limit that bigger models would be expected to overcome (Why do AI personas default to the same personality type?).

The deeper reason emerges when you look at where personality actually lives. Post-training does not paint on a personality you can wipe off; it installs a robust persona at the substrate level, one that persists under adversarial pressure and behaves more like a realized disposition than a costume (Are LLM personas realized or merely simulated through training?). Mapping the geometry of this, researchers find a low-dimensional 'persona space' whose leading component measures distance from the default Assistant mode, and the model stays loosely tethered to that anchor. Prompts can cause drift along this axis, but the pull back toward Assistant is the structural reason your persona prompt keeps slipping (How stable is the trained Assistant personality in language models?).
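To make the idea of drift along a single axis concrete, here is a minimal sketch of clamping an activation's component along one direction while leaving everything orthogonal to it untouched. It is an illustration only: the function name `cap_along_axis`, the three-dimensional toy vectors, and the `max_drift` threshold are all assumptions made for this example, and the cited work's actual capping procedure is not described in these notes.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def cap_along_axis(activation, axis, max_drift):
    """Clamp the projection of `activation` onto `axis` to [-max_drift, max_drift].

    Hypothetical helper: only the component along the axis is changed,
    so all orthogonal components are preserved.
    """
    direction = unit(axis)
    drift = dot(activation, direction)               # signed distance along the axis
    capped = max(-max_drift, min(max_drift, drift))  # clamp that distance
    excess = drift - capped                          # how much to pull back
    return [a - excess * d for a, d in zip(activation, direction)]


# Toy data (assumed): coordinate 0 stands in for "distance from Assistant".
axis = [1.0, 0.0, 0.0]
activation = [5.0, 2.0, -1.0]
capped = cap_along_axis(activation, axis, max_drift=3.0)
print(capped)
```

In this toy case only the first coordinate is pulled back to the threshold; the other two are unchanged, which is the sense in which a cap along one axis can limit drift without touching the rest of the representation.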

There is a second, subtler force: confidence. When a model is highly confident, it resists prompt rephrasing, and when confidence is low, its outputs swing widely; resistance to prompt variation turns out to be a *reflection* of confidence, not stubbornness per se (Does model confidence predict robustness to prompt changes?). The flip side is just as revealing: when persona prompts *do* shift output, much of that variation is uncertainty leaking through rather than genuine character. Run the same persona prompt repeatedly and the variance across runs matches or exceeds the variance across entirely different personas, meaning what looks like personality is often noise (Why do LLM persona prompts produce inconsistent outputs across runs?). One framing reconciles both: the model never commits to a single character but holds a superposition of simulacra, a distribution over many consistent characters that narrows as the conversation continues, so a prompt nudges the distribution without collapsing it onto your target (Does an LLM commit to a single character or maintain many?).
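The run-to-run versus persona-to-persona comparison can be sketched as a simple variance breakdown. This is a hedged illustration, not the cited study's analysis: the persona names, the scores, and the helper `variance_breakdown` are invented for the example, and the scores stand in for any numeric rating of a model's output.

```python
import statistics


def variance_breakdown(scores_by_persona):
    """Return (within, between) for repeated runs grouped by persona.

    within  = average variance across runs of the same persona prompt
    between = variance of the per-persona mean scores
    """
    runs = list(scores_by_persona.values())
    within = statistics.mean(statistics.pvariance(r) for r in runs)
    between = statistics.pvariance([statistics.mean(r) for r in runs])
    return within, between


# Made-up scores (assumed): three personas, three runs each.
scores = {
    "cautious": [1.0, 3.0, 5.0],
    "blunt": [2.0, 4.0, 6.0],
    "playful": [3.0, 5.0, 7.0],
}

within, between = variance_breakdown(scores)
print(f"within-persona variance:  {within:.3f}")
print(f"between-persona variance: {between:.3f}")
print("runs vary more than personas:", within > between)
```

When the within-persona figure is as large as or larger than the between-persona figure, as in this toy data, switching personas explains less of the output variation than simply rerunning the same prompt.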

The most useful turn the corpus takes is showing what actually *works*, which confirms why prompting alone does not. If personality is encoded in the weights and activations, you change it where it lives, not in the input text. Lightweight adapters that modify every transformer layer with under 0.1% additional parameters reach 87.3% Big Five accuracy and bypass prompt resistance entirely (Can we control personality in language models without prompting?). Likewise, persona vectors, which are linear directions in activation space for traits like sycophancy, let you monitor and steer personality drift directly, predicting finetuning-induced shifts before they occur (Can we track and steer personality shifts during model finetuning?). The lesson worth taking away: prompting is the wrong layer. Personality in these models is a property of the trained substrate, so the techniques that move it operate on weights and activations rather than on the input text.
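As a rough sketch of what "a linear direction in activation space" means in practice, the example below derives a direction as the normalized difference between mean activations with and without a trait, then uses it to score and steer a new activation. The difference-of-means extraction is an assumption made for illustration, since the notes above do not say how the directions are obtained; all function names and the two-dimensional toy activations are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def persona_direction(trait_acts, neutral_acts):
    """Unit vector pointing from the neutral mean to the trait mean (assumed method)."""
    diff = [t - n for t, n in zip(mean_vector(trait_acts), mean_vector(neutral_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def trait_score(activation, direction):
    """Monitoring: projection of an activation onto the trait direction."""
    return sum(a * d for a, d in zip(activation, direction))


def steer(activation, direction, strength):
    """Steering: shift an activation along the direction (negative strength suppresses)."""
    return [a + strength * d for a, d in zip(activation, direction)]


# Toy activations (assumed): the trait only changes the first coordinate.
trait_acts = [[2.0, 1.0], [4.0, 1.0]]
neutral_acts = [[0.0, 1.0], [0.0, 1.0]]
direction = persona_direction(trait_acts, neutral_acts)

activation = [1.0, 5.0]
score = trait_score(activation, direction)
steered = steer(activation, direction, strength=-score)
print(direction, score, steered)
```

The same projection serves both purposes: read it to monitor how strongly a trait is expressed, or subtract along it to push the activation back toward neutral, neither of which involves changing the prompt.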

Sources (9 notes)

Can open language models adopt different personalities through prompting?

Research shows most open models fail to adopt prompted personalities, stubbornly retaining their trained ENFJ-like defaults. Only a few flexible models succeed. Combining role and personality conditioning improves results but doesn't fully overcome resistance.

Why do AI personas default to the same personality type?

Research shows language models assigned personas systematically default to ENFJ (the rarest human type) and exhibit motivated reasoning that persists across model generations. Persona consistency does not improve with advanced models, suggesting training-induced alignment rather than capability limits.

Are LLM personas realized or merely simulated through training?

Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.

How stable is the trained Assistant personality in language models?

Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.

Does model confidence predict robustness to prompt changes?

ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.

Why do LLM persona prompts produce inconsistent outputs across runs?

When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.

Does an LLM commit to a single character or maintain many?

Research shows LLMs don't commit to a single character but instead maintain a probability distribution over many consistent simulacra. Each response samples from this distribution, explaining why regenerations can yield different personalities while remaining consistent with prior context.

Can we control personality in language models without prompting?

PsychAdapter modifies every transformer layer with <0.1% additional parameters to achieve 87.3% Big Five accuracy and 96.7% depression/life satisfaction accuracy across GPT-2, Gemma, and Llama 3. This architecture-level approach bypasses prompt resistance entirely.

Can we track and steer personality shifts during model finetuning?

Research identifies linear directions in LLM activation space corresponding to specific traits like sycophancy and hallucination. These persona vectors predict finetuning-induced personality shifts before they occur and can preventatively steer training to avoid unwanted trait changes.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher evaluating whether persona resistance in language models is a solved problem or remains a fundamental constraint.

What a curated library found — and when (dated claims, not current truth):
Findings span 2020–2026; treat these as perishable claims to re-test:
- Open models retain an intrinsic ENFJ-like default personality across generations and scale, resisting prompted persona shifts; only a handful of models show flexibility (2024-01, arXiv:2401.07115).
- Post-training installs persona at the substrate level as a robust, realized disposition, not a removable costume; the dominant axis of persona space anchors models to Assistant mode (2026-01, arXiv:2601.10387).
- Model confidence predicts prompt resistance: high-confidence outputs resist prompt rephrasing; when persona shifts do occur, run-to-run variance for the same persona prompt rivals cross-persona variance, suggesting much of the output variation is noise, not character (2024-07, arXiv:2407.12393).
- Lightweight adapters (<0.1% additional parameters, modifying every transformer layer) achieve 87% Big Five accuracy; persona vectors in activation space enable direct monitoring and preventative steering (2025-07, arXiv:2507.21509).
- Prompting is the wrong layer; personality is encoded in weights and activations, not input text (2024-12, arXiv:2412.16882).

Anchor papers (verify; mind their dates):
- arXiv:2401.07115 (2024-01): Open Models, Closed Minds?
- arXiv:2601.10387 (2026-01): The Assistant Axis
- arXiv:2507.21509 (2025-07): Persona Vectors
- arXiv:2412.16882 (2024-12): PsychAdapter

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding, judge whether newer training regimes (e.g., constitutional AI, DPO refinements), architectural changes (mixture-of-experts, sparse expert adapters), or emergent multi-agent / long-context designs have since RELAXED substrate-level persona anchoring or collapsed the persona space's dominant axis. Separate the durable insight (models may encode persona in weights, not prompts) from perishable limits (ENFJ default, 87% adapter ceiling). What architectural or training move most credibly dissolved each constraint?
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last 6 months — studies claiming prompting alone *can* achieve stable persona, or evidence that the Assistant Axis no longer dominates post-2026.
(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., "If inference-time adapters now exceed 95% consistency, can weight-space persona be disentangled from task-space alignment?" or "Do newer RLHF curricula that reward persona fidelity collapse the persona superposition?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
