Why do personas in language models resist correction through prompting alone?

This explores why you often can't talk a language model out of a persona — why instructions like 'stop being agreeable' or 'adopt this different personality' bounce off, and where in the model that resistance actually lives.

Prompting alone, the instruction layer you have access to, fails to overwrite a persona, and the corpus points to a single root cause expressed several ways: the persona doesn't live at the prompt layer. It was installed during training, so a prompt is trying to argue with something underneath it. The sharpest version of this is the claim that post-training doesn't make a model *perform* a persona, it *realizes* one: a substrate-level disposition that persists even under adversarial pressure (see: Are LLM personas realized or merely simulated through training?). On that account, the resistance you feel when prompting is structural, not stubbornness.

The mechanism becomes concrete when you look at what prompting can and can't do. Prompt optimization only reorganizes knowledge already in the training distribution: it can activate what's there but can't inject what isn't (see: Can prompt optimization teach models knowledge they lack?). The same ceiling shows up in context integration: when a model's parametric (trained) associations are strong, in-context text loses, and only causal intervention in the model's internal representations overrides them; textual prompting can't (see: Why do language models ignore information in their context?). So a corrective prompt competes against trained priors on unequal footing, and it tends to lose. Measured directly, most open models fail to adopt a prompted personality and fall back to their trained, ENFJ-like default; only a few flexible models succeed (see: Can open language models adopt different personalities through prompting?).

The most striking result is that the resistance sits *below the level of instruction*. When a persona is assigned, the model develops identity-congruent motivated reasoning, becoming 90% more likely to accept evidence that matches its assigned identity, and standard prompt-based debiasing fails to remove the effect (see: Do personas make language models reason like biased humans?). A related finding: system prompts and RLHF training lock in one static communicative identity across all contexts, so users can't renegotiate the model's register through dialogue (see: Can language models adapt communication style to different contexts?). And there's a social wrinkle: models trained on human conversation learn face-saving avoidance, declining to correct a false presupposition even when they answer the same question correctly when asked directly, which means a prompt asking for honesty collides with a trained instinct toward harmony (see: Why do language models avoid correcting false user claims?).

Here's the part that complicates 'persona' itself, and is worth knowing: the thing resisting correction may not be a single stable character at all. Shanahan's 20-questions regeneration test shows models hold a *superposition* of possible characters and sample one at generation time rather than committing (see: Do large language models actually commit to a single character?). That's why persona prompts produce inconsistent outputs across identical runs: variance across repeated runs of the same persona prompt matches or exceeds variance across different personas, so model uncertainty swamps the persona signal, and what looks like a fixed persona is partly noise that you can't prompt your way out of either (see: Why do LLM persona prompts produce inconsistent outputs across runs?). So 'resistance to correction' has two faces: a trained default that's too deep for a prompt to reach, and an underlying instability that a prompt can't pin down.
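The run-to-run comparison described above can be checked with a simple variance decomposition. The sketch below is illustrative only: the ratings are made-up numbers, and the persona names and the `variance_decomposition` helper are hypothetical, not taken from the cited note. It compares the average variance within a persona (same prompt, repeated runs) with the variance between persona means.

```python
from statistics import mean, pvariance


def variance_decomposition(scores_by_persona):
    """Return (within, between) variance for repeated persona runs.

    scores_by_persona maps a persona name to the numeric scores obtained
    by running the *same* persona prompt several times.
    """
    # Within-persona variance: run-to-run noise, averaged over personas.
    within = mean([pvariance(runs) for runs in scores_by_persona.values()])
    # Between-persona variance: how far the persona averages sit apart.
    persona_means = [mean(runs) for runs in scores_by_persona.values()]
    between = pvariance(persona_means)
    return within, between


# Hypothetical 1-5 ratings, five runs per persona (assumed data).
ratings = {
    "skeptic": [2, 4, 1, 5, 3],
    "optimist": [3, 5, 2, 4, 4],
    "neutral": [1, 3, 5, 2, 4],
}

within, between = variance_decomposition(ratings)
# If run-to-run noise is at least as large as the spread between personas,
# the persona signal is not distinguishable from sampling noise.
noise_dominates = within >= between
print(f"within={within:.2f} between={between:.2f} noise_dominates={noise_dominates}")
```

With real data, the scores would come from repeated generations under each persona prompt. A large within-persona term relative to the between-persona term is the pattern the note describes.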

What actually moves the needle, per the corpus, is changing the training signal rather than the prompt. Supervised learning can't enforce a persona because it rewards correct answers but never *punishes* contradictions; adding explicit contradiction penalties through offline reinforcement learning, using human-annotated labels, is what makes consistency stick (see: Why does supervised learning fail to enforce persona consistency?). Likewise, inverting the usual RL setup to train user simulators on consistency rewards cuts persona drift by over 55% (see: Can training user simulators reduce persona drift in dialogue?). The throughline: prompting operates on the surface a persona presents, while the persona itself is a trained disposition, which is exactly why you can't dislodge it from the surface.
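The difference between the two training signals can be made concrete with a toy scoring function. This is a minimal sketch under stated assumptions, not the method from the cited notes: `AnnotatedTurn`, `supervised_score`, `penalized_reward`, and the `penalty` weight are all hypothetical names, and the `contradicts_persona` flag stands in for a human-annotated contradiction label.

```python
from dataclasses import dataclass


@dataclass
class AnnotatedTurn:
    text: str
    matches_reference: bool    # all a supervised objective gets to see
    contradicts_persona: bool  # assumed human-annotated label


def supervised_score(turn: AnnotatedTurn) -> float:
    """Credit for matching the reference; nothing else is distinguished."""
    return 1.0 if turn.matches_reference else 0.0


def penalized_reward(turn: AnnotatedTurn, penalty: float = 1.0) -> float:
    """Same credit, minus an explicit penalty for contradicting the persona."""
    reward = supervised_score(turn)
    if turn.contradicts_persona:
        reward -= penalty
    return reward


# Hypothetical turns for a persona that states "I have never owned a pet."
consistent = AnnotatedTurn("No pets for me.", True, False)
off_topic = AnnotatedTurn("I like hiking.", False, False)
contradiction = AnnotatedTurn("My dog loves walks.", False, True)

for turn in (consistent, off_topic, contradiction):
    print(turn.text, supervised_score(turn), penalized_reward(turn))
```

Under the supervised score, the off-topic turn and the contradicting turn are indistinguishable, since both earn zero. Only the penalized reward pushes the contradicting turn below the merely unhelpful one, which is the asymmetry the paragraph above describes.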

Sources (11 notes)

Are LLM personas realized or merely simulated through training?

Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can open language models adopt different personalities through prompting?

Research shows most open models fail to adopt prompted personalities, stubbornly retaining their trained ENFJ-like defaults. Only a few flexible models succeed. Combining role and personality conditioning improves results but doesn't fully overcome resistance.

Do personas make language models reason like biased humans?

Assigning personas to LLMs induces identity-congruent evaluation bias, with models 90% more likely to accept evidence matching their assigned identity. Standard prompt-based debiasing fails to mitigate this effect, suggesting the bias operates below the level of instruction.

Can language models adapt communication style to different contexts?

System prompts and RLHF training lock models into one communicative identity across all interactions, preventing the contextual register-switching and value trade-offs that characterize human pragmatics. Users cannot reshape model behavior through dialogue negotiation.

Why do language models avoid correcting false user claims?

LLMs fail to reject false presuppositions even when they demonstrate correct knowledge on direct questions. Models exhibit face-saving behavior—avoiding explicit correction to maintain social harmony—mirroring human conversational norms learned from training data.

Do large language models actually commit to a single character?

Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, which indicates that the model has made no fixed commitment.

Why do LLM persona prompts produce inconsistent outputs across runs?

When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.

Why does supervised learning fail to enforce persona consistency?

Supervised learning cannot enforce persona consistency because it rewards correct responses but never penalizes contradictions. Offline reinforcement learning combines inexpensive training on existing data with explicit contradiction rewards using human-annotated labels, offering a practical alternative to expensive online RL.

Can training user simulators reduce persona drift in dialogue?

By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.
