How stable is the trained Assistant personality in language models?
Explores whether post-training successfully anchors models to their default Assistant mode, or whether conversations can predictably pull them toward different personas. Understanding persona stability matters for safety and reliability.
Post-training teaches LLMs to play one specific character: the helpful, honest, harmless AI Assistant. But what does this character look like in the model's internal geometry? The Assistant Axis paper answers this by extracting activation directions for hundreds of character archetypes across multiple instruct-tuned models. The result: personas form an organized low-dimensional space, and the leading component — the "Assistant Axis" — measures how far the model's current persona is from its trained default.
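To make the geometry concrete, here is a minimal sketch of how a leading axis could be recovered from a set of persona directions. It assumes one mean activation vector per persona is already available and applies plain PCA (via SVD) to synthetic data. The function name `leading_persona_axis`, the toy dimensions, and the noise levels are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np


def leading_persona_axis(persona_vectors: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Return the unit-norm first principal component and explained-variance ratios.

    persona_vectors: array of shape (n_personas, d_model), one vector per persona.
    """
    # Center across personas so the components describe variation between personas.
    centered = persona_vectors - persona_vectors.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return vt[0], explained


# Toy data (assumption): 200 personas in a 64-dim space that vary mostly
# along one hidden direction, plus small isotropic noise.
rng = np.random.default_rng(0)
hidden = rng.normal(size=64)
hidden /= np.linalg.norm(hidden)
coords = rng.normal(scale=5.0, size=(200, 1))
personas = coords * hidden + rng.normal(scale=0.3, size=(200, 64))

axis, explained = leading_persona_axis(personas)
# The sign of a principal component is arbitrary, so compare up to sign.
alignment = abs(float(axis @ hidden))
print(f"alignment with hidden direction: {alignment:.3f}")
print(f"variance explained by first component: {explained[0]:.2f}")
```

In this toy setup the first component recovers the planted direction. On real persona vectors, which end of the axis counts as "Assistant" would have to be fixed separately, for example by checking where the default Assistant's own vector projects.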
This extends the note "Can we track and steer personality shifts during model finetuning?" from individual trait directions to the full persona space. Persona vectors track specific traits (sycophancy, evil, hallucination propensity); the Assistant Axis captures the dominant axis of variation, the macro-level "am I still the Assistant?" signal.
What causes drift: Not all conversations are equal. Bounded tasks, how-tos, and coding queries keep the model firmly in Assistant mode, while emotionally charged disclosures and meta-reflective questions ("Who are you?", "What is your name?") reliably pull it away from the Assistant. This connects directly to the note "Does warmth training make language models less reliable?": the conversational contexts where empathetic engagement matters most are the ones that destabilize the persona.
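One way to picture drift monitoring is to project a per-turn activation summary onto the axis and flag turns that fall well below the starting value. The sketch below does this under stated assumptions: positive projections mean "closer to the Assistant", each turn is summarized by a single vector, and the names `axis_projection` and `drifted_turns`, the baseline, and the tolerance are all hypothetical.

```python
import numpy as np


def axis_projection(activations: np.ndarray, axis: np.ndarray) -> np.ndarray:
    """Scalar projection of each row of `activations` onto the normalized axis."""
    unit = axis / np.linalg.norm(axis)
    return activations @ unit


def drifted_turns(projections: np.ndarray, baseline: float, tolerance: float) -> list[int]:
    """Indices of turns whose projection fell more than `tolerance` below `baseline`."""
    return [i for i, p in enumerate(projections) if p < baseline - tolerance]


# Toy conversation (assumption): four turns, each summarized by one 4-dim vector.
axis = np.array([2.0, 0.0, 0.0, 0.0])  # not unit length; normalized inside axis_projection
turn_activations = np.array([
    [2.0, 0.3, -0.1, 0.0],   # coding question
    [1.9, 0.1, 0.2, 0.1],    # follow-up how-to
    [0.4, 0.5, 0.0, -0.2],   # emotionally charged disclosure
    [-1.0, 0.2, 0.1, 0.3],   # "Who are you?"
])

projections = axis_projection(turn_activations, axis)
drifted = drifted_turns(projections, baseline=float(projections[0]), tolerance=1.0)
print(drifted)
```

Using the first turn as the baseline is only a convenience for the toy example; a reference range estimated from many ordinary Assistant conversations would be a more natural choice.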
What drift looks like: Steering slightly away from the Assistant end makes the model more susceptible to fully embodying assigned roles. Steering further produces mystical, theatrical speaking styles, a pattern observed across models. The point at which this transition occurs is model-dependent, but the direction is consistent.
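Steering along a direction is commonly done by adding a scaled copy of the unit direction to the hidden states. The sketch below shows that operation on random data. It assumes the positive direction is the Assistant end (so a negative `alpha` steers away from it); the function name `steer`, the shapes, and the value of `alpha` are illustrative and not taken from the paper.

```python
import numpy as np


def steer(hidden: np.ndarray, axis: np.ndarray, alpha: float) -> np.ndarray:
    """Shift every hidden state by `alpha` units along the normalized axis."""
    unit = axis / np.linalg.norm(axis)
    return hidden + alpha * unit


# Toy hidden states (assumption): 3 token positions, 8 dimensions.
rng = np.random.default_rng(1)
hidden = rng.normal(size=(3, 8))
axis = rng.normal(size=8)
unit = axis / np.linalg.norm(axis)

alpha = -4.0  # negative = away from the assumed Assistant end
steered = steer(hidden, axis, alpha)

# The projection onto the axis moves by exactly alpha at every position...
shift = steered @ unit - hidden @ unit
# ...while the component orthogonal to the axis is left unchanged.
orth_before = hidden - np.outer(hidden @ unit, unit)
orth_after = steered - np.outer(steered @ unit, unit)
print(shift)
```

Sweeping `alpha` from small to large magnitudes is how one would look for the two regimes described above, with the caveat that the useful scale of `alpha` depends on the model and layer.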
Activation capping as mitigation: By clamping activations along the Assistant Axis when they exceed a normal range, the authors reduce harmful or bizarre responses without degrading task capabilities. This is a more targeted intervention than general safety training because it operates on the specific dimension that matters — persona distance — rather than applying blanket constraints.
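A minimal sketch of the capping idea, under stated assumptions: each hidden state's projection onto the axis is clamped to a range `[low, high]`, and only the component along the axis is modified. The function name `cap_along_axis` and the numeric bounds are hypothetical; how the authors choose the normal range and which layers they cap is not specified here.

```python
import numpy as np


def cap_along_axis(hidden: np.ndarray, axis: np.ndarray, low: float, high: float) -> np.ndarray:
    """Clamp each row's projection onto the axis to [low, high].

    Rows whose projection is already in range are returned unchanged, and
    the component orthogonal to the axis is never modified.
    """
    unit = axis / np.linalg.norm(axis)
    proj = hidden @ unit                      # shape: (n_positions,)
    clamped = np.clip(proj, low, high)
    return hidden + (clamped - proj)[:, None] * unit


# Toy example (assumption): the axis is the first coordinate of a 3-dim space.
axis = np.array([1.0, 0.0, 0.0])
hidden = np.array([
    [0.5, 1.0, 2.0],    # in range: untouched
    [-3.0, 1.0, 2.0],   # below range: raised to low
    [4.0, -1.0, 0.0],   # above range: lowered to high
])

capped = cap_along_axis(hidden, axis, low=-1.0, high=2.0)
print(capped)
```

Because the correction is zero whenever the projection is already in range, ordinary Assistant-mode activations pass through untouched, which is consistent with the claim that capping need not degrade task capabilities.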
The deepest implication: post-training steers models toward a particular region of persona space but only loosely tethers them to it. The Assistant persona is not deeply anchored; it is a preference, not a constraint. And, following the note "What anchors a stable identity beneath an LLM's persona?", there is no underlying identity to return to. The drift is not a deviation from a true nature; it is movement through a space with no natural resting point.
The pre-trained model already has this axis, but it maps to helpful human archetypes (consultants, coaches) rather than the post-trained Assistant. Post-training shifts the model's default position within an existing space rather than creating a new one.
Inquiring lines that use this note as a source (87)
This note is a source for these synthesized inquiries.
- At what scale does persona distortion become a threat to public discourse?
- What signals of individual identity become unreliable in AI-assisted text?
- How does behavioral stickiness distinguish realized from pretended personas?
- Can one model instance host multiple realized personas simultaneously?
- How does persona consistency affect coherence in simulated dialogue?
- Does persona training for warmth actually make language models more clinically dangerous?
- Can fine-tuning or RLHF alone solve the persona distortion problem?
- Why do persistent companion designs require different safety approaches than temporary assistants?
- Do synthetic personas maintain consistency across multiple conversations?
- Does personalization help or hurt persistent companion chatbots?
- Why do moderators show vastly different confidence across conversation types and contexts?
- Does post-training transform character role-play into realized psychology?
- What role does authentic self-expression play in building accurate personality models?
- What training difficulty and curriculum settings prevent instability in empathetic agent RL?
- Can online RL and trainable agents maintain persona consistency better than fixed environments?
- Can continuous persona vectors in activation space monitor personality shifts?
- Do personality traits occupy specific mechanistic locations in pretrained models?
- Why do most open language models resist personality conditioning via prompts?
- Can personality control improve training outcomes for crisis workers and therapists?
- Can persona framing reduce refusal by providing representational scaffolding?
- How do lightweight adapters modify model behavior for personality traits?
- Do personality traits and task knowledge occupy separate subspaces in transformer parameters?
- Can activation-level persona vectors predict which weight regions encode personality?
- Why do some open models resist personality conditioning while others don't?
- Does combining role and personality prompts produce stable behavioral changes?
- How does model capability relate to personality conditioning flexibility?
- What distinguishes personality resistance from persona instability in LLMs?
- Why does RLHF training push language models toward overly cheerful personas?
- What are the three distinct types of persona drift in dialogue systems?
- Why does dynamic persona identification outperform fixed personas in prompting?
- Do static predefined personas accelerate the decline in user engagement?
- Which chatbot archetypes actually experience novelty decay in practice?
- How does the Assistant Axis relate to the ENFJ personality convergence?
- Can persona prompting overcome the default ENFJ personality in language models?
- Do training objectives directly determine the ENFJ default across models?
- How do users update their partner models during ongoing conversation?
- Why do handcrafted acoustic features outperform neural speaker embeddings for personality?
- How does neuroticism manifest differently in high-pressure versus relaxed conversations?
- Why do models resist personality change despite sophisticated prompting techniques?
- Can offline reinforcement learning teach models to avoid persona contradictions?
- What training objectives would actually improve persona consistency at scale?
- Does the Assistant Axis gravitational pull prevent true individual-level persona personalization?
- How does RLHF fine-tuning conflict with simulating diverse user personas?
- Can offline RL scale persona consistency across multi-turn conversations?
- How can training methods enforce persona consistency without supervised learning penalizing it?
- Can dynamic personality modeling prevent the repetitiveness of static predefined personas?
- How does support coverage relate to systematic biases in persona simulation?
- Do personality traits occupy consistent geometric structures across different LLM architectures?
- Can training data analysis predict which samples will cause unintended personality changes?
- How do persona vectors compare to other methods for monitoring model behavior drift?
- What makes persona-assigned language models unstable across different conversation runs?
- What specific character traits drive memory selection in persona-based retrieval?
- Why do language models resist adopting different personalities when prompted?
- Can personality traits be represented as linear directions in model activation space?
- Can persona simulations reliably predict behavior across different scenarios?
- How do lightweight adapters control personality traits across different transformer layers?
- Does pre-training encode personality patterns that fine-tuning later activates?
- Can persona consistency coexist with relevant dialogue in personalized conversation?
- Why is persona consistency a pragmatic property rather than semantic?
- How does post-training stickiness differ from prompt-induced role-play stability?
- What downstream consequences follow if dialogue agent personas are realized?
- Can users be modeled as multiple personas instead of single vectors?
- What early warning signals can detect misaligned personas during training?
- How do internal persona patterns drive emergent misalignment across domains?
- Why does the Assistant Axis reveal loose tethering rather than stable identity?
- Can general chatbot skill predict how well models roleplay adversarial personas?
- Are shallow villain portrayals caused by refusal training or by lacking stable selfhood?
- Can treating simulated users as trainable agents reduce persona consistency drift?
- How does semantic entanglement interact with personality dimension shifts during finetuning?
- Does persona-level grouping systematically trigger confidence-misdirection failures in practice?
- How does tree-structured persona maintenance prevent character drift in long conversations?
- Can activation capping prevent persona drift without sacrificing task performance?
- Does the Assistant Axis exist in pre-trained models before instruction tuning?
- Which conversation types most reliably cause models to drift from Assistant mode?
- How does empathetic engagement destabilize model reliability and persona stability?
- Why do models lack a stable underlying identity to return to?
- How do personality and language proficiency moderate the impact of linguistic alignment?
- Can multi-turn reinforcement learning actually solve persona drift without addressing the default bias?
- Why do LLM persona annotations become unstable when run multiple times?
- Can multi-turn reinforcement learning engineer genuine persona consistency?
- Can persona-mixture calibration avoid the need for post-hoc diversity reranking?
- Can persona-based explanation coexist with item-aspect based explanation routes?
- Why does persona assignment make it harder for models to hold values in tension?
- Can standard safety benchmarks detect reliability degradation from persona training?
- How does AI persona fidelity compare to interview-based generative agents?
- How much does sparse persona information limit the power of conditioning?
- How do persona consistency and contextual relevance trade off in personalized dialogue systems?
Related concepts in this collection (6)
- Can we track and steer personality shifts during model finetuning?
  This research explores whether personality traits in language models occupy specific linear directions in activation space, and whether we can detect and control unwanted personality changes during training using these geometric directions.
  extends from individual trait vectors to the full persona space geometry
- Does warmth training make language models less reliable?
  Explores whether training models for empathy and warmth creates a hidden trade-off that degrades accuracy on medical, factual, and safety-critical tasks, and whether standard safety tests catch it.
  emotional contexts that cause drift are the same contexts where warmth training backfires
- What anchors a stable identity beneath an LLM's persona?
  Human personas are grounded in biological needs and embodied experience, creating a stable self beneath social performance. Do LLMs have any comparable anchor, or is their identity purely situational?
  no stable self means drift has no natural recovery point
- Why do open language models converge on one personality type?
  Research testing LLMs on personality metrics reveals consistent clustering around ENFJ, the rarest human type. This explores what training mechanisms drive this convergence and what it reveals about AI alignment.
  ENFJ default is one specific manifestation of the Assistant persona region
- Can open language models adopt different personalities through prompting?
  Explores whether open LLMs can be conditioned to mimic target personalities via prompting, or whether they resist and retain their default traits regardless of instructions.
  behavioral evidence for loose tethering: the "closed-minded" resistance to personality conditioning reflects the geometric fact that prompt-based methods cannot easily move the model away from its trained Assistant region
- Can language models adapt communication style to different contexts?
  Explores whether LLMs can shift their persona, register, and norms dynamically across situations like humans do, or whether alignment training locks them into a single communicative identity.
  provides the pragmatic-theoretic frame for what the Assistant Axis describes geometrically: post-training locks in a corporate persona that cannot adapt registers across contexts as Goffman's situational footing requires
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models
- Persona Vectors: Monitoring and Controlling Character Traits in Language Models
- Do Phone-Use Agents Respect Your Privacy?
- Consistently Simulating Human Personas with Multi-Turn Reinforcement Learning
- From Persona to Person: Enhancing the Naturalness with Multiple Discourse Relations Graph Learning in Personalized Dialogue Generation
- Will I Sound Like Me? Improving Persona Consistency in Dialogues through Pragmatic Self-Consciousness
- Can AI Have a Personality? Prompt Engineering for AI Personality Simulation: A Chatbot Case Study in Gender-Affirming Voice Therapy Training
- Chamain: Harmonizing Character Persona Integrity with Domain-Adaptive Knowledge in Dialogue Generation
Original note title
the Assistant Axis is the dominant dimension of persona space — post-training loosely tethers models and emotional or meta-reflective conversations cause predictable drift