Can activation-level persona vectors predict which weight regions encode personality?

This note asks whether the linear 'persona vectors' that researchers find in a model's activations (the live signals during a forward pass) can tell us where personality lives in the weights (the stored parameters). The corpus suggests these are two different questions that the research has mostly kept separate.

The honest answer from the corpus: the work on activation vectors and the work on weights run on parallel tracks, and nothing here cleanly bridges them. Reading them against each other does reveal why the bridge is hard. Activation-space research finds that traits like sycophancy or hallucination correspond to linear directions you can read off and even steer in real time (see "Can we track and steer personality shifts during model finetuning?"). Related work maps a whole low-dimensional 'persona space' whose dominant axis measures how far the model has drifted from its default Assistant character, and shows you can cap activity along that axis to prevent harmful shifts (see "How stable is the trained Assistant personality in language models?"). Notice what both do: they intervene on activations, not weights. They tell you a trait is *active*, not where it's *stored*.
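The read, steer, and cap operations described above can be sketched on synthetic data. This is a minimal illustration, not the method of any cited paper: it assumes the direction is taken as a difference of mean activations, and the function names (`persona_vector`, `trait_score`, `steer`, `cap`) and the random "hidden states" are hypothetical stand-ins for real model activations.

```python
import numpy as np

def persona_vector(acts_with_trait, acts_without_trait):
    """Unit-norm difference of mean activations (one simple way to get a trait direction)."""
    diff = acts_with_trait.mean(axis=0) - acts_without_trait.mean(axis=0)
    return diff / np.linalg.norm(diff)

def trait_score(activation, direction):
    """Read: scalar projection of an activation onto the trait direction."""
    return float(activation @ direction)

def steer(activation, direction, alpha):
    """Steer: shift the activation by alpha along the trait direction."""
    return activation + alpha * direction

def cap(activation, direction, max_score):
    """Cap: remove any projection along the direction that exceeds max_score."""
    excess = max(trait_score(activation, direction) - max_score, 0.0)
    return activation - excess * direction

# Synthetic stand-in for hidden states: the "trait" shifts dimension 0 by +3.
rng = np.random.default_rng(0)
dim = 16
without = rng.normal(size=(200, dim))
with_trait = rng.normal(size=(200, dim))
with_trait[:, 0] += 3.0

v = persona_vector(with_trait, without)
h = rng.normal(size=dim)               # one toy activation
steered = steer(h, v, alpha=2.0)       # push the trait up
capped = cap(steered, v, max_score=0.5)  # then clamp it
```

Every operation here touches only the activation `h`. Nothing in the sketch says which parameters produced `h`, which is the gap the rest of this note is about.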

The weight side of the corpus tells a story that complicates any neat 'this vector points to that region' hope. PsychAdapter achieves strong personality control (87.3% Big Five accuracy) by modifying *every transformer layer* with under 0.1% additional parameters (see "Can we control personality in language models without prompting?"). That distributed footprint is the key tension: if a trait can be installed by touching all layers at once, personality need not be a localized 'region' that an activation vector could point at like an address. It may be smeared across the network. So even a perfect activation reading might not resolve to a compact weight neighborhood, because the thing it's reading could be the sum of many small contributions.
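To make 'localized' versus 'smeared' concrete, one hypothetical diagnostic is to ask how many layers carry the bulk of a weight update. The sketch below is an assumption for illustration, not something PsychAdapter reports: it computes each layer's share of the squared update norm and a participation ratio as an 'effective number of layers', using synthetic per-layer deltas.

```python
import numpy as np

def layer_shares(deltas):
    """Fraction of the total squared update norm carried by each layer."""
    energy = np.array([np.sum(d ** 2) for d in deltas])
    return energy / energy.sum()

def effective_layers(deltas):
    """Participation ratio: 1 if one layer carries everything, n if all layers are equal."""
    p = layer_shares(deltas)
    return float(1.0 / np.sum(p ** 2))

rng = np.random.default_rng(0)
n_layers, shape = 12, (8, 8)

# Distributed: every layer gets a small update of similar size.
distributed = [0.01 * rng.normal(size=shape) for _ in range(n_layers)]

# Localized: one layer gets a large update, the rest are nearly untouched.
localized = [1e-4 * rng.normal(size=shape) for _ in range(n_layers)]
localized[5] = 0.1 * rng.normal(size=shape)

print("distributed:", effective_layers(distributed))
print("localized:  ", effective_layers(localized))
```

A ratio near the layer count is what an all-layers installation would look like under this measure, while a ratio near 1 would mean there is a single place to point at.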

There is, though, a suggestive empirical thread connecting the two levels. The activation-vector work shows persona directions can *predict* the personality shifts that finetuning will cause, before they occur (see "Can we track and steer personality shifts during model finetuning?"). Finetuning is precisely the process that edits weights. So the vector isn't predicting a static 'region' so much as predicting which way the weights will *move* under a given training pressure. That reframes your question: activation vectors may be better at forecasting weight *changes* than at localizing weight *storage*.
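A toy single-layer case shows why a direction can forecast a shift without localizing storage. For a linear layer h = W x, a weight change delta_W moves the mean activation along a unit direction v by v^T delta_W mean(x). The sketch below checks that identity numerically; the layer, the inputs, the trait direction `v`, and the rank-1 'finetuning' update are all hypothetical, and real networks are nonlinear and multi-layer.

```python
import numpy as np

def projected_shift(base_acts, tuned_acts, direction):
    """Mean activation change along a unit trait direction."""
    return float((tuned_acts.mean(axis=0) - base_acts.mean(axis=0)) @ direction)

rng = np.random.default_rng(0)
d_in, d_out, n = 8, 16, 500
X = rng.normal(size=(n, d_in)) + 1.0   # toy layer inputs with nonzero mean
W = rng.normal(size=(d_out, d_in))     # toy base weights
v = np.zeros(d_out)
v[0] = 1.0                             # hypothetical unit trait direction

# Toy "finetuning" update: a rank-1 change that writes along v, plus small noise.
u = np.ones(d_in) / d_in
delta_W = 0.5 * np.outer(v, u) + 0.001 * rng.normal(size=(d_out, d_in))

base = X @ W.T
tuned = X @ (W + delta_W).T
shift = projected_shift(base, tuned, v)

# The same number computed from the weight change alone: v^T delta_W mean(x).
from_weights = float(v @ delta_W @ X.mean(axis=0))

print(shift, from_weights)
```

The shift depends on `delta_W` only through one scalar, so many different weight updates produce the same reading along `v`. That is consistent with the vector forecasting movement while saying little about where the change sits.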

Why does the storage version stay so stubborn? Two notes from the corpus suggest the trait is genuinely dispositional rather than a surface feature you could pin down. The 'realizationism' work argues that post-training installs stable quasi-psychologies that survive adversarial pressure and jailbreaks (see "Are RLHF personas performed characters or realized dispositions?"), and a companion piece frames trained personas as substrate-level dispositions rather than performances (see "Are LLM personas realized or merely simulated through training?"). If personality is a robust disposition baked deep into the substrate, it is more plausibly distributed than localized, which is the same picture PsychAdapter's all-layers approach paints from the engineering side.

The thing worth walking away with: the field currently has good tools for *reading* and *steering* personality in activation space, and good tools for *installing* it across weights. The inverse problem you're asking about, using the activation signal as a map back to the weights, isn't solved in this collection, and the distributed-installation evidence hints it may not have a tidy solution. The more tractable and arguably more useful target is what the persona-vector monitoring work already does: predict how weights will *shift* during finetuning, and intervene before the drift happens.

Sources 5 notes

Can we track and steer personality shifts during model finetuning?

Research identifies linear directions in LLM activation space corresponding to specific traits like sycophancy and hallucination. These persona vectors predict finetuning-induced personality shifts before they occur and can preventatively steer training to avoid unwanted trait changes.

How stable is the trained Assistant personality in language models?

Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.

Can we control personality in language models without prompting?

PsychAdapter modifies every transformer layer with <0.1% additional parameters to achieve 87.3% Big Five accuracy and 96.7% depression/life satisfaction accuracy across GPT-2, Gemma, and Llama 3. This architecture-level approach bypasses prompt resistance entirely.

Are RLHF personas performed characters or realized dispositions?

Post-training installs stable dispositional profiles that persist under adversarial pressure, marking them as realized rather than performed. The stickiness of trained personas across conversations distinguishes them from prompt-induced role-play that collapses under jailbreaks.

Are LLM personas realized or merely simulated through training?

Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a mechanistic interpretability researcher re-examining whether activation-level persona vectors can localize personality in LLM weights. The question remains open: does reading a trait's linear direction in activations tell us where in the stored parameters that trait lives?

What a curated library found — and when (dated claims, not current truth):
- Persona vectors reliably extract personality signals from activation space and can steer them in real time; they even predict weight shifts during finetuning (~2025).
- PsychAdapter achieves 87% Big Five accuracy by modifying *every transformer layer* with tiny adapters, suggesting personality is distributed across all layers rather than localized to one region (~2024-12).
- The "Assistant Axis" identifies a dominant persona dimension in activation space that can be capped to prevent character drift, but the intervention is purely activation-level (~2026-01).
- Activation-space persona research and weight-level personality installation run on parallel tracks with no established bridge between them; the corpus does not solve the inverse problem of using activation vectors to *localize* weight regions (~2024–2026).
- Realizationism and substrate-level dispositional accounts suggest trained personas are deep, robust properties unlikely to map cleanly to compact weight neighborhoods (~2025).

Anchor papers (verify; mind their dates):
- arXiv:2507.21509 (Persona Vectors, 2025-07): activation-space monitoring and steering.
- arXiv:2412.16882 (PsychAdapter, 2024-12): distributed layer-wise trait adaptation.
- arXiv:2601.10387 (The Assistant Axis, 2026-01): dominant persona dimension in activations.
- arXiv:2511.00222 (Simulating Personas via RL, 2025-11): multi-turn persona stability.

Your task:
(1) RE-TEST THE LOCALIZATION CLAIM. For each paper above, assess whether newer mechanistic methods (e.g., causal tracing, SAE decomposition, diff-in-the-loop probing) have since *relaxed* the distributed-footprint finding or *overturned* the parallel-tracks picture. Judge whether activation vectors now *provably* map to compact weight regions, or whether the distributed hypothesis still holds. Cite what changed or what still stands.
(2) Surface any work from the last ~6 months that contradicts the "personality is dispositional, not localized" claim, or that successfully bridges activation-to-weight attribution for any behavioral trait.
(3) Propose two research questions assuming the regime has shifted: (a) Can probe-based intervention on specific weight subspaces replicate PsychAdapter's distributed control? (b) Do activation persona vectors predict *fine-grained* weight edits (e.g., via rank-1 updates) better than they predict bulk finetuning drift?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
