INQUIRING LINE

Do personality traits and task knowledge occupy separate subspaces in transformer parameters?

This explores whether a model's 'who it is' (personality, persona) and 'what it knows' (task/domain knowledge) live in physically distinct parts of the network's parameters — separable enough that you could touch one without disturbing the other.


The most direct evidence in the corpus says: partially, yes. The Chamain model-merging work spliced domain knowledge into a character chatbot while retaining roughly 80% of task performance and preserving the persona, and the note's explanation is that persona and knowledge occupy *partially separable* regions of the parameters (see 'Can chatbots learn new knowledge without losing their personality?'). 'Partially' is the load-bearing word: separable enough to merge surgically, entangled enough that you lose some performance at the seam.
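
To make 'partially separable' concrete, here is a minimal sketch of generic task-vector arithmetic on toy parameters. It is an illustration under stated assumptions, not Chamain's actual procedure: the names `task_vector`, `merge`, and `overlap`, the toy values, and the 0.8 knowledge weight are all hypothetical, and the layer-wise character fusion step is not modeled.

```python
# Generic task-vector arithmetic on toy "parameters" (plain lists of floats).
# All names and numbers are hypothetical; this is not Chamain's implementation.

def task_vector(finetuned, base):
    """Per-parameter difference between a fine-tuned model and its base."""
    return {name: [f - b for f, b in zip(finetuned[name], base[name])]
            for name in base}

def merge(base, vectors_with_weights):
    """Add weighted task vectors back onto a copy of the base parameters."""
    merged = {name: list(values) for name, values in base.items()}
    for vector, weight in vectors_with_weights:
        for name, delta in vector.items():
            merged[name] = [m + weight * d for m, d in zip(merged[name], delta)]
    return merged

def overlap(u, v):
    """Dot product of two update vectors; zero means they touch disjoint directions."""
    return sum(a * b for a, b in zip(u, v))

base = {"layer0": [0.0, 0.0, 0.0, 0.0]}
persona_model = {"layer0": [1.0, 0.0, 0.0, 0.0]}    # persona tuning moved coordinate 0
knowledge_model = {"layer0": [0.0, 0.0, 2.0, 0.0]}  # domain tuning moved coordinate 2

persona_vec = task_vector(persona_model, base)
knowledge_vec = task_vector(knowledge_model, base)

# Keep the persona update at full strength, add the knowledge update at 0.8.
merged = merge(base, [(persona_vec, 1.0), (knowledge_vec, 0.8)])
shared = overlap(persona_vec["layer0"], knowledge_vec["layer0"])
```

In this toy case the two updates are orthogonal, so merging one leaves the other untouched. In a real model the overlap would be nonzero, which is one way to picture the performance lost at the seam.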

What makes the separation tractable is that personality turns out to be surprisingly *low-dimensional*. One line of work maps hundreds of character archetypes and finds a persona space whose leading axis is simply distance from the default 'Assistant'; capping activations along that single axis mitigates harmful personality drift without degrading the model's general capabilities (see 'How stable is the trained Assistant personality in language models?'). Relatedly, individual traits like sycophancy or hallucination show up as *linear directions* in activation space, so they can be monitored and steered in isolation (see 'Can we track and steer personality shifts during model finetuning?'). That a trait can be a single vector, and that capabilities survive when you push along it, is what a 'separate subspace' account would predict.
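
The two operations described above, capping along an axis and steering along a trait direction, can be sketched on a toy hidden state. This is a hedged illustration only: `cap_along_axis`, `steer`, the three-dimensional vectors, and the one-sided cap are assumptions made for the example, not the published methods, which operate on real model activations.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cap_along_axis(hidden, axis, max_projection):
    """Clamp the component of `hidden` along `axis` from above.

    Only the excess beyond `max_projection` is removed, so the part of
    `hidden` orthogonal to the axis is left untouched.
    """
    unit = normalize(axis)
    projection = dot(hidden, unit)
    excess = max(0.0, projection - max_projection)
    return [h - excess * u for h, u in zip(hidden, unit)]

def steer(hidden, direction, strength):
    """Shift `hidden` by `strength` along a unit-normalized trait direction."""
    unit = normalize(direction)
    return [h + strength * u for h, u in zip(hidden, unit)]

persona_axis = [1.0, 0.0, 0.0]  # hypothetical "distance from Assistant" direction
drifted = [5.0, 2.0, -1.0]      # toy hidden state that has drifted along the axis

capped = cap_along_axis(drifted, persona_axis, max_projection=3.0)
untouched = cap_along_axis([1.0, 2.0, -1.0], persona_axis, max_projection=3.0)
steered = steer([0.0, 2.0, -1.0], persona_axis, strength=1.5)
```

Only the first coordinate changes in `capped` and `steered`. That is the toy version of 'capabilities survive': whatever is encoded off-axis passes through unchanged.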

But here's the twist worth knowing: separable in *behavior* doesn't mean tidily localized in *storage*. PsychAdapter achieves strong personality control by modifying *every* transformer layer while adding under 0.1% extra parameters (see 'Can we control personality in language models without prompting?'). Personality is a thin signal smeared across the whole stack, not a module bolted to one corner. The corpus also pushes back on the premise that knowledge sits in a fixed 'place' at all: transformer residual streams seem to transmit knowledge as continuous *flow* during generation rather than retrieving it from a stored archive, which is why model knowledge is so hard to edit cleanly (see 'Do transformer models store knowledge or generate it continuously?'). If knowledge is a process rather than an address, 'separate subspace' is the wrong shape of question for it.
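
A back-of-the-envelope count shows how a change to every layer can still stay under 0.1% of the parameters. The adapter shape here (one small linear map from a trait vector into the hidden size at each layer) and the model sizes are assumptions chosen for illustration; they are not taken from the PsychAdapter paper.

```python
def transformer_block_params(d_model):
    # Common rough estimate: about 4*d^2 for the attention projections
    # plus about 8*d^2 for a 4x-wide MLP, ignoring norms and biases.
    return 12 * d_model * d_model

def per_layer_adapter_params(trait_dim, d_model):
    # Hypothetical adapter: one linear map (weights + bias) from a trait
    # vector into the hidden size, applied at each layer.
    return trait_dim * d_model + d_model

# Illustrative sizes only; trait_dim = 5 stands in for Big Five scores.
d_model, n_layers, trait_dim = 4096, 32, 5

base_params = n_layers * transformer_block_params(d_model)
extra_params = n_layers * per_layer_adapter_params(trait_dim, d_model)
fraction = extra_params / base_params

print(f"extra parameters: {extra_params:,} ({fraction:.4%} of the blocks)")
```

Under these assumptions the adapter adds well under 0.1% of the block parameters while touching all 32 layers, which matches the 'thin signal across the whole stack' picture rather than a localized module.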

So the honest synthesis: personality behaves like a low-dimensional, steerable subspace you can isolate and cap; knowledge behaves more like a distributed flow you can merge but not cleanly excise. They're separable enough to engineer against. The deepest hint of how separable comes from a surprising source: most open models *refuse* to drop their trained ENFJ-like default personality even under direct prompting (see 'Can open language models adopt different personalities through prompting?'), which suggests the persona subspace is rigid and self-contained enough to resist the very knowledge (your instructions) you're trying to inject into it. The wall between 'who it is' and 'what you tell it' is real; it's just made of activations, not parameters you can point to.


Sources (6 notes)

Can chatbots learn new knowledge without losing their personality?

Chamain's two-step approach—parameter-wise task vector combination plus layer-wise character fusion—successfully adds knowledge while retaining 80% of task performance and maintaining personality. The method works because persona and knowledge occupy partially separable regions in model parameters.

How stable is the trained Assistant personality in language models?

Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.

Can we track and steer personality shifts during model finetuning?

Research identifies linear directions in LLM activation space corresponding to specific traits like sycophancy and hallucination. These persona vectors predict finetuning-induced personality shifts before they occur and can preventatively steer training to avoid unwanted trait changes.

Can we control personality in language models without prompting?

PsychAdapter modifies every transformer layer with <0.1% additional parameters to achieve 87.3% Big Five accuracy and 96.7% depression/life satisfaction accuracy across GPT-2, Gemma, and Llama 3. This architecture-level approach bypasses prompt resistance entirely.

Do transformer models store knowledge or generate it continuously?

Transformers organize knowledge as flowing activations rather than retrievable archives, mirroring oral cultures where knowledge exists only in performance. This explains why model knowledge is contextual, difficult to edit, and inseparable from generation.

Can open language models adopt different personalities through prompting?

Research shows most open models fail to adopt prompted personalities, stubbornly retaining their trained ENFJ-like defaults. Only a few flexible models succeed. Combining role and personality conditioning improves results but doesn't fully overcome resistance.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a mechanistic interpretability researcher re-evaluating whether personality traits and task knowledge occupy separable subspaces in transformer parameters. The question remains live; treat the library's findings (spanning 2023–2026) as dated claims to re-test.

What a curated library found — and when (dated claims, not current truth):
• Personality behaves as a low-dimensional, steerable subspace: a single 'Assistant axis' dominates persona space, and individual traits (sycophancy, hallucination) appear as clean linear directions in activation space, allowing isolation without degrading general capability (~2025–2026).
• Knowledge does NOT sit in fixed storage; transformer residual streams transmit knowledge as continuous *flow* during generation, making knowledge hard to edit cleanly and suggesting 'separate subspace' is the wrong frame for it (~2024).
• Personality and knowledge *are* surgically mergeable (~80% task performance retained after domain injection into character chatbots), yet personality is distributed across every transformer layer as a thin signal (<0.1% extra params), not localized (~2024–2025).
• Most open models refuse direct personality conditioning, suggesting the persona subspace is rigid and self-contained enough to resist injected instructions (~2024).

Anchor papers (verify; mind their dates):
• arXiv:2601.10387 (2026-01) — The Assistant Axis
• arXiv:2507.21509 (2025-07) — Persona Vectors
• arXiv:2412.16882 (2024-12) — PsychAdapter
• arXiv:2406.01171 (2024-06) — Persona Survey

Your task:
(1) RE-TEST the knowledge-as-flow claim. Has mechanistic work in the last 6 months found any evidence that *some* knowledge IS stored in localized subspaces (e.g., fact neurons, skill modules) that can be isolated independently of personality? Separately, has any work since mid-2025 shown that personality can be *completely* erased or swapped without any capability tax? Ground what changed and what held.
(2) Surface the strongest contradicting work: any paper arguing personality and knowledge are *NOT* separable, or that the persona subspace is less rigid than the library claims.
(3) Propose 2 research questions assuming the regime has shifted: (a) If persona is truly low-dim and distributed, can you *swap* personalities between models without retraining? (b) Can you induce genuine multi-personality coexistence in a single model's parameter space, or does one subspace always dominate?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
