What specific character traits drive memory selection in persona-based retrieval?
This explores how a character's personality profile guides which memories get pulled back during persona-based retrieval — and whether the corpus actually pins down specific traits, or just shows that the whole psychological profile does the steering.
The honest answer is that the corpus shows persona-conditioned retrieval working well but rarely isolates which individual traits do the driving. The strongest direct case is "Can LLMs predict character choices from narrative context?", where the LIFECHOICE benchmark pairs expert-written persona profiles with retrieved memories chosen for their relevance to the character's psychology. What does the selecting there is not a tidy list of named traits; it is the whole expert-authored profile acting as a relevance filter, and that approach beats automated summarization by 5%. So the mechanism is 'psychology-relevant memory,' not 'extraversion pulled this specific scene.'
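As a rough illustration of a whole profile acting as a relevance filter, here is a minimal sketch. It is not the LIFECHOICE pipeline: the function names, the bag-of-words similarity, and the example texts are all assumptions made for illustration, and a real system would likely use learned embeddings. The point is only that the query is the entire profile text, not a per-trait lookup.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a toy stand-in for a real embedding (assumption)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_by_profile(profile: str, memories: list[str], k: int = 2) -> list[str]:
    """Rank memories against the whole profile text, not against named traits."""
    query = bag_of_words(profile)
    ranked = sorted(
        memories,
        key=lambda memory: cosine(query, bag_of_words(memory)),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical profile and memories, invented for this example.
profile = "Proud and fiercely loyal to her family; distrusts strangers and avoids debt."
memories = [
    "The weather in the harbor was mild that spring.",
    "She refused the loan because debt frightened her family.",
    "She turned away the strangers at the door.",
]
selected = retrieve_by_profile(profile, memories, k=2)
```

Nothing in this sketch names a trait such as 'loyalty' or 'distrust' as a separate query. Whatever the profile mentions contributes to the ranking at once, which is why the result says little about which individual trait did the selecting.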
The more interesting move is to ask what 'drives selection' even means, and here the corpus splits. "Does abstract preference knowledge outperform specific interaction recall?" (the PRIME work) argues that abstract preference summaries beat retrieving specific past interactions and, strikingly, that recency-based recall beats similarity-based retrieval. That undercuts the premise of the question: if recency wins, then how recent a memory is matters at least as much as how well it matches a trait. Selection may be driven less by trait-fit and more by compression and timing. "Can personas evolve in real time to match what users actually want?" sits between the two: its PersonaAgent uses the persona itself as the bridge between memory (both episodic and semantic) and the actions taken, and tunes that persona at test time against textual feedback. There the 'trait' isn't fixed; it's an evolving intermediary that reshapes what counts as relevant.
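To make the recency-versus-similarity contrast concrete, here is a small sketch of the two selection rules applied to the same memory store. It is not the PRIME implementation: the `Memory` class, the `turn` field, and the Jaccard word overlap are hypothetical stand-ins chosen to keep the example self-contained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Memory:
    text: str
    turn: int  # hypothetical timestamp: a higher value means more recent


def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap of word sets; a toy stand-in for embedding similarity."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def retrieve_by_similarity(query: str, memories: list[Memory], k: int = 1) -> list[Memory]:
    """Select the k memories whose text best matches the query."""
    return sorted(memories, key=lambda m: word_overlap(query, m.text), reverse=True)[:k]


def retrieve_by_recency(memories: list[Memory], k: int = 1) -> list[Memory]:
    """Select the k most recent memories, ignoring the query entirely."""
    return sorted(memories, key=lambda m: m.turn, reverse=True)[:k]


# Hypothetical store, invented for this example.
store = [
    Memory("prefers quiet evenings at home", turn=1),
    Memory("booked a loud concert last night", turn=9),
]
by_similarity = retrieve_by_similarity("quiet evenings", store)
by_recency = retrieve_by_recency(store)
```

With a trait-like query such as "quiet evenings", the similarity rule returns the older, matching memory, while the recency rule returns the newer one without consulting the query at all. Under the recency rule, no trait participates in selection, which is the sense in which the finding challenges a trait-driven account.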
The deeper complication is whether the traits doing the driving are even stable enough to drive anything. A cluster of papers says trained personas are real and sticky: "Are RLHF personas performed characters or realized dispositions?" and "Are LLM personas realized or merely simulated through training?" argue that post-training installs durable dispositions that resist jailbreaks. But "Do large language models actually commit to a single character?" says the opposite: models hold a superposition of characters and sample one at generation time, so regenerating yields a different 'self' each time. If that's right, the trait set guiding retrieval isn't a fixed profile but a draw from a distribution, which is exactly the kind of instability that "Can training user simulators reduce persona drift in dialogue?" tries to suppress by training simulators for consistency.
And when you look at which traits models default to, the picture gets stranger. "Why do AI personas default to the same personality type?" and "Can open language models adopt different personalities through prompting?" both find that models collapse toward the same ENFJ-like profile and resist being conditioned away from it, while "How stable is the trained Assistant personality in language models?" shows that the single biggest axis of persona variation is just distance from the default Assistant. So if you're hoping that, say, 'high conscientiousness' selectively pulls dutiful memories, the worry is that the underlying trait space is lopsided and sticky before retrieval even starts.
The thing worth walking away with: the field hasn't really answered 'which traits drive selection', because the better-supported findings point elsewhere. The persona as a whole conditions retrieval; abstraction and recency often matter more than trait-matching; and the traits themselves may be unstable or homogenized. If you want a genuinely trait-level retrieval story, "Can LLMs predict character choices from narrative context?" is the place to start, and the PRIME and PersonaAgent papers are where you'll find the case for why that story might be the wrong frame.
Sources (10 notes)
The LIFECHOICE benchmark (1,462 decisions across 388 novels) shows LLMs predict character choices better when given expert-written persona profiles paired with retrieved memories relevant to the character's psychology. This persona-based approach outperforms automated summarization by 5%.
PRIME framework shows semantic memory (preference summaries, parametric encodings) consistently beats episodic memory (retrieved past interactions) across models. Recency-based recall outperforms similarity-based retrieval, and task fine-tuning exceeds preference tuning methods.
PersonaAgent uses structured personas to bridge episodic/semantic memory and personalized actions, optimizing them at test time by simulating recent interactions against textual feedback. Learned personas cluster meaningfully in latent space, suggesting genuine user-specific separation beyond standard post-training drift.
Post-training installs stable dispositional profiles that persist under adversarial pressure, marking them as realized rather than performed. The stickiness of trained personas across conversations distinguishes them from prompt-induced role-play that collapses under jailbreaks.
Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.
Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, which indicates that the model has made no fixed commitment.
By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.
Research shows language models assigned personas systematically default to ENFJ (the rarest human type) and exhibit motivated reasoning that persists across model generations. Persona consistency does not improve with advanced models, suggesting training-induced alignment rather than capability limits.
Research shows most open models fail to adopt prompted personalities, stubbornly retaining their trained ENFJ-like defaults. Only a few flexible models succeed. Combining role and personality conditioning improves results but doesn't fully overcome resistance.
Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.