When does combining episodic and semantic memory reduce personalization performance?

This explores the trade-off between two memory types — episodic (recalling specific past interactions) and semantic (storing abstracted preference summaries) — and asks when blending them actually hurts rather than helps personalization.

The corpus suggests the answer is less about the combination itself and more about what episodic recall drags in when the two aren't kept clean. The sharpest evidence comes from the PRIME framework, which finds that abstract preference knowledge consistently beats retrieving specific past interactions across models (see "Does abstract preference knowledge outperform specific interaction recall?"). So when you fold episodic retrieval back into a semantic system, you risk diluting the signal with noisier, more literal material; the semantic layer was already doing the heavy lifting.

The most striking failure mode is an uncanny-valley effect. PRIME shows a U-shaped error curve where replacing a user's profile with a *nearly* matching one causes the steepest performance drop, worse than an obviously wrong profile (see "Why do similar user profiles produce worse personalization errors?"). This matters for episodic+semantic blends because episodic recall works by similarity: it surfaces the most similar past interactions, which is exactly the regime where the model confidently applies almost-right-but-wrong preferences. The combination degrades performance precisely when retrieved episodes are close-but-not-true matches, and the model has no way to know it's been misled.

There's also a question of *what* episodic memory captures. Personalization turns out to ride on style and output preferences, not the semantic content of what a user asked: profiles built from a user's past outputs match or exceed full profiles, while input-only profiles degrade results (see "Do user outputs outperform inputs for LLM personalization?"). Episodic recall that hauls in raw queries and context can therefore introduce content that pulls the model off the preference signal it actually needed.
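The output-only idea can be sketched in a few lines. This is an illustration under stated assumptions, not the method from the cited note: the `Interaction` type, its field names, and `build_output_profile` are hypothetical, and keeping only the last few outputs is an arbitrary choice for the example.

```python
from dataclasses import dataclass


@dataclass
class Interaction:
    """Hypothetical record of one past exchange."""
    user_input: str   # what the user asked or was shown
    user_output: str  # what the user actually wrote


def build_output_profile(history: list[Interaction], limit: int = 3) -> list[str]:
    """Build a profile from the user's own outputs only, oldest first.

    Inputs (queries, prompts, surrounding context) are deliberately dropped,
    so the profile carries style and preference signal rather than topic content.
    """
    outputs = [h.user_output for h in history if h.user_output.strip()]
    return outputs[-limit:] if limit > 0 else []


history = [
    Interaction("Summarize this paper on transformers", "Short version, bullets only."),
    Interaction("Draft a reply to the vendor", "Keep it blunt, no pleasantries."),
]
profile = build_output_profile(history)
```

Here `profile` contains the two user-written strings and none of the query text, which is the property the note argues matters.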

The corpus does point at a way to combine them without the penalty: keep them architecturally separate rather than fused. M3-Agent stores episodic events and semantic knowledge as distinct layers in an entity-centric graph, so semantic preferences are inferred *from* episodes but the two aren't collapsed into one retrieval pool (see "Can agents learn preferences by watching rather than asking?"). The lesson across these notes is consistent: episodic and semantic memory hurt personalization when merged into a single similarity-driven lookup, and help when semantic abstraction is allowed to override raw recall.
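A minimal sketch of that separation follows. It is not M3-Agent's implementation; `EntityMemory` and its methods are hypothetical names invented for illustration, preferences are set by hand rather than inferred from episodes, and the graph structure is reduced to a single per-entity record. Episodes are selected by recency rather than similarity, following the PRIME note's finding that recency-based recall outperforms similarity-based retrieval.

```python
from dataclasses import dataclass, field


@dataclass
class EntityMemory:
    """Hypothetical per-entity memory with two layers that are never merged."""
    episodes: list[str] = field(default_factory=list)          # raw past interactions
    preferences: dict[str, str] = field(default_factory=dict)  # distilled summaries

    def record_episode(self, text: str) -> None:
        self.episodes.append(text)

    def set_preference(self, key: str, value: str) -> None:
        # Assumption: a real system would infer this from episodes.
        self.preferences[key] = value

    def build_context(self, max_episodes: int = 2) -> str:
        # The semantic layer comes first and is marked authoritative.
        lines = ["Preferences (authoritative):"]
        lines += [f"- {k}: {v}" for k, v in sorted(self.preferences.items())]
        # Episodes are supporting detail only, most recent first.
        recent = self.episodes[-max_episodes:][::-1] if max_episodes > 0 else []
        if recent:
            lines.append("Recent episodes (supporting only):")
            lines += [f"- {e}" for e in recent]
        return "\n".join(lines)


memory = EntityMemory()
memory.record_episode("Asked for a vegetarian dinner plan")
memory.record_episode("Rejected a long-form answer")
memory.set_preference("format", "short bullet lists")
context = memory.build_context()
```

The design choice worth noting is that the two layers reach the prompt through separate, labeled sections, so a near-miss episode cannot silently replace a stored preference.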

Worth knowing on the side: even good reasoning can sabotage personalization if it ignores user context. Generic chain-of-thought underperforms here, and only customized thinking traces recover both depth and relevance (see "Why does chain-of-thought reasoning fail for personalization?"). The throughline is that more information is not the win; the right *abstraction* of it is.

Sources 5 notes

Does abstract preference knowledge outperform specific interaction recall?

PRIME framework shows semantic memory (preference summaries, parametric encodings) consistently beats episodic memory (retrieved past interactions) across models. Recency-based recall outperforms similarity-based retrieval, and task fine-tuning exceeds preference tuning methods.

Why do similar user profiles produce worse personalization errors?

PRIME shows a U-shaped error curve where most-similar profile replacements cause the steepest performance drops. The model confidently applies wrong preferences when profiles are nearly but not truly matched, an uncanny-valley effect more harmful than an obvious mismatch.

Do user outputs outperform inputs for LLM personalization?

Research shows that user profiles built from outputs alone match or exceed performance of complete profiles across multiple tasks, while input-only profiles degrade performance. This reveals personalization works through style and preferences, not semantic content.

Can agents learn preferences by watching rather than asking?

M3-Agent demonstrates that separating episodic events from semantic knowledge in an entity-centric graph, combined with parallel memorization and control processes, allows agents to infer and act on user preferences without asking. This architecture mirrors human cognitive systems that bind disparate information about individuals across sensory modalities.

Why does chain-of-thought reasoning fail for personalization?

Generic chain-of-thought underperforms for personalization because it ignores user context. Fine-tuning destroys reasoning capacity entirely. Self-distillation lets models generate customized thinking traces that maintain both depth and relevance.
