Can episodic raw memory outperform consolidated summaries in practice?

This explores whether keeping raw, unedited records of past events (episodic memory) can beat compressing them into tidy summaries (consolidated memory) — and the corpus shows the answer flips depending on what you're trying to remember.

The surprising thing the corpus reveals is that the two camps disagree because they are remembering different kinds of things. The strongest case for raw memory comes from work showing that continuously summarizing an agent's experience makes it worse over time: consolidated textual memory follows an inverted-U curve, helping at first and then degrading until it underperforms keeping the episodes untouched, with one model failing more than half of the problems it had previously solved (see "Does agent memory degrade when continuously consolidated?"). The single-model compression approach, which folds memory-writing and answering into one operation, runs into the same wall: it escapes the retrieval bottleneck but inherits a fragile consolidation pattern that drifts below even a no-memory baseline (see "Can a single model replace retrieval for long-term conversation memory?").

The diagnosis is specific, which is what makes it worth knowing. Summarization fails in three named ways: misgrouping unrelated events, stripping the conditions that made an old lesson applicable, and overfitting to a narrow slice of experience (see "Does agent memory degrade when continuously consolidated?"). In other words, a summary throws away exactly the contextual fine print that lets you tell when a past solution actually transfers. Raw episodes keep that fine print.
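To make the "fine print" point concrete, here is a minimal sketch of applicability stripping. Everything in it is hypothetical and chosen for illustration: the `Episode` record, the `conditions` field, and the toy build-failure data do not come from the cited work. It only shows how a summary keyed on the problem alone can return a fix whose preconditions no longer hold, while a raw episode store can still check them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    problem: str
    fix: str
    conditions: frozenset  # hypothetical "fine print": when the fix applied


def naive_summary(episodes):
    """Collapse episodes into problem -> fix, discarding the conditions."""
    return {ep.problem: ep.fix for ep in episodes}  # later episodes overwrite earlier ones


def recall_raw(episodes, problem, context):
    """Return fixes whose recorded conditions all hold in the current context."""
    return [
        ep.fix
        for ep in episodes
        if ep.problem == problem and ep.conditions <= context
    ]


episodes = [
    Episode("build fails", "clear the cache", frozenset({"os=linux", "cache=stale"})),
    Episode("build fails", "pin the compiler", frozenset({"os=macos"})),
]
context = frozenset({"os=linux", "cache=stale"})

summary = naive_summary(episodes)
summary_answer = summary["build fails"]                  # conditions are gone
raw_answer = recall_raw(episodes, "build fails", context)  # conditions are checked
```

In this toy case the summary hands back the macOS fix for a Linux context, because the two episodes were grouped under one key and their conditions were dropped. The raw store returns only the fix whose conditions match.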

But the corpus refuses to crown raw memory outright. The opposing result is just as sharp: for personalization, abstract preference summaries consistently beat retrieving specific past interactions, and pulling recent episodes works better than pulling similar ones (see "Does abstract preference knowledge outperform specific interaction recall?"). The reconciliation is that these tasks reward different things. Remembering who a user *is* benefits from compression into stable traits; remembering how to *solve a problem you've solved before* punishes it, because the discarded details were load-bearing.
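The difference between recency-based and similarity-based recall is easy to see in code. The sketch below is an assumption-laden illustration, not the cited framework's method: the `Interaction` record is hypothetical, and word-overlap (Jaccard) similarity stands in for whatever embedding similarity a real system would use.

```python
from dataclasses import dataclass


@dataclass
class Interaction:
    text: str
    timestamp: int  # larger means more recent


def jaccard(a, b):
    """Word-overlap similarity; a stand-in for embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def recall_recent(history, k):
    """Return the k most recent interactions, ignoring the query."""
    return sorted(history, key=lambda i: i.timestamp, reverse=True)[:k]


def recall_similar(history, query, k):
    """Return the k interactions most similar to the query, ignoring time."""
    return sorted(history, key=lambda i: jaccard(i.text, query), reverse=True)[:k]


history = [
    Interaction("user asked for vegetarian dinner recipes", 1),
    Interaction("user asked about hiking trails", 2),
    Interaction("user said they now eat fish", 3),
]
query = "vegetarian dinner recipes"

most_recent = recall_recent(history, 1)[0]
most_similar = recall_similar(history, query, 1)[0]
```

Here similarity retrieval surfaces the old vegetarian request, while recency retrieval surfaces the later statement that the user now eats fish. That is one plausible reading of why recent episodes can serve personalization better than similar ones: preferences drift, and the newest record reflects the current state.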

That splits the design space into 'when' and 'how.' On the 'how' side, the failures above seem to be about bad consolidation rather than consolidation itself: agents that fold history into explicit episodic, working, and tool schemas, with the autonomy to pause and reconsider, cut token overhead without the degradation that plagues naive summarization (see "Can agents compress their own memory without losing critical details?"). Structure and selectivity matter more than the raw/summarized binary. A complementary reframing argues the real bottleneck isn't storage capacity at all but the *compute* needed to transform evicted context into durable internal state, with quality improving the more consolidation passes you spend (see "Is long-context bottleneck really about memory or compute?"). The implication is that many summaries fail simply because they were done too cheaply.
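As a rough picture of what folding history into explicit schemas could look like, here is a minimal sketch. The three memory types are named in the source; the field layout, the `fold` function, and the step-log format are all hypothetical, invented for illustration, and should not be read as the cited agent's actual design.

```python
from dataclasses import dataclass, field


@dataclass
class FoldedMemory:
    episodic: list = field(default_factory=list)  # key events and outcomes
    working: dict = field(default_factory=dict)   # current goal and open state
    tool: dict = field(default_factory=dict)      # per-tool call and failure counts


def fold(history):
    """Fold a raw step log into three typed schemas instead of one free-text summary."""
    mem = FoldedMemory()
    for step in history:
        kind = step["kind"]
        if kind == "tool_call":
            stats = mem.tool.setdefault(step["tool"], {"calls": 0, "failures": 0})
            stats["calls"] += 1
            if not step["ok"]:
                stats["failures"] += 1
        elif kind == "milestone":
            mem.episodic.append(step["note"])
        elif kind == "goal":
            mem.working["current_goal"] = step["note"]
    return mem


# Hypothetical step log in an assumed format.
history = [
    {"kind": "goal", "note": "find a flight under $300"},
    {"kind": "tool_call", "tool": "search", "ok": True},
    {"kind": "tool_call", "tool": "search", "ok": False},
    {"kind": "milestone", "note": "cheapest fare found on Tuesday"},
]
memory = fold(history)
```

The design point is that each kind of information lands in a slot with its own update rule, so a tool failure count is never blended into a narrative recap. A real system would presumably use a model to write these slots rather than fixed rules.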

So the practical takeaway is less 'raw wins' and more 'cheap, eager compression loses.' Architectures that prioritize what's worth keeping rather than summarizing everything point the same direction: neural memory that preferentially stores *surprising* tokens rather than averaging the stream (see "Can neural memory modules scale language models beyond attention limits?"), and even memoryless reasoning that deliberately drops accumulated history to avoid the baggage that bloats long chains (see "Can reasoning systems forget history without losing coherence?"). The honest answer: episodic raw memory does outperform summaries when the task hinges on details that consolidation discards, but the sources suggest that a well-structured, selective, adequately computed summary can beat both raw retention and naive summarization.
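The idea of storing what is surprising rather than averaging the stream can be sketched in a few lines. This is a deliberately simplified stand-in: it measures surprise as distance from a running mean, which is an assumption made for illustration and not the mechanism of the cited neural memory architecture. The function name and threshold are hypothetical.

```python
def surprise_gated_store(stream, threshold):
    """Keep only values that deviate from the running mean by more than threshold."""
    kept = []
    mean, n = 0.0, 0
    for x in stream:
        # The first value has no history to compare against, so treat it as surprising.
        surprise = abs(x - mean) if n else float("inf")
        if surprise > threshold:
            kept.append(x)
        n += 1
        mean += (x - mean) / n  # incremental running mean over everything seen
    return kept


stream = [1.0, 1.0, 1.0, 9.0, 1.0]
kept = surprise_gated_store(stream, threshold=3.0)
averaged = sum(stream) / len(stream)
```

On this toy stream the gate keeps the first value and the outlier, while a plain average would blur the outlier into a single number that matches no observation. The selective store is smaller than the raw log yet still holds the one event that broke the pattern.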

Sources (7 notes)

Does agent memory degrade when continuously consolidated?

LLM-consolidated textual memory degrades as experience accumulates, eventually performing worse than episodic-only retention. GPT-5.4 failed 54% of previously-solved problems after consolidation, with three mechanisms identified: misgrouping, applicability stripping, and overfitting on narrow streams.

Can a single model replace retrieval for long-term conversation memory?

COMEDY merges memory generation, compression, and response into one operation, tracking event recaps, user portraits, and relationship dynamics without vector-DB retrieval. However, empirical work shows that continuous reprocessing follows an inverted-U curve, degrading below a no-memory baseline due to misgrouping, context loss, and overfitting.

Does abstract preference knowledge outperform specific interaction recall?

PRIME framework shows semantic memory (preference summaries, parametric encodings) consistently beats episodic memory (retrieved past interactions) across models. Recency-based recall outperforms similarity-based retrieval, and task fine-tuning exceeds preference tuning methods.

Can agents compress their own memory without losing critical details?

DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Can neural memory modules scale language models beyond attention limits?

Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.

Can reasoning systems forget history without losing coherence?

Atom of Thoughts decomposes problems into DAGs and contracts them iteratively, ensuring each state depends only on the current problem—not prior steps. This memoryless approach eliminates historical baggage that bloats reasoning while maintaining answer equivalence.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains open: *Can episodic raw memory outperform consolidated summaries in practice?* Treat the findings below as dated claims (2024–2026) to be re-tested against the current frontier, not as settled truth.

What a curated library found — and when (2024–2026):
• Continuously summarized agent memory follows an inverted-U curve, degrading over time and eventually underperforming no-memory baselines; one model failed >50% of problems it had previously solved (~2024–2025).
• Summarization fails via three mechanisms: misgrouping events, stripping context that determines applicability, and overfitting to narrow experience slices (~2024).
• For personalization tasks, abstract preference summaries beat retrieving specific past episodes; task design determines the winner, not memory format alone (~2025).
• Structured episodic + working + tool schemas with selective consolidation and adequate compute avoid the degradation that plagues naive summarization (~2025).
• Memoryless reasoning and surprise-selective neural memory both outperform naive history accumulation (~2025–2026).

Anchor papers (verify; mind their dates):
• arXiv:2402.11975 (2024-02) — Compress to Impress
• arXiv:2501.00663 (2024-12) — Titans: Learning to Memorize at Test Time
• arXiv:2605.12978 (2026-05) — Useful Memories Become Faulty When Continuously Updated
• arXiv:2502.12018 (2025-02) — Atom of Thoughts for Markov LLM Test-Time Scaling

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For every finding above, assess whether newer models (proprietary or open-source LLMs released after 2026-05), improved consolidation methods (e.g., hierarchical summarization, learned compression policies), better evaluation suites, or orchestration patterns (multi-pass refinement, caching strategies, hierarchical retrieval) have since relaxed or overturned it. Separate the durable core question (when does context detail matter for transfer?) from the perishable limitation (cheap, eager compression loses; is this still true?). Flag what resolved each constraint.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work** from the last ~6 months, especially papers showing raw memory failure modes or summaries that do NOT degrade.
(3) **Propose 2 research questions** that assume the memory regime may have evolved: e.g., does adaptive selective summarization (summarizing only non-salient episodes) now close the raw-vs-summary gap? Can learned compression objectives (e.g., RL-trained summary quality) prevent the inverted-U?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
