Can compressive memory track what matters most across 35 conversation sessions?

This explores whether a single compressing memory — one model that keeps rewriting a summary of the conversation instead of looking things up — can actually hold onto what matters over many sessions, and where that approach breaks down.

Compressive memory collapses a long conversation history into one continuously rewritten summary rather than retrieving past turns on demand. The question is whether that summary can keep what matters across dozens of sessions, and the corpus has a direct answer with a warning attached. COMEDY folds memory generation, compression, and response into a single operation, tracking event recaps, user portraits, and relationship dynamics with no vector database in the loop (see 'Can a single model replace retrieval for long-term conversation memory?'). The appeal is obvious: no retrieval bottleneck, no guessing which old turn is relevant. But the same note flags a fragile consolidation pattern: continuous reprocessing follows an inverted-U curve, helping at first and then hurting, and can actually drop *below* a no-memory baseline once misgrouping, lost context, and overfitting accumulate. So the honest answer to 'across 35 sessions' is: it can, until it can't, and the failure is gradual rather than obvious.
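
The rewrite-instead-of-retrieve loop described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not COMEDY's implementation: the `Memory` fields only mirror the three things the note says COMEDY tracks, `toy_compressor` is a hypothetical stand-in for the model call, and the `pref:` / `rel:` prefixes and the `budget` cap are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Memory:
    """One rewritten summary; the three fields are illustrative names."""
    event_recaps: List[str] = field(default_factory=list)
    user_portrait: List[str] = field(default_factory=list)
    relationship: List[str] = field(default_factory=list)


# Hypothetical signature: (old memory, new session turns) -> new memory.
Compressor = Callable[[Memory, List[str]], Memory]


def toy_compressor(memory: Memory, turns: List[str], budget: int = 3) -> Memory:
    """Stand-in for a model call: routes turns by an invented prefix tag,
    then keeps only the newest `budget` items per field."""
    recaps = list(memory.event_recaps)
    portrait = list(memory.user_portrait)
    relationship = list(memory.relationship)
    for turn in turns:
        if turn.startswith("pref:"):
            portrait.append(turn[len("pref:"):].strip())
        elif turn.startswith("rel:"):
            relationship.append(turn[len("rel:"):].strip())
        else:
            recaps.append(turn.strip())
    return Memory(recaps[-budget:], portrait[-budget:], relationship[-budget:])


def run_sessions(sessions: List[List[str]], compress: Compressor = toy_compressor) -> Memory:
    memory = Memory()
    for turns in sessions:
        # No retrieval step: the rewritten summary is the only state carried forward.
        memory = compress(memory, turns)
    return memory


sessions = [
    ["pref: vegetarian", "booked a trip to Lisbon"],
    ["rel: prefers a casual tone", "trip was postponed"],
    ["asked about trains", "asked about hotels", "asked about visas"],
]
final = run_sessions(sessions)
print(final.event_recaps)
print(final.user_portrait)
```

After three sessions the event recaps hold only the latest questions, and the Lisbon booking has been silently overwritten. That is one small instance of the lost-context failure the note describes: nothing errors, the summary just stops containing it.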

The more interesting discovery is that several notes converge on *why* compression decays, and they point to selection as the missing ingredient. Including everything turns out to hurt: selective history retrieval beats full-context inclusion because topic switches inject irrelevant information, and jointly learning what to select beats both full context and human annotation (see 'Does including all conversation history actually help retrieval?'). Compressive memory is, in a sense, full-context inclusion smeared into a summary, which is exactly why 'what matters most' is the hard part. A summary that compresses indiscriminately carries forward the same noise that selective retrieval was designed to drop.
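
A minimal sketch of the selection idea, assuming a much cruder selector than the note describes: the cited work learns what to select jointly with retrieval, whereas this example swaps in plain token overlap (Jaccard similarity) with an arbitrary threshold. The function names, the threshold value, and the sample turns are all illustrative.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection over size of the union; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def select_history(history: List[str], query: str, threshold: float = 0.15) -> List[str]:
    """Keep only earlier turns whose token overlap with the query meets the threshold.
    Token overlap is a stand-in for a learned selector, not the method in the note."""
    q = tokens(query)
    return [turn for turn in history if jaccard(tokens(turn), q) >= threshold]


history = [
    "I want to book a hotel in Lisbon",
    "What is the weather like on Mars",  # topic switch
    "The hotel should be near the Lisbon train station",
]
query = "Which Lisbon hotel is cheapest"

selected = select_history(history, query)
print(selected)
```

The off-topic turn is dropped before anything downstream sees it. A compressor that summarizes all three turns has no equivalent filter unless one is built in.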

There's a sharper reframe hiding in the corpus: the bottleneck may not be memory at all, but *compute*. One note argues the long-context problem is really the cost of consolidating evicted context into the model's fast weights during offline 'sleep' phases, and that performance keeps improving with more consolidation passes, a test-time scaling pattern (see 'Is long-context bottleneck really about memory or compute?'). Read alongside COMEDY's inverted-U, this suggests the decay across sessions isn't because the summary is too small, but because each cheap, in-line rewrite under-processes what it absorbs. Tracking what matters across 35 sessions might be less about a bigger memory and more about spending more thinking on each consolidation.

What to actually *store* is its own question, and here the corpus pushes against raw compression. The PRIME work finds that semantic memory (abstracted preference summaries) consistently beats episodic recall of specific past interactions, and notably that recency-based recall beats similarity-based retrieval (see 'Does abstract preference knowledge outperform specific interaction recall?'). That's an argument *for* compression done right: distill preferences, don't hoard transcripts. The recommender-systems angle adds the structural piece compression tends to flatten: users have at least three distinct preference channels (current session, historical dialogue, look-alike users), and collapsing them loses signal that traditional systems proved valuable (see 'Can conversational recommenders recover lost preference signals from history?').
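
To make the recency-versus-similarity distinction concrete, here is a small sketch of the two recall strategies over a toy episodic store. It is not PRIME's setup: the `(session index, text)` episode format, the word-overlap scorer, and the sample data are assumptions chosen only to show how the two strategies can disagree.

```python
from typing import List, Tuple

# Hypothetical episode format: (session index, text). Higher index = more recent.
Episode = Tuple[int, str]


def recall_by_recency(episodes: List[Episode], k: int) -> List[str]:
    """Return the k most recent episodes, newest first."""
    newest_first = sorted(episodes, key=lambda e: e[0], reverse=True)
    return [text for _, text in newest_first[:k]]


def recall_by_similarity(episodes: List[Episode], query: str, k: int) -> List[str]:
    """Return the k episodes sharing the most words with the query.
    Ties are broken in favour of the more recent episode."""
    q = set(query.lower().split())

    def score(e: Episode) -> Tuple[int, int]:
        return (len(q & set(e[1].lower().split())), e[0])

    ranked = sorted(episodes, key=score, reverse=True)
    return [text for _, text in ranked[:k]]


episodes = [
    (1, "likes spicy food"),
    (2, "asked about spicy ramen"),
    (3, "now avoids spicy food"),
    (4, "booked a flight"),
]

recent = recall_by_recency(episodes, k=2)
similar = recall_by_similarity(episodes, "spicy food recommendations", k=2)
print(recent)
print(similar)
```

In this toy case similarity recall surfaces the superseded preference ('likes spicy food') alongside its correction, while recency recall does not. That is one plausible reading of why recency might win for personalization, offered as an illustration rather than as evidence for the PRIME result.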

The thing you might not have known you wanted to know: 'what matters most' in a conversation isn't only informational. One note argues that conversation maintenance (reference repair, topic hand-off, the relational glue) is social action that models never learn because training rewards information prediction, not relational work (see 'Why don't language models develop conversation maintenance skills?'). A compressive memory optimized to summarize *content* across 35 sessions can faithfully track facts and still drop the thread of the relationship, which, over that many sessions, is often the thing a user most expects to be remembered.

Sources (6 notes)

Can a single model replace retrieval for long-term conversation memory?

COMEDY merges memory generation, compression, and response into one operation, tracking event recaps, user portraits, and relationship dynamics without vector-DB retrieval. However, empirical work shows continuous reprocessing follows an inverted-U curve, degrading below a no-memory baseline due to misgrouping, context loss, and overfitting.

Does including all conversation history actually help retrieval?

Research shows that automatically selecting relevant previous turns improves retrieval effectiveness more than including all context. Topic switches inject irrelevant information; joint optimization of selection and retrieval beats both full-context baselines and human annotation.

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Does abstract preference knowledge outperform specific interaction recall?

The PRIME framework shows that semantic memory (preference summaries, parametric encodings) consistently beats episodic memory (retrieved past interactions) across models. Recency-based recall outperforms similarity-based retrieval, and task fine-tuning exceeds preference tuning methods.

Can conversational recommenders recover lost preference signals from history?

Current conversational recommender systems (CRS) use only the active dialogue session to infer preferences, losing the item-CF and user-CF (collaborative filtering) signals proven valuable in traditional recommenders. Integrating the current session, historical dialogues, and look-alike users, all conditioned on current intent, recovers essential user representation structure.

Why don't language models develop conversation maintenance skills?

Humans keep conversations smooth through implicit techniques like reference repair and topic hand-off, which sustain relational interaction rather than convey information. Language models don't develop these because training signals reward information prediction, not relational work.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher stress-testing claims about compressive memory in long-horizon dialogue. The question remains open: can compressive memory reliably track what matters most across 35+ conversation sessions?

What a curated library found — and when (dated claims, not current truth):
Findings span 2021–2025, concentrated in 2024–2025:
• Compressive memory (single-model, in-line summarization) follows an inverted-U curve — performance degrades once misgrouping, lost context, and overfitting accumulate, eventually dropping below no-memory baseline (2024).
• Selective history retrieval outperforms full-context inclusion; indiscriminate compression carries forward noise that selective methods were designed to drop (2023–2024).
• The bottleneck may be *compute*, not memory size: performance improves with more consolidation passes during offline 'sleep' phases, suggesting test-time scaling on context eviction (2024).
• Semantic memory (abstracted preference summaries) beats episodic recall; recency-based retrieval outperforms similarity-based retrieval for personalization (2025).
• Conversation maintenance — reference repair, relational glue — is social action rarely captured in training, so compression optimized for content across 35 sessions can lose the relationship thread (2023–2024).

Anchor papers (verify; mind their dates):
• arXiv:2402.11975 (Feb 2024): Compress to Impress — empirical inverted-U on long-term compressive memory.
• arXiv:2507.04607 (Jul 2025): PRIME — semantic memory and cognitive abstraction vs. episodic recall.
• arXiv:2306.02553 (Jun 2023): Learning to Relate to Previous Turns, on selecting which earlier turns to use for retrieval.
• arXiv:2512.24601 (Dec 2025): Recursive Language Models — potential relevance to iterative consolidation.

Your task:
(1) RE-TEST EACH CONSTRAINT. For the inverted-U decay, the compute-as-bottleneck hypothesis, and the selective-vs.-full-context trade-off: judge whether newer architectures (e.g., mixture-of-experts, sparse consolidation, or external retrieval integrated into forward passes), training objectives (e.g., explicit memory-selection loss), or inference harnesses (streaming caches, resumable consolidation tokens) have since relaxed the decay curve or shifted the compute–quality frontier. Separate the durable question (how to maintain signal over 35+ sessions) from the perishable limitation (current in-line compression is too cheap).
(2) Surface the strongest *contradicting* or *superseding* work from the last ~6 months. Look especially for papers on retrieval-augmented generation with memory, or dialogue systems that explicitly learn what to compress.
(3) Propose 2 research questions that *assume* the regime may have moved: (a) If compute budget is the real lever, what is the minimal consolidation overhead needed to keep the inverted-U from inverting? (b) Can a compressive system learn to preserve relational semantics (reference repair, turn adjacency) without explicit social-action annotations?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
