How should agents compress episodic interactions into working memory without accumulation?

This explores how agents can fold a growing stream of past interactions into a compact working memory — and why naive 'just keep consolidating everything' actually makes them worse, not better.

The corpus has a surprisingly pointed answer: the danger isn't running out of space, it's compressing *carelessly*. The clearest warning sign is the inverted-U curve: agents that continuously re-consolidate their textual memory improve for a while, then get *worse than having no memory at all* (see "Does agent memory degrade when continuously consolidated?"). One study found a model failing 54% of the problems it had previously solved, traced to three failure modes: misgrouping unrelated events, stripping away the conditions that made a lesson applicable, and overfitting to a narrow recent stream. The same fragile pattern shows up when a single model handles generation, compression, and response all at once (see "Can a single model replace retrieval for long-term conversation memory?"). So 'without accumulation' is the right instinct, but the cure (aggressive merging) can be worse than the disease.

The most promising designs avoid this by *structuring* memory rather than flattening it. DeepAgent's autonomous memory folding doesn't dump history into one blob. It sorts history into distinct episodic, working, and tool schemas, which is what lets compression happen without the degradation that plagues naive consolidation (see "Can agents compress their own memory without losing critical details?"). RAISE pushes the same idea further, showing that agent memory naturally splits into four components across two granularities, dialogue-level (conversation history, scratchpad) versus turn-level (examples, task trajectory), and that each component wants its *own* update and eviction policy (see "How should agent memory split across time scales?"). The lesson across both: compress within a type, never across types.
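
The "compress within a type, never across types" rule can be sketched as a store with one queue and one capacity per memory type. This is a minimal illustration, not DeepAgent's or RAISE's actual design: the `TypedMemory` class, the type names, the capacities, and the `join_summary` placeholder (standing in for an LLM summarization call) are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, List


def join_summary(entries: List[str]) -> str:
    """Placeholder summarizer; a real agent would call an LLM here."""
    return "SUMMARY(" + " | ".join(entries) + ")"


@dataclass
class TypedMemory:
    """Hypothetical store: one queue and one capacity per memory type."""
    capacity: Dict[str, int]
    summarize: Callable[[List[str]], str]
    stores: Dict[str, Deque[str]] = field(init=False)

    def __post_init__(self) -> None:
        if any(cap < 1 for cap in self.capacity.values()):
            raise ValueError("every capacity must be at least 1")
        self.stores = {kind: deque() for kind in self.capacity}

    def add(self, kind: str, entry: str) -> None:
        if kind not in self.stores:
            raise KeyError(f"unknown memory type: {kind}")
        store = self.stores[kind]
        store.append(entry)
        # Each type enforces its own limit, so policies stay independent.
        if len(store) > self.capacity[kind]:
            self._fold(kind)

    def _fold(self, kind: str) -> None:
        """Merge the oldest entries of ONE type into a single summary entry."""
        store = self.stores[kind]
        n = max(2, len(store) // 2)
        oldest = [store.popleft() for _ in range(n)]
        store.appendleft(self.summarize(oldest))
```

Because `_fold` only ever reads from the queue it was called on, a tool-usage note can never be merged with a dialogue recap, which is the misgrouping failure the flat approach invites.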

The sharpest insight is that compression should be *asymmetric*. SkillRL treats successful episodes as concrete demonstrations worth keeping verbatim but distills failures into abstract lessons, and it beats uniform consolidation while using far less context (see "Should successful and failed episodes be processed differently?"). This echoes a subtler point from Reflexion: some material should *resist* compression entirely. Verbal self-reflections stored in episodic memory stay useful precisely because they're kept uncompressed and tied to an unambiguous success/failure signal; squeeze them and you lose the diagnosis (see "Can agents learn from failure without updating their weights?"). So 'without accumulation' doesn't mean 'compress everything'; it means knowing what to abstract, what to keep raw, and what to drop.
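
A rough sketch of the asymmetry, assuming each episode carries a binary success flag: successes pass through untouched, failures go through a distillation step. The `Episode` type, the `consolidate` function, and the `last_step_lesson` placeholder (a stand-in for an LLM-written lesson) are hypothetical names for illustration and are not taken from SkillRL or Reflexion.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Episode:
    task: str
    trajectory: List[str]
    success: bool  # assumed to come from an unambiguous environment signal


def last_step_lesson(ep: Episode) -> str:
    """Placeholder distiller; a real system would ask an LLM for the lesson."""
    return f"{ep.task}: avoid ending on '{ep.trajectory[-1]}'"


def consolidate(
    episodes: List[Episode],
    distill: Callable[[Episode], str],
) -> Tuple[List[Episode], List[str]]:
    """Keep successes verbatim; compress only failures into short lessons."""
    demos: List[Episode] = []
    lessons: List[str] = []
    for ep in episodes:
        if ep.success:
            demos.append(ep)             # stored raw, never summarized
        else:
            lessons.append(distill(ep))  # abstracted, trajectory dropped
    return demos, lessons
```

The context saving comes from the failure branch alone: a failed trajectory of any length collapses to one line, while the demonstrations that are actually reused stay intact.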

There's also a school of thought that says don't make the agent compress its own memory at all. An external, RL-trained context manager can prune for a frozen agent better than the agent can for itself. Crucially, it adapts the compression rate to the agent's reliability: strong agents get high-fidelity preservation, while weak agents need aggressive pruning to stay coherent (see "Can external managers compress context better than frozen agents?"). AgentFly takes the orthogonal route of treating memory operations *as* the learning mechanism, with separate case, subtask, and tool modules doing credit assignment without ever touching the model's weights (see "Can agents learn continuously from experience without updating weights?").
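
To make the reliability-dependent compression rate concrete, here is a small sketch that swaps the learned RL policy for a fixed linear rule. Everything in it is an assumption for illustration: the `keep_ratio` mapping, its `min_keep` and `max_keep` bounds, and the idea that each context item already has a relevance score from somewhere.

```python
from typing import List, Sequence


def keep_ratio(reliability: float, min_keep: float = 0.2, max_keep: float = 0.9) -> float:
    """Map an agent reliability estimate in [0, 1] to a fraction of context to keep.

    A fixed linear rule standing in for what the source describes as a learned policy.
    """
    r = min(1.0, max(0.0, reliability))
    return min_keep + (max_keep - min_keep) * r


def prune_context(items: Sequence[str], scores: Sequence[float], reliability: float) -> List[str]:
    """Keep the highest-scoring items, preserving their original order."""
    if len(items) != len(scores):
        raise ValueError("items and scores must have the same length")
    if not items:
        return []
    k = max(1, round(len(items) * keep_ratio(reliability)))
    top = sorted(range(len(items)), key=lambda i: scores[i], reverse=True)[:k]
    return [items[i] for i in sorted(top)]
```

With ten items, a reliability of 1.0 keeps nine of them and a reliability of 0.0 keeps two, which mirrors the claim that strong agents tolerate near-full context while weak ones need it cut hard.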

A less obvious finding: one strand of this research argues the real bottleneck was never storage capacity but *compute*, meaning the cost of transforming evicted context into internal state. Performance improves the more 'consolidation passes' you spend, like a sleep phase that runs longer on harder problems (see "Is long-context bottleneck really about memory or compute?"). And at the architecture level, Titans builds this in directly: it splits short-term attention from a long-term neural memory that preferentially stores *surprising* tokens, a principled stance on what's worth keeping rather than compressing uniformly and hoping (see "Can neural memory modules scale language models beyond attention limits?"). The throughline of the whole corpus: good memory compression is selective, structured, and asymmetric. Uniform consolidation is the trap, not the solution.
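
The "store what is surprising" idea can be illustrated with a toy stand-in. This is not how Titans works internally (it is a neural memory module). The sketch below instead scores each token by its surprisal under a running, Laplace-smoothed unigram count and keeps only tokens above a threshold. The function name, `threshold_bits`, and `vocab_size` are assumptions chosen for the example.

```python
import math
from collections import Counter
from typing import List, Sequence


def select_surprising(tokens: Sequence[str], threshold_bits: float, vocab_size: int = 1000) -> List[str]:
    """Keep tokens whose surprisal, given what has been seen so far, exceeds a threshold."""
    counts: Counter = Counter()
    total = 0
    kept: List[str] = []
    for tok in tokens:
        # Laplace-smoothed probability of this token given the history so far.
        p = (counts[tok] + 1) / (total + vocab_size)
        surprisal = -math.log2(p)
        if surprisal > threshold_bits:
            kept.append(tok)
        # The model is updated either way; only storage is gated.
        counts[tok] += 1
        total += 1
    return kept
```

A token repeated many times stops clearing the threshold after its first appearance, while a rare token arriving late still does, so storage tracks novelty instead of volume.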

Sources (10 notes)

Does agent memory degrade when continuously consolidated?

LLM-consolidated textual memory degrades as experience accumulates, eventually performing worse than episodic-only retention. GPT-5.4 failed 54% of previously-solved problems after consolidation, with three mechanisms identified: misgrouping, applicability stripping, and overfitting on narrow streams.

Can a single model replace retrieval for long-term conversation memory?

COMEDY merges memory generation, compression, and response into one operation, tracking event recaps, user portraits, and relationship dynamics without vector-DB retrieval. However, empirical work shows continuous reprocessing follows an inverted-U curve, degrading below no-memory baseline due to misgrouping, context loss, and overfitting.

Can agents compress their own memory without losing critical details?

DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.

How should agent memory split across time scales?

RAISE shows that agent memory consists of four components at two granularities: dialogue-level (conversation history, scratchpad) and turn-level (examples, task trajectory). This granularity distinction predicts different failure modes and update policies for each component.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Can agents learn from failure without updating their weights?

Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.

Can external managers compress context better than frozen agents?

An external RL-trained manager can adaptively prune context for frozen agents, with the key insight that stronger agents benefit from high-fidelity preservation while weaker agents need aggressive compression to stay reliable.

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Can neural memory modules scale language models beyond attention limits?

Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about agent memory compression from a curated 2023–2026 arXiv library. The question remains open: **How should agents compress episodic interactions into working memory without harmful accumulation?**

What a curated library found — and when (dated claims, not current truth):

• Continuous re-consolidation of textual memory follows an inverted-U curve: agents improve initially, then degrade to below the no-memory baseline (~2026). Failure modes include misgrouping events, stripping applicability conditions, and overfitting to recent streams.
• Structured, typed memory (episodic/working/tool schemas; dialogue- vs. turn-level components) avoids degradation that plagues flat consolidation; compression should happen within types, never across them (~2025–2026).
• Asymmetric compression (keeping successful episodes verbatim, distilling failures into abstract lessons) outperforms uniform consolidation while using less context (~2025).
• Verbal self-reflections stored uncompressed in episodic memory remain diagnostic; squeezing them destroys their value (~2025).
• External, RL-trained context managers adapt compression rate to agent reliability better than self-compression (~2026); compute cost of transforming evicted context is the real bottleneck (~2026).
• Neural memory modules that preferentially store surprising tokens (rather than uniform compression) offer principled selectivity (~2024).

Anchor papers (verify; mind their dates):
• arXiv:2605.12978 — Useful Memories Become Faulty When Continuously Updated by LLMs (2026-05)
• arXiv:2510.21618 — DeepAgent: A General Reasoning Agent with Scalable Toolsets (2025-10)
• arXiv:2605.30785 — Learning Agent-Compatible Context Management for Long-Horizon Tasks (2026-05)
• arXiv:2501.00663 — Titans: Learning to Memorize at Test Time (2024-12)

**Your task:**

(1) **RE-TEST EACH CONSTRAINT.** For every finding above, judge whether newer models (GPT-4o, o3, Claude 4, open-weights competitors), training methods (continued pretraining, online RL, memory-augmented tuning), tooling (memory SDKs, retrieval harnesses), orchestration (multi-agent composition, persistent cache layers), or evaluation (long-horizon benchmarks post-2026) have since relaxed or overturned it. Separate the durable question (likely still open: *what compression strategy avoids both bloat and erasure?*) from the perishable limitation (e.g., *single-model generation+compression is fragile* — has orchestration obsoleted this?). Cite what resolved it; flag what still appears to hold.

(2) **Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months.** Has any paper argued uniform consolidation *can* work under specific conditions (e.g., with recency-weighted loss, gating mechanisms, or larger hidden states)? Has external memory management been overtaken by in-context learning or retrieval-augmented generation?

(3) **Propose 2 research questions that ASSUME the regime may have moved:** e.g., *Do multimodal agents compress video/code episodes differently than text?* or *Can compression be delegated to a frozen, specialized adapter rather than an external RL manager?*

Cite arXiv IDs; flag anything you cannot ground in a real paper.
