Do agents prefer raw experience over condensed summaries of past actions?

This explores whether AI agents actually use the raw, unedited record of what they did before, or whether they lean on tidy summaries of past actions — and the corpus has a surprisingly sharp answer.

The most direct finding in the corpus says yes, strongly, but not in the way you'd hope. Across 10 models and 9 environments, perturbing an agent's raw experience changed its behavior significantly, while scrambling the condensed summaries had minimal effect (see: Why do LLM agents ignore condensed experience summaries?). Agents don't just prefer raw experience; they systematically ignore the summaries. Three causes drive this: summaries lose load-bearing details, models privilege whatever is in immediate context over anything retrieved, and pretrained knowledge already crowds out external experience. So the 'preference' is partly a failure: the condensed version was supposed to help, and the agent quietly tunes it out.
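The perturbation test described above can be sketched in a few lines. This is a minimal illustration, not the protocol from the cited study: `toy_agent`, `reverse_words`, `sensitivity`, and the two example contexts are all hypothetical, and the toy agent is hard-coded to read only the raw trajectory so that the asymmetry is visible.

```python
from typing import Callable, Dict, List

Context = Dict[str, str]
Agent = Callable[[Context], str]


def reverse_words(text: str) -> str:
    """A crude, deterministic perturbation: reverse the word order."""
    return " ".join(reversed(text.split()))


def sensitivity(agent: Agent, contexts: List[Context], field: str) -> float:
    """Fraction of contexts where perturbing `field` changes the chosen action."""
    changed = 0
    for ctx in contexts:
        perturbed = dict(ctx)
        perturbed[field] = reverse_words(ctx[field])
        if agent(perturbed) != agent(ctx):
            changed += 1
    return changed / len(contexts)


def toy_agent(ctx: Context) -> str:
    """Hypothetical stand-in for an LLM agent that reads only the raw trajectory."""
    return "find_key" if "door was locked" in ctx["raw_experience"] else "open_door"


contexts = [
    {"raw_experience": "the door was locked so I searched the desk",
     "summary": "locked doors need a key"},
    {"raw_experience": "the hallway was empty and the door opened",
     "summary": "unlocked doors can be opened directly"},
]

raw_sens = sensitivity(toy_agent, contexts, "raw_experience")
summary_sens = sensitivity(toy_agent, contexts, "summary")
print(raw_sens, summary_sens)
```

With a real model in place of `toy_agent`, a summary sensitivity near zero alongside a clearly nonzero raw sensitivity would be the signature of an agent that tunes the summary out.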

That reframes the real question from 'raw vs. summary' to 'why does summarization keep breaking?' And here the corpus gets interesting, because compression isn't doomed; most compression is just done badly. Continuous consolidation follows an inverted-U: helpful at first, then actively harmful. One system failed 54% of previously solved problems after consolidation, through three mechanisms: misgrouping, applicability stripping, and overfitting on narrow streams (see: Does agent memory degrade when continuously consolidated?). The lesson isn't 'keep everything raw.' It's that naive summarization strips exactly the situational detail that made the experience usable.

The counter-evidence is where you learn something you didn't expect: condensation works when the agent does it itself and keeps structure. Autonomous memory folding lets agents consolidate history into episodic, working, and tool memory schemas, cutting token overhead while still letting the agent pause and rethink strategy. Autonomy and structure together avoid the degradation that plagues naive consolidation (see: Can agents compress their own memory without losing critical details?). Going further, storing strategy-level reasoning hints distilled from both successes and failures beats raw trajectory storage (see: Can agents learn better from their failures than successes?). So the winning form isn't raw logs and it isn't lossy summaries; it's structured distillation that keeps the 'why' and drops the noise.
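To make 'structured distillation' concrete, here is a minimal sketch of what such a memory could look like. Only the three schema names (episodic, working, tool) and the idea of strategy-level hints come from the notes; every field name, the `fold` function, and its rule-based routing are assumptions for illustration. In a real system the agent itself would author the fold.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class StrategyHint:
    """Strategy-level item that keeps the 'why' and when it applies (hypothetical fields)."""
    title: str
    rationale: str       # why the strategy worked or failed
    applies_when: str    # situational condition that naive summaries tend to strip
    from_failure: bool   # hints come from failures as well as successes


@dataclass
class FoldedMemory:
    episodic: List[str] = field(default_factory=list)  # key events in the task so far
    working: List[str] = field(default_factory=list)   # current sub-goal only
    tool: List[str] = field(default_factory=list)      # notes on tool behavior
    hints: List[StrategyHint] = field(default_factory=list)


def fold(history: List[Dict[str, str]], memory: Optional[FoldedMemory] = None) -> FoldedMemory:
    """Route raw steps into the three schemas. The rule-based routing is a
    placeholder for a fold that the model would write itself."""
    memory = memory if memory is not None else FoldedMemory()
    for step in history:
        if step["kind"] == "tool_call":
            memory.tool.append(f"{step['tool']}: {step['text']}")
        elif step["kind"] == "goal":
            memory.working = [step["text"]]  # replace, do not accumulate
        else:
            memory.episodic.append(step["text"])
    return memory


history = [
    {"kind": "goal", "text": "find the config file"},
    {"kind": "tool_call", "tool": "search", "text": "returns nothing for hidden files"},
    {"kind": "event", "text": "listing the home directory showed .apprc"},
    {"kind": "goal", "text": "read .apprc"},
]
memory = fold(history)
memory.hints.append(StrategyHint(
    title="List before searching",
    rationale="search skipped hidden files, a directory listing did not",
    applies_when="the target may be a dotfile",
    from_failure=True,
))
print(memory.working, memory.tool, memory.episodic)
```

The `applies_when` field is the design choice worth noticing: it stores the applicability condition explicitly, which is the kind of detail the consolidation failures above are described as stripping.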

Laterally, this connects to a deeper debate about what an agent's feedback even contains. A scalar reward records how well an action did but throws away directive information, meaning how the action should change, which is exactly the kind of detail flat summaries also discard (see: Can scalar rewards capture all the information in agent feedback?). And the broader memory literature has stopped treating this as raw-vs-condensed at all: memory decomposes into components at different granularities with different update policies (see: How should agent memory split across time scales?), and a 2025 survey reframes the whole field along forms, functions, and dynamics rather than the tired short-term/long-term split (see: Can three axes replace the short-term long-term memory split?).
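A small sketch of the information loss, under the assumption that feedback can be modeled as a score plus a piece of text. The `Feedback` class and `to_scalar` function are hypothetical names, not taken from any cited work.

```python
from dataclasses import dataclass


@dataclass
class Feedback:
    score: float     # evaluative: how well the action performed
    directive: str   # directive: how the action should change


def to_scalar(fb: Feedback) -> float:
    """Collapse feedback to a scalar reward; the directive part is dropped."""
    return fb.score


a = Feedback(score=0.2, directive="cite the source before answering")
b = Feedback(score=0.2, directive="shorten the answer to one paragraph")

# Two different corrections become indistinguishable once reduced to a reward.
same_reward = to_scalar(a) == to_scalar(b)
same_feedback = a == b
print(same_reward, same_feedback)
```

The two feedback items call for different changes but yield the same reward, which is the sense in which the scalar discards directive information.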

The thing worth walking away with: agents preferring raw experience is a symptom, not a design principle. The frontier isn't choosing between firehose and summary; it's building structured, agent-authored memory that compresses the reasoning while keeping the details that summaries habitually destroy. Reliability, in this view, comes from externalizing memory into a well-built harness rather than trusting the model to re-derive it every time (see: Where does agent reliability actually come from?).

Sources 8 notes

Why do LLM agents ignore condensed experience summaries?

Across 10 LLM models and 9 environments, perturbing raw experience changed agent behavior significantly, while altering condensed experience had minimal effect. Three causes drive this asymmetry: summaries lose critical details, models favor immediate context over retrieved information, and pretrained knowledge reduces reliance on external experience.

Does agent memory degrade when continuously consolidated?

LLM-consolidated textual memory degrades as experience accumulates, eventually performing worse than episodic-only retention. GPT-5.4 failed 54% of previously-solved problems after consolidation, with three mechanisms identified: misgrouping, applicability stripping, and overfitting on narrow streams.

Can agents compress their own memory without losing critical details?

DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.

Can agents learn better from their failures than successes?

ReasoningBank shows that storing strategy-level reasoning hints from both self-judged successes and failures outperforms success-only memory and raw trajectory storage. Coupled with test-time scaling, memory and compute compound rather than substitute, creating a novel scaling law where accuracy improves through cumulative interaction history.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

How should agent memory split across time scales?

RAISE shows that agent memory consists of four components at two granularities: dialogue-level (conversation history, scratchpad) and turn-level (examples, task trajectory). This granularity distinction predicts different failure modes and update policies for each component.

Can three axes replace the short-term long-term memory split?

A 2025 survey reframes agent memory along forms (token/parametric/latent), functions (factual/experiential/working), and dynamics (formation/evolution/retrieval), showing that short/long-term phenomena emerge from temporal patterns rather than architectural separation. This enables precise system comparison and replaces vague implementation-based claims.

Where does agent reliability actually come from?

Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a capability researcher re-testing constraints on LLM agent memory. The question remains open: Do agents actually prefer raw experience over condensed summaries—and if so, why? What changed?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026. Here is what was observed:
• Raw experience strongly shifts agent behavior across 10 models and 9 environments; scrambled summaries do not (2026-01, arXiv:2601.22436).
• Naive consolidation follows an inverted-U: summaries help initially, then harm performance, with one system failing 54% of previously-solved problems after consolidation (2025-09, arXiv:2509.25140).
• Agent-authored structural memory (episodic, working, tool schemas) preserves usability while cutting tokens—outperforming both raw logs and lossy summaries (2025-09, arXiv:2509.25140).
• Strategy-level reasoning distilled from successes AND failures beats raw trajectory storage (2025-09, arXiv:2509.25140).
• Memory decomposition into four components at two granularities with separate update policies is more explanatory than raw-vs-condensed framing (2026-01, arXiv:2601.11653).

Anchor papers (verify; mind their dates):
• arXiv:2601.22436 (2026-01): Large Language Model Agents Are Not Always Faithful Self-Evolvers
• arXiv:2509.25140 (2025-09): ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory
• arXiv:2512.13564 (2025-12): Memory in the Age of AI Agents: A Survey — Forms, Functions and Dynamics
• arXiv:2604.08224 (2026-04): Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness E

Your task:
(1) RE-TEST EACH CONSTRAINT. Does the 2026 finding that agents respond to raw experience but largely ignore condensed summaries still hold under newer model architectures (e.g., post-o3 reasoning models, multi-modal agents), larger context windows (200K+ tokens), or new memory harnesses (vector stores with hybrid retrievers, structured memory SDKs)? Separately: has anyone shown that agent-authored structure (the 2025-09 finding) generalizes to long-horizon or cross-domain tasks, or does it degrade in new domains? Plainly flag where the constraint STILL appears to hold versus where tooling or training has relaxed it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Has any recent paper shown that raw experience itself is unnecessary—that sufficiently compressed schema or learned abstractions match or beat raw traces? Or that summary quality has improved enough to close the gap?
(3) Propose 2 research questions that ASSUME the regime may have moved: (a) If structured agent-authored memory now dominates, what properties of that structure are load-bearing? Can you remove elements and identify minimal sufficient schemas? (b) Do agents need raw experience *at all* if summaries are authored by a separate reasoning model rather than the agent itself?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
