Do agents prefer raw experience over condensed summaries of past actions?
This explores whether AI agents actually use the raw, unedited record of what they did before, or whether they lean on tidy summaries of past actions — and the corpus has a surprisingly sharp answer.
The most direct finding in the corpus says yes, strongly, but not in the way you'd hope. Across 10 models and 9 environments, perturbing an agent's raw experience changed its behavior significantly, while altering the condensed summaries barely moved the needle (Why do LLM agents ignore condensed experience summaries?). Agents don't just prefer raw experience; they systematically ignore the summaries. Three causes drive this: summaries lose load-bearing details, models privilege whatever is in immediate context over anything retrieved, and pretrained knowledge crowds out external experience. So the 'preference' is partly a failure: the condensed version was supposed to help, and the agent quietly tunes it out.
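The asymmetry described above can be made concrete with a small measurement sketch. This is not the protocol used in the cited work; it only illustrates the idea of perturbing one memory channel at a time and counting how often the chosen action changes. The agent interface, the function `behavior_change_rate`, and the toy `raw_only_agent` are all assumptions made for illustration.

```python
from typing import Callable, Sequence

# Hypothetical agent interface: (raw_experience, summary, task) -> chosen action.
Agent = Callable[[str, str, str], str]


def behavior_change_rate(
    agent: Agent,
    tasks: Sequence[str],
    raw: str,
    summary: str,
    perturb: Callable[[str], str],
    target: str,
) -> float:
    """Fraction of tasks whose action changes when one memory channel is perturbed."""
    if target not in ("raw", "summary"):
        raise ValueError("target must be 'raw' or 'summary'")
    if not tasks:
        raise ValueError("need at least one task")
    changed = 0
    for task in tasks:
        baseline = agent(raw, summary, task)
        if target == "raw":
            perturbed = agent(perturb(raw), summary, task)
        else:
            perturbed = agent(raw, perturb(summary), task)
        changed += int(baseline != perturbed)
    return changed / len(tasks)


# Toy stand-in (an assumption): an agent that reads only the raw channel.
def raw_only_agent(raw: str, summary: str, task: str) -> str:
    return f"{task}|{raw}"


def reverse(text: str) -> str:
    return text[::-1]


tasks = ["t1", "t2"]
raw_rate = behavior_change_rate(
    raw_only_agent, tasks, "open door, take key", "get key", reverse, "raw"
)
summary_rate = behavior_change_rate(
    raw_only_agent, tasks, "open door, take key", "get key", reverse, "summary"
)
```

For this toy agent the raw channel changes every action and the summary channel changes none, which is the extreme version of the pattern reported above. A real agent is stochastic, so a faithful measurement would need repeated sampling per task rather than a single comparison.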
That reframes the real question from 'raw vs. summary' to 'why does summarization keep breaking?' And here the corpus gets interesting, because compression isn't doomed; most of it is simply done badly. Continuously consolidating memory follows an inverted-U: helpful at first, then actively harmful. After consolidation, one system failed 54% of the problems it had previously solved, through three mechanisms: misgrouping, applicability stripping, and overfitting on narrow streams (Does agent memory degrade when continuously consolidated?). The lesson isn't 'keep everything raw.' It's that naive summarization strips exactly the situational detail that made the experience usable.
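The "previously solved, now failing" figure is a regression rate, and it is simple to compute. The sketch below shows one way to do so, plus a hypothetical gate that rejects a consolidation step when too many solved problems regress. The function names and the 5% threshold are assumptions for illustration, not values from the cited work.

```python
from typing import AbstractSet


def regression_rate(
    solved_before: AbstractSet[str], solved_after: AbstractSet[str]
) -> float:
    """Share of previously solved problems that are no longer solved."""
    if not solved_before:
        raise ValueError("no previously solved problems to compare against")
    regressed = solved_before - solved_after
    return len(regressed) / len(solved_before)


def accept_consolidation(rate: float, max_regression: float = 0.05) -> bool:
    """Hypothetical gate: keep the consolidated memory only if regressions stay low."""
    return rate <= max_regression


# Problem IDs solved with episodic-only memory vs. after consolidation (made-up data).
before = {"p1", "p2", "p3", "p4"}
after = {"p1", "p5"}

rate = regression_rate(before, after)   # p2, p3, p4 regressed out of 4
keep = accept_consolidation(rate)
```

Newly solved problems such as `p5` do not offset regressions in this metric, which is deliberate: the failure described above is about losing capabilities the agent already had.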
The counter-evidence is where you learn something you didn't expect: condensation works when the agent does it itself and keeps structure. Autonomous memory folding lets agents consolidate interaction history into episodic, working, and tool memory schemas, cutting token overhead while preserving enough to pause and rethink strategy. It works because autonomy and structure together avoid the degradation that plagues naive consolidation (Can agents compress their own memory without losing critical details?). Going further, storing strategy-level reasoning hints distilled from both successes and failures actually beats raw trajectory storage (Can agents learn better from their failures than successes?). So the winning form isn't raw logs and it isn't lossy summaries; it's structured distillation that keeps the 'why' and drops the noise.
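As a rough illustration of what "structured distillation that keeps the why" might look like, the sketch below stores each memory item as a strategy hint with an explicit rationale and applicability conditions, and retrieves hints by overlap with the current context. The `StrategyHint` schema and the `retrieve` function are assumptions made for this example; they are not the data model of any system cited above.

```python
from dataclasses import dataclass
from typing import AbstractSet, List, Sequence


@dataclass
class StrategyHint:
    title: str               # short name for the strategy
    rationale: str           # the "why" that a flat summary tends to drop
    applies_when: List[str]  # situational tags that keep the hint from over-generalizing
    from_failure: bool       # True if distilled from a failed attempt


def retrieve(
    hints: Sequence[StrategyHint], context_tags: AbstractSet[str], limit: int = 3
) -> List[StrategyHint]:
    """Return the hints whose applicability tags overlap the context most."""
    scored = [(len(set(h.applies_when) & context_tags), h) for h in hints]
    matching = [pair for pair in scored if pair[0] > 0]
    # Sort on the overlap count only; ties keep their original order.
    matching.sort(key=lambda pair: pair[0], reverse=True)
    return [hint for _, hint in matching[:limit]]


hints = [
    StrategyHint(
        title="Check pagination before concluding a search is empty",
        rationale="Results past the first page were missed in an earlier run",
        applies_when=["search", "list"],
        from_failure=True,
    ),
    StrategyHint(
        title="Do not resubmit a failed form unchanged",
        rationale="The same validation error repeats until the field is fixed",
        applies_when=["form", "error"],
        from_failure=True,
    ),
]

selected = retrieve(hints, {"form", "error", "checkout"})
selected_titles = [h.title for h in selected]
```

Keeping `applies_when` as an explicit field is one way to guard against the applicability stripping described earlier: a hint with no matching condition is simply not retrieved.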
Laterally, this connects to a deeper debate about what an agent's feedback even contains. Reducing experience to a scalar reward throws away directive information (not just how well an action performed, but how it should change), which is exactly the kind of detail flat summaries also discard (Can scalar rewards capture all the information in agent feedback?). And the broader memory literature has stopped treating this as raw-vs-condensed at all: memory decomposes into components at different granularities with different update policies (How should agent memory split across time scales?), and a 2025 survey reframes the whole field along forms, functions, and dynamics rather than the tired short-term/long-term split (Can three axes replace the short-term long-term memory split?).
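The loss involved in collapsing feedback to a scalar can be shown in a few lines. In this sketch, `Feedback` and `to_scalar_reward` are hypothetical names; the point is only that two pieces of feedback with different directive content become indistinguishable once reduced to their score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    score: float     # evaluative: how well the action performed
    directive: str   # directive: how the action should change


def to_scalar_reward(feedback: Feedback) -> float:
    """Keep the evaluation and discard the directive content."""
    return feedback.score


a = Feedback(score=0.4, directive="Query the orders table before the refunds table")
b = Feedback(score=0.4, directive="Add a timeout to the retry loop")

same_reward = to_scalar_reward(a) == to_scalar_reward(b)
same_guidance = a.directive == b.directive
```

Both actions earn the same reward, yet they call for unrelated corrections. A memory that stores only the scalar cannot tell them apart, which mirrors how a flat summary drops the detail that made the experience actionable.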
The thing worth walking away with: agents preferring raw experience is a symptom, not a design principle. The frontier isn't choosing between firehose and summary; it's building structured, agent-authored memory that compresses the reasoning while keeping the details that summaries habitually destroy. Reliability, in this view, comes from externalizing memory into a well-built harness rather than trusting the model to re-derive it every time (Where does agent reliability actually come from?).
Sources (8 notes)
Across 10 LLM models and 9 environments, perturbing raw experience changed agent behavior significantly, while altering condensed experience had minimal effect. Three causes drive this asymmetry: summaries lose critical details, models favor immediate context over retrieved information, and pretrained knowledge reduces reliance on external experience.
LLM-consolidated textual memory degrades as experience accumulates, eventually performing worse than episodic-only retention. GPT-5.4 failed 54% of previously-solved problems after consolidation, with three mechanisms identified: misgrouping, applicability stripping, and overfitting on narrow streams.
DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.
ReasoningBank shows that storing strategy-level reasoning hints from both self-judged successes and failures outperforms success-only memory and raw trajectory storage. Coupled with test-time scaling, memory and compute compound rather than substitute, creating a novel scaling law where accuracy improves through cumulative interaction history.
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
RAISE shows that agent memory consists of four components organized by two design axes: dialogue-level (conversation history, scratchpad) versus turn-level (examples, task trajectory). This granularity distinction predicts different failure modes and update policies for each component.
A 2025 survey reframes agent memory along forms (token/parametric/latent), functions (factual/experiential/working), and dynamics (formation/evolution/retrieval), showing that short/long-term phenomena emerge from temporal patterns rather than architectural separation. This enables precise system comparison and replaces vague implementation-based claims.
Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.