What drives the choice between storing raw episodes versus abstracted rules?
This explores the design trade-off in agent memory: when should a system keep concrete, full-detail records of what happened (raw episodes) versus distilling them into compact, reusable rules or summaries (abstracted rules)?
The corpus suggests the choice is driven less by storage cost than by a surprising fact: models often don't trust their own abstractions. The sharpest finding is that LLM agents lean heavily on raw experience and quietly ignore condensed summaries (see: Why do LLM agents ignore condensed experience summaries?). Across 10 models and 9 environments, perturbing the raw trace changed behavior significantly, while perturbing the summary did almost nothing. Three causes are cited: compression strips the very details the model needed, models favor immediate context over retrieved information, and pretrained knowledge already covers the generic lessons a summary tends to capture. So the default bias toward 'abstract to save context' can be self-defeating: you pay to summarize, and the model reads past it.
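The perturbation test described above can be sketched as a small harness. This is only an illustration of the measurement, not the study's actual protocol: `sensitivity`, `toy_agent`, and `append_noise` are hypothetical names, and the toy agent is deliberately rigged to read only the raw trace, so the numbers it produces say nothing about any real model.

```python
from typing import Callable, List, Tuple

# Assumed interface: an agent maps (raw_trace, summary) to an action string.
Agent = Callable[[str, str], str]


def sensitivity(
    agent: Agent,
    cases: List[Tuple[str, str]],
    perturb: Callable[[str], str],
) -> Tuple[float, float]:
    """Return the fraction of cases where the action changes when the raw
    trace is perturbed, and the fraction where it changes when the summary
    is perturbed."""
    if not cases:
        raise ValueError("need at least one (raw_trace, summary) case")
    raw_changed = 0
    summary_changed = 0
    for raw, summary in cases:
        baseline = agent(raw, summary)
        raw_changed += agent(perturb(raw), summary) != baseline
        summary_changed += agent(raw, perturb(summary)) != baseline
    n = len(cases)
    return raw_changed / n, summary_changed / n


def toy_agent(raw: str, summary: str) -> str:
    # Stand-in for a model call: repeats the last step of the raw trace
    # and never looks at the summary (rigged on purpose).
    return raw.split(";")[-1].strip()


def append_noise(text: str) -> str:
    # Hypothetical perturbation: tack an irrelevant step onto the text.
    return text + "; noop"


cases = [
    ("look around; take key; open door", "Keys open doors."),
    ("read sign; go north", "Signs give directions."),
]
raw_rate, summary_rate = sensitivity(toy_agent, cases, append_noise)
```

With a real model behind `agent`, a large gap between the two rates would be the signature the finding describes: behavior tracks the raw trace and is nearly indifferent to the summary.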
But raw-everything doesn't scale either, which is where the most interesting answer in the collection lives: don't choose globally, choose per outcome. SkillRL treats successful episodes as concrete demonstrations and failures as abstracted lessons (see: Should successful and failed episodes be processed differently?). The intuition is that a success is worth replaying move-for-move, because the exact path is the value, whereas a failure is mostly worth one transferable rule ('don't do this'), and keeping the full failed trajectory just burns context. This asymmetry mirrors how human experts reason, and it outperforms treating every episode the same way while using substantially less context.
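A minimal sketch of that per-outcome split, assuming each episode arrives with a success flag and, for failures, a short diagnosis. The `Episode` and `Memory` types and the `failure_note` field are hypothetical and are not taken from SkillRL itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Episode:
    task: str
    steps: List[str]        # full action trace
    success: bool
    failure_note: str = ""  # hypothetical one-line diagnosis for failures


@dataclass
class Memory:
    demonstrations: Dict[str, List[str]] = field(default_factory=dict)
    lessons: List[str] = field(default_factory=list)

    def add(self, ep: Episode) -> None:
        if ep.success:
            # Success: keep the exact path so it can be replayed step by step.
            self.demonstrations[ep.task] = list(ep.steps)
            return
        # Failure: keep one transferable rule and drop the trajectory.
        last_step = ep.steps[-1] if ep.steps else ep.task
        lesson = ep.failure_note or f"Avoid: {last_step}"
        if lesson not in self.lessons:
            self.lessons.append(lesson)


mem = Memory()
mem.add(Episode("open door", ["find key", "unlock", "push"], success=True))
mem.add(Episode("open door", ["kick door"], success=False,
                failure_note="Do not force locked doors"))
mem.add(Episode("cross river", ["wade in"], success=False))
```

After these three calls, `mem.demonstrations` holds the full three-step trace for the success, while the two failures are reduced to one line each in `mem.lessons`.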
The risk on the abstraction side gets named directly by work on evolving context: compress too eagerly and you get 'brevity bias' and context collapse, where each rewrite quietly erases detail until the playbook is hollow (see: Can context playbooks prevent knowledge loss during iteration?). The ACE framework's answer is to grow rules incrementally rather than rewrite-and-summarize, which is really a way of getting a compact, curated playbook without paying the forgetting tax that repeated compression imposes. A related instinct shows up in retrieval, where collapsing procedures into uniform chunks destroys the step-to-step structure that 'how-to' knowledge depends on. Logic units keep the prerequisites and the ordering intact instead of flattening them (see: How do logic units preserve procedural coherence better than chunks?).
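The difference between incremental growth and rewrite-and-summarize can be sketched as follows. This is an illustration under stated assumptions, not ACE's actual interface: `Playbook`, `apply_delta`, and `lossy_rewrite` are hypothetical names, and the "rewrite" is modeled crudely as keeping only the newest entries, standing in for an LLM summarization pass that drops detail.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Playbook:
    entries: Dict[str, str] = field(default_factory=dict)  # entry id -> rule

    def apply_delta(self, delta: Dict[str, str]) -> None:
        """Add or revise only the entries named in the delta; every other
        entry is left exactly as it was."""
        self.entries.update(delta)


def lossy_rewrite(playbook: Playbook, max_entries: int) -> Playbook:
    """Contrast case (assumed behavior): regenerate the whole playbook under
    a size budget, here by keeping only the most recently added entries."""
    items = list(playbook.entries.items())
    kept = items[-max_entries:] if max_entries > 0 else []
    return Playbook(dict(kept))


pb = Playbook()
pb.apply_delta({"r1": "Check token expiry before retrying a 401"})
pb.apply_delta({"r2": "Paginate list calls; the default page size is small"})
pb.apply_delta({"r3": "Validate currency codes before summing totals"})

rewritten = lossy_rewrite(pb, max_entries=2)
```

Here `pb` still contains `r1` after three updates, while `rewritten` has lost it. Repeating the rewrite on every iteration is the erosion pattern that context collapse describes.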
The deepest reframing is that 'raw vs. abstracted' is a special case of matching representation to task. StructRAG routes each query to whichever structure fits its cognitive demands (a table, a graph, an algorithm, a catalogue, or plain chunks) rather than forcing one format on everything (see: Can routing queries to task-matched structures improve RAG reasoning?). Read that way, the real driver isn't a philosophical preference for concrete or compact memory; it's whether the downstream task needs to *replay a specific path* (favor raw) or *recognize a recurring pattern* (favor a rule). The systems that win are the ones that keep both and decide case by case.
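A rough sketch of the routing idea. StructRAG is described as using a DPO-trained router; the keyword matcher below is a plainly simpler stand-in used only to show the shape of the decision, and the cue phrases in `CUES` are invented for illustration.

```python
from typing import Dict, Tuple

# Hypothetical cue phrases per structure type; a learned router would
# replace this table entirely.
CUES: Dict[str, Tuple[str, ...]] = {
    "table": ("compare", "versus", "difference between"),
    "graph": ("related to", "connected", "path from"),
    "algorithm": ("how do i", "steps to", "procedure"),
    "catalogue": ("list all", "overview of", "summarize"),
}


def route(query: str) -> str:
    """Pick a knowledge structure for the query, falling back to plain
    chunks when no cue phrase matches."""
    q = query.lower()
    for structure, cues in CUES.items():
        if any(cue in q for cue in cues):
            return structure
    return "chunks"


choices = [
    route("Compare the revenue of A versus B"),
    route("How do I reset the device?"),
    route("What year was the company founded?"),
]
```

The same pattern extends to memory: a router of this kind could return "raw trace" when the task needs a specific path replayed and "rule" when it needs a recurring pattern recognized.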
What you might not have expected to learn: the binding constraint here is often the model's own reading behavior, not disk or context budget. A summary that's technically correct but loses the load-bearing specifics will be ignored even when it's retrieved — so the question 'how much do we abstract?' is really 'how much can we abstract before the model stops believing it?'
Sources 5 notes
Across 10 LLMs and 9 environments, perturbing raw experience changed agent behavior significantly, while altering condensed experience had minimal effect. Three causes drive this asymmetry: summaries lose critical details, models favor immediate context over retrieved information, and pretrained knowledge reduces reliance on external experience.
SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.
The ACE framework treats contexts as evolving playbooks using generation-reflection-curation loops rather than full rewrites. This prevents knowledge loss from compression and detail erosion, achieving +10.6% on agentic tasks and +8.6% on finance without labeled supervision.
THREAD replaces chunks with four-part logic units—prerequisite, header, body, linker—enabling dynamic multi-step retrieval for how-to questions. Linkers explicitly navigate between steps and branches, addressing both the semantic-vs-task-relevance gap in embeddings and the sequential dependency loss in chunk-based RAG.
StructRAG demonstrates that selecting the knowledge structure type based on query demands, via a DPO-trained router that chooses among tables, graphs, algorithms, catalogues, and chunks, improves knowledge-intensive reasoning over standard retrieval. The approach grounds this in cognitive load theory and cognitive fit theory from cognitive science.