How much actionable detail does condensation strip from raw experience?
This explores what gets lost when an agent's full, raw experience is boiled down into summaries or memory — and whether that lost material is the part that actually changes behavior.
The corpus has a sharp, almost uncomfortable answer: condensation strips out exactly the part that drives action. In a study across 10 models and 9 environments, perturbing an agent's raw experience changed its behavior significantly, while perturbing the condensed summary had minimal effect (Why do LLM agents ignore condensed experience summaries?). The summaries weren't just shorter: they had quietly dropped the details specific enough to act on, so the model leaned back on raw context and pretrained priors instead. Condensation didn't compress the signal; it deleted it.
Why does this happen? One clue comes from how feedback decomposes. Natural experience carries two separate things: an evaluative signal (how well did that go?) and a directive one (what specifically should change?) (Can scalar rewards capture all the information in agent feedback?). Summarization tends to preserve the evaluative gist, such as "this approach worked", while discarding the directive specifics that tell you what to do differently next time. The actionable detail is the directive part, and it's the first casualty of abstraction. The same logic shows up in memory consolidation, which follows an inverted-U: a little helps, but as experience piles up, LLM-consolidated memory starts failing problems it had already solved (54% of them in one case) through "applicability stripping" and overfitting (Does agent memory degrade when continuously consolidated?). Applicability stripping is condensation's core failure named directly: the summary keeps the conclusion but loses the conditions under which it applies.
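To make applicability stripping concrete, here is a minimal sketch. The `Lesson` record, `summarize_to_gist`, and `applies` are hypothetical names invented for illustration, not taken from any of the cited notes. The sketch assumes a lesson can be split into a conclusion, a directive, and the conditions under which it holds.

```python
from dataclasses import dataclass, field


@dataclass
class Lesson:
    """Hypothetical memory record, split along the lines described above."""
    conclusion: str                                   # evaluative gist: what worked
    directive: str                                    # what to do differently next time
    conditions: dict = field(default_factory=dict)    # when the lesson applies


def summarize_to_gist(lesson: Lesson) -> Lesson:
    """Naive condensation: keep the conclusion, drop directive and conditions."""
    return Lesson(conclusion=lesson.conclusion, directive="", conditions={})


def applies(lesson: Lesson, situation: dict) -> bool:
    """A lesson applies when every recorded condition matches the situation.

    With no conditions recorded, this is vacuously true for any situation.
    """
    return all(situation.get(key) == value for key, value in lesson.conditions.items())


full = Lesson(
    conclusion="retrying fixed the failure",
    directive="retry with backoff",
    conditions={"error": "timeout"},
)
gist = summarize_to_gist(full)
```

In this toy version, `full` applies to a timeout but not to an auth error, while `gist` matches every situation because it has no conditions left to check. That is one plausible way a consolidated memory could start failing problems it had already solved: the stripped lesson fires where it never held.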
But compression isn't doomed, and this is the part you might not expect. The damage seems to come from *naive* condensation, not condensation itself. A reasoning model's raw thinking trace, used as-is, turns out to be a better context compressor than most purpose-built methods, because the act of reasoning already selects what matters (Can a reasoning model's thinking trace compress context effectively?). Push further and you can *train* compression to keep the actionable parts: reward-driven training that ties compression rate to whether the downstream task still succeeds produces compact traces that beat competitors by 17–23% F1 at 4x and 8x compression (Can thinking traces be made reliably budget-controllable?). The difference is that the objective explicitly punishes throwing away detail that mattered.
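One way to picture an objective that couples compression rate to downstream success is the toy reward below. It is a sketch under stated assumptions, not the reward used in the cited work: the function name, the linear shortfall penalty, and the `penalty` weight are all invented for illustration.

```python
def compression_reward(
    original_tokens: int,
    compressed_tokens: int,
    task_score: float,
    target_ratio: float,
    penalty: float = 1.0,
) -> float:
    """Toy reward: downstream task score minus a penalty for missing the target ratio.

    task_score is assumed to lie in [0, 1] (for example, F1 on the downstream task).
    target_ratio is the requested compression, e.g. 4.0 for 4x.
    """
    if original_tokens <= 0 or compressed_tokens <= 0:
        raise ValueError("token counts must be positive")
    if target_ratio <= 0:
        raise ValueError("target_ratio must be positive")

    achieved_ratio = original_tokens / compressed_tokens
    # Fraction of the requested compression that was NOT achieved (0 if met or exceeded).
    shortfall = max(0.0, target_ratio - achieved_ratio) / target_ratio
    return task_score - penalty * shortfall
```

Under this shaping, a trace that hits 8x compression but fails the task scores 0.0, while one that only reaches 4x against an 8x target but still solves the task scores 0.5. Brevity alone earns nothing, so discarding detail that the task needed is never the cheap option.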
The design lesson running across these is about *what* you condense, not *how much*. Step-level confidence filtering catches reasoning breakdowns that whole-trace averaging smooths over (Does step-level confidence outperform global averaging for trace filtering?): granularity preserves the failure signal that aggregation erases. And DeepAgent's autonomous memory folding avoids the degradation that plagues other consolidation by sorting interactions into structured episodic, working, and tool schemas rather than mashing them into one prose summary (Can agents compress their own memory without losing critical details?). So the answer to "how much does condensation strip?" is: nearly all of the actionable detail, *if* you condense by summarizing toward the gist, but very little if the condensation is structured, reward-grounded, or done at the granularity where the actionable signal actually lives.
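The contrast between step-level and whole-trace confidence can be sketched in a few lines. This is an illustrative simplification: the sliding-window minimum, the window size, and the function names are assumptions, and the per-step confidences are made-up numbers rather than results from the cited note.

```python
def global_confidence(step_confidences: list[float]) -> float:
    """Whole-trace score: the mean over all steps."""
    if not step_confidences:
        raise ValueError("trace has no steps")
    return sum(step_confidences) / len(step_confidences)


def min_window_confidence(step_confidences: list[float], window: int = 2) -> float:
    """Step-level score: the lowest mean over any run of `window` consecutive steps."""
    if not step_confidences:
        raise ValueError("trace has no steps")
    width = min(window, len(step_confidences))
    return min(
        sum(step_confidences[i:i + width]) / width
        for i in range(len(step_confidences) - width + 1)
    )


def keep_trace(step_confidences: list[float], threshold: float, window: int = 2) -> bool:
    """Keep a trace only if its weakest local window clears the threshold."""
    return min_window_confidence(step_confidences, window) >= threshold


# Made-up per-step confidences for two traces.
steady = [0.8, 0.8, 0.8, 0.8, 0.8, 0.8]
broken = [0.95, 0.95, 0.3, 0.3, 0.95, 0.95]   # collapses in the middle
```

With a threshold of 0.7, both traces pass on `global_confidence`, because the high-confidence steps in `broken` average away its collapse. Only `keep_trace` rejects `broken`, since its weakest window sits at 0.3. The mean is itself a condensation, and it discards exactly the local detail that signals failure.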
Sources (7 notes)
Across 10 LLM models and 9 environments, perturbing raw experience changed agent behavior significantly, while altering condensed experience had minimal effect. Three causes drive this asymmetry: summaries lose critical details, models favor immediate context over retrieved information, and pretrained knowledge reduces reliance on external experience.
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard the directive specifics that token-level distillation can recover, making the two complementary rather than redundant.
LLM-consolidated textual memory degrades as experience accumulates, eventually performing worse than episodic-only retention. GPT-5.4 failed 54% of previously-solved problems after consolidation, with three mechanisms identified: misgrouping, applicability stripping, and overfitting on narrow streams.
A reasoning model's raw thinking trace, used directly as shortened context, outperforms most dedicated compression methods without requiring specialized modules or compression-specific training. The mechanism that enables reasoning also produces usable input compression.
Reward-driven training that couples compression rate to downstream task quality elicits compact, controllable traces. At 4x and 8x compression, this approach beats competitors by 17–23% F1 and transfers across models.
Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves accuracy gains comparable to naive majority voting with far fewer generated traces, suggesting that trace quality matters more than quantity.
DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies. The autonomy and structure together avoid the degradation that plagues poorly designed consolidation.