Why do agents systematically underuse condensed experience in skill documents?
This explores why LLM agents lean on raw, blow-by-blow experience logs but largely ignore the tidy summaries we distill for them into skill documents — and what the corpus says is actually going on.
Read plainly, the question is: when we compress an agent's past interactions into clean skill writeups, why does the agent keep reaching for the messy raw record instead? The most direct evidence is striking: across 10 models and 9 environments, perturbing an agent's raw experience changed its behavior significantly, while altering the condensed version had minimal effect (Why do LLM agents ignore condensed experience summaries?). The study names three culprits: summaries quietly drop the details that actually drive decisions, models privilege whatever is sitting in immediate context over anything retrieved, and the model's own pretrained knowledge already covers enough that external advice feels redundant. So the underuse isn't laziness; condensed experience simply competes badly against three stronger pulls.
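The perturbation comparison described above can be sketched as a small measurement harness. This is a minimal illustration, not the study's protocol: the `toy_agent` policy, the `task -> action` entry format, and the string-reversal perturbation are all assumptions made up for the example. The toy agent is deliberately built to prefer raw entries, so the numbers it produces show only how the asymmetry would be measured, not that it exists.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical agent signature: (raw_log, condensed_skills, task) -> action
Agent = Callable[[Sequence[str], Sequence[str], str], str]


def corrupt(entries: Sequence[str]) -> List[str]:
    """Perturb experience by reversing each entry, which destroys its content."""
    return [entry[::-1] for entry in entries]


def lookup(entries: Sequence[str], task: str) -> Optional[str]:
    """Return the action of the first 'task -> action' entry matching the task."""
    for entry in entries:
        key, sep, action = entry.partition(" -> ")
        if sep and key == task:
            return action
    return None


def toy_agent(raw: Sequence[str], condensed: Sequence[str], task: str) -> str:
    """Toy policy (assumption): prefer the raw log, then skills, then a default."""
    return lookup(raw, task) or lookup(condensed, task) or "explore"


def change_rate(agent: Agent, tasks: Sequence[str], raw: Sequence[str],
                condensed: Sequence[str], perturb: str) -> float:
    """Fraction of tasks whose chosen action changes when one source is corrupted."""
    changed = 0
    for task in tasks:
        baseline = agent(raw, condensed, task)
        if perturb == "raw":
            after = agent(corrupt(raw), condensed, task)
        elif perturb == "condensed":
            after = agent(raw, corrupt(condensed), task)
        else:
            raise ValueError("perturb must be 'raw' or 'condensed'")
        changed += baseline != after
    return changed / len(tasks)


TASKS = ["open door", "find key", "light lamp"]
RAW = ["open door -> push handle", "find key -> check drawer",
       "light lamp -> flip switch"]
CONDENSED = ["open door -> use door", "find key -> search room",
             "light lamp -> use lamp"]

raw_sensitivity = change_rate(toy_agent, TASKS, RAW, CONDENSED, "raw")
condensed_sensitivity = change_rate(toy_agent, TASKS, RAW, CONDENSED, "condensed")
```

A large gap between `raw_sensitivity` and `condensed_sensitivity` is the signature the study reports. With a real agent, the `agent` callable would wrap a model call and the perturbation would be chosen to match the experiment being replicated.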
The corpus suggests the real fault line is *where and how* the condensing happens. Skills authored offline, written up after the fact and away from the task, strip out exactly the situated cues an agent needs to know when a skill applies. Work on in-loop skill creation makes this concrete: when skill-writing is invoked from inside the agent's own reasoning loop, grounded in the exact task context and immediate feedback, the resulting skills are used and even transfer to other agents with little loss (Does creating skills inside the agent loop eliminate mismatches?). The implication for your question: agents underuse condensed experience partly because the condensing severed it from the runtime context that made it actionable.
There's also a granularity story. Summaries that abstract too aggressively lose their grip, but the right unit of compression survives. Agent Workflow Memory shows that inducing reusable *sub-task* routines, finer-grained than whole-task summaries and with example-specific values abstracted away, produces large gains precisely because the routines stay executable rather than narrative (Can agents learn reusable sub-task routines from past experience?). Similarly, memory folding that consolidates interaction history into structured schemas (episodic, working, tool) avoids the degradation that plagues poorly designed consolidation (Can agents compress their own memory without losing critical details?). The pattern across these: condensation works when it preserves structure and executability, and fails when it flattens experience into prose an agent can safely skim past.
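To make "example-specific values abstracted away" concrete, here is a minimal sketch of turning one concrete trajectory into a reusable routine and then re-instantiating it for a new task. It is not Agent Workflow Memory's induction procedure. The function names `abstract_routine` and `instantiate`, the step strings, and the plain string-substitution approach are all assumptions chosen for illustration.

```python
from typing import Dict, List


def abstract_routine(steps: List[str], values: Dict[str, str]) -> List[str]:
    """Replace example-specific values in each step with named {placeholders}."""
    routine = []
    for step in steps:
        for name, value in values.items():
            step = step.replace(value, "{" + name + "}")
        routine.append(step)
    return routine


def instantiate(routine: List[str], bindings: Dict[str, str]) -> List[str]:
    """Fill a routine's placeholders with values for a new task."""
    return [step.format(**bindings) for step in routine]


# One concrete past trajectory (hypothetical web-shopping sub-task).
trajectory = [
    "type 'wireless mouse' into search box",
    "click result 'wireless mouse'",
    "set quantity to 2",
]

routine = abstract_routine(trajectory, {"item": "wireless mouse", "qty": "2"})
new_steps = instantiate(routine, {"item": "usb hub", "qty": "3"})
```

The routine stays a list of executable steps with slots rather than a prose summary, which is the property the paragraph above credits for the gains. Naive substitution like this would break on overlapping values or literal braces in a step, so a real system would need a more careful abstraction step.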
Step back and a deeper theme emerges: reliable agent behavior comes from externalizing memory and skills into a harness the model can lean on rather than re-derive (Where does agent reliability actually come from?). But externalization only pays off if the agent actually *consults* the external store, and a frozen agent reading static skill docs has little incentive to do so. That's why decoupling a trainable curator from a frozen executor matters: a learned curator shifts repositories away from generic, verbose, ignorable additions and toward actionable execution logic the agent will use (Can a separate trained curator improve skill libraries better than frozen agents?). Compare this to the older limitation where agents trained only on curated demonstrations stay capped by the curator's imagination, never learning from their own failures (Can agents learn beyond what their training data shows?). The same hazard reappears whenever someone hand-writes skill docs the agent had no part in producing.
The thing you might not have expected to learn: the underuse of condensed experience is less a memory-retrieval bug than a design verdict. Agents trust raw experience because it carries the situated detail, lives in immediate context, and was generated in the loop. The fix the corpus keeps pointing at isn't 'summarize better' — it's to keep skills executable, generate them in-context, fold memory into structured rather than narrative form, and let a learning curator decide what's worth keeping.
Sources (7 notes)
Across 10 LLMs and 9 environments, perturbing raw experience changed agent behavior significantly, while altering condensed experience had minimal effect. Three causes drive this asymmetry: summaries lose critical details, models favor immediate context over retrieved information, and pretrained knowledge reduces reliance on external experience.
MUSE-Autoskill demonstrates that invoking skill creation from within the agent's reasoning loop grounds new skills in exact task context, immediate feedback, and runtime validation. In-loop skills reach 87.94% task accuracy and transfer to other agents with minimal loss, eliminating the situated context problem of offline authoring.
Agent Workflow Memory induces sub-task routines at finer granularity than full tasks, abstracts example-specific values, and compounds them hierarchically. This produces 24.6% relative gain on Mind2Web and 51.1% on WebArena, with larger gains as train-test gaps widen.
DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.
Research shows reliable LLM agents externalize three cognitive burdens, namely memory (state persistence), skills (procedural components), and protocols (structured interaction), into a harness layer rather than relying on model scale alone. The harness unifies these externalized components and eliminates the need for the model to solve the same problems repeatedly.
SkillOS shows that separating a trainable curator from a frozen executor, grouped by task streams, causes skill repositories to shift from generic verbose additions toward actionable execution logic and cross-task meta-strategies. The trained curator generalizes across different executor backbones and domains.
Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.