INQUIRING LINE

Why do agents systematically underuse condensed experience in skill documents?

This explores why LLM agents lean on raw, blow-by-blow experience logs but largely ignore the tidy summaries we distill for them into skill documents — and what the corpus says is actually going on.


This question reads as: when we compress an agent's past interactions into clean skill writeups, why does the agent keep reaching for the messy raw record instead? The most direct evidence is striking: across 10 models and 9 environments, perturbing an agent's raw experience changed its behavior significantly, while altering the condensed version had minimal effect [Why do LLM agents ignore condensed experience summaries?]. The study names three culprits: summaries quietly drop the details that actually drive decisions, models privilege whatever is sitting in immediate context over anything retrieved, and the model's own pretrained knowledge already covers enough that external advice feels redundant. So the underuse isn't laziness; condensed experience simply competes badly against three stronger pulls.
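The perturbation comparison described above can be made concrete with a toy measurement. This is a minimal sketch, not the study's protocol: it assumes a hypothetical agent callable that takes a raw log, a condensed summary, and an observation, and it reports how often the chosen action changes when one of the two experience channels is swapped out. The names `behavior_shift` and `toy_agent` are illustrative, and the toy agent is deliberately written to ignore the summary so that the asymmetry is visible.

```python
from typing import Callable, Sequence

# Hypothetical agent signature (an assumption for illustration):
# (raw_experience, condensed_experience, observation) -> action
Agent = Callable[[str, str, str], str]


def behavior_shift(
    agent: Agent,
    raw: str,
    condensed: str,
    observations: Sequence[str],
    perturb_raw: bool,
    perturbed_text: str,
) -> float:
    """Fraction of observations on which the action changes after perturbing one channel."""
    if not observations:
        raise ValueError("need at least one observation")
    changed = 0
    for obs in observations:
        baseline = agent(raw, condensed, obs)
        if perturb_raw:
            perturbed = agent(perturbed_text, condensed, obs)
        else:
            perturbed = agent(raw, perturbed_text, obs)
        changed += baseline != perturbed
    return changed / len(observations)


def toy_agent(raw: str, condensed: str, obs: str) -> str:
    """Illustrative stand-in that only consults the raw log and ignores the summary."""
    return "click" if "click" in raw else "type"


observations = ["page-1", "page-2", "page-3"]
raw_shift = behavior_shift(
    toy_agent, "step 1: click login", "Log in first.", observations, True, "step 1: scroll"
)
condensed_shift = behavior_shift(
    toy_agent, "step 1: click login", "Log in first.", observations, False, "Never log in."
)
print(raw_shift, condensed_shift)
```

With a real agent, a large gap between the two numbers would be one rough signal that the condensed channel is being ignored; a single metric like this says nothing about which of the three causes is responsible.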

The corpus suggests the real fault line is *where and how* the condensing happens. Skills authored offline, written up after the fact and away from the task, strip out exactly the situated cues an agent needs to know when a skill applies. Work on in-loop skill creation makes this concrete: when skill-writing is invoked from inside the agent's own reasoning loop, grounded in the exact task context, immediate feedback, and runtime validation, the resulting skills are used and even transfer to other agents with little loss [Does creating skills inside the agent loop eliminate mismatches?]. The implication bears directly on the question: agents underuse condensed experience partly because the condensing severed it from the runtime context that made it actionable.

There's also a granularity story. Summaries that abstract too aggressively lose their grip, but the right unit of compression survives. Agent Workflow Memory shows that inducing reusable *sub-task* routines, finer-grained than whole-task summaries and with example-specific values abstracted away, produces large gains (a 24.6% relative gain on Mind2Web and 51.1% on WebArena) precisely because the routines stay executable rather than narrative [Can agents learn reusable sub-task routines from past experience?]. Similarly, memory folding that consolidates into structured schemas (episodic, working, tool) avoids the degradation that plagues sloppy consolidation [Can agents compress their own memory without losing critical details?]. The pattern across these: condensation works when it preserves structure and executability, and fails when it flattens experience into prose an agent can safely skim past.
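To illustrate what "abstracting example-specific values" into an executable routine might look like, here is a minimal sketch. It is not Agent Workflow Memory's actual induction procedure, which the notes here do not spell out; the `induce_routine` and `instantiate` helpers and the `{slot}` placeholder convention are assumptions made for illustration.

```python
from typing import Dict, List


def induce_routine(steps: List[str], values: Dict[str, str]) -> List[str]:
    """Replace concrete values in a recorded trajectory with named placeholders."""
    routine = []
    for step in steps:
        # Replace longer values first so a short value cannot clobber part of a longer one.
        for slot, value in sorted(values.items(), key=lambda kv: -len(kv[1])):
            step = step.replace(value, "{" + slot + "}")
        routine.append(step)
    return routine


def instantiate(routine: List[str], bindings: Dict[str, str]) -> List[str]:
    """Fill the placeholders with values for a new task."""
    return [step.format(**bindings) for step in routine]


# A recorded trajectory with example-specific values baked in (made-up data).
recorded = [
    "type 'wireless mouse' into search box",
    "click result 'wireless mouse'",
    "set quantity to 2",
]
routine = induce_routine(recorded, {"item": "wireless mouse", "qty": "2"})
print(routine)
print(instantiate(routine, {"item": "usb hub", "qty": "1"}))
```

The routine stays a list of executable steps rather than a narrative summary, which is the property the paragraph above credits for the gains. This naive string substitution would break on steps that contain literal braces or on values that overlap in unexpected ways; a real system would need a more robust representation.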

Step back and a deeper theme emerges: reliable agent behavior comes from externalizing memory and skills into a harness the model can lean on rather than re-derive [Where does agent reliability actually come from?]. But externalization only pays off if the agent actually *consults* the external store, and a frozen agent reading static skill docs has weak incentive to. That's why decoupling a trainable curator from a frozen executor matters: a learned curator shifts repositories away from generic, verbose, ignorable additions and toward actionable execution logic the agent will use [Can a separate trained curator improve skill libraries better than frozen agents?]. Compare this to the older limitation where agents trained only on curated demonstrations stay capped by the curator's imagination, never learning from their own failures [Can agents learn beyond what their training data shows?]. The same hazard reappears whenever someone hand-writes skill docs the agent had no part in producing.
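The curator/executor split can be sketched in a few lines. This is an illustration only: it swaps the *learned* curator described above for a much simpler accept-if-it-helps rule, and the `Executor` signature, `curate`, and `toy_executor` are all hypothetical names, not anything taken from SkillOS.

```python
from typing import Callable, List

# Hypothetical frozen executor (an assumption): (task, skills) -> True if the task succeeded.
Executor = Callable[[str, List[str]], bool]


def success_rate(executor: Executor, tasks: List[str], skills: List[str]) -> float:
    """Share of tasks the frozen executor completes with the given skill library."""
    if not tasks:
        raise ValueError("need at least one task")
    return sum(executor(t, skills) for t in tasks) / len(tasks)


def curate(
    executor: Executor, tasks: List[str], library: List[str], candidates: List[str]
) -> List[str]:
    """Keep a candidate skill only if it raises the frozen executor's success rate."""
    kept = list(library)
    for skill in candidates:
        before = success_rate(executor, tasks, kept)
        after = success_rate(executor, tasks, kept + [skill])
        if after > before:
            kept.append(skill)
    return kept


def toy_executor(task: str, skills: List[str]) -> bool:
    """Illustrative executor: succeeds only when some skill mentions the task by name."""
    return any(task in skill for skill in skills)


tasks = ["checkout", "login"]
candidates = [
    "Always be careful and think step by step.",  # generic advice, does not help
    "login: fill username, fill password, click submit",
    "checkout: open cart, click pay, confirm address",
]
library = curate(toy_executor, tasks, [], candidates)
print(library)
```

The executor never changes; only the decision about what enters the library does. Even this crude rule filters out the generic, ignorable entry, which is the direction of shift the paragraph above attributes to a trained curator.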

The thing you might not have expected to learn: the underuse of condensed experience is less a memory-retrieval bug than a design verdict. Agents trust raw experience because it carries the situated detail, lives in immediate context, and was generated in the loop. The fix the corpus keeps pointing at isn't 'summarize better' — it's to keep skills executable, generate them in-context, fold memory into structured rather than narrative form, and let a learning curator decide what's worth keeping.


Sources (7 notes)

Why do LLM agents ignore condensed experience summaries?

Across 10 LLM models and 9 environments, perturbing raw experience changed agent behavior significantly, while altering condensed experience had minimal effect. Three causes drive this asymmetry: summaries lose critical details, models favor immediate context over retrieved information, and pretrained knowledge reduces reliance on external experience.

Does creating skills inside the agent loop eliminate mismatches?

MUSE-Autoskill demonstrates that invoking skill creation from within the agent's reasoning loop grounds new skills in exact task context, immediate feedback, and runtime validation. In-loop skills reach 87.94% task accuracy and transfer to other agents with minimal loss, eliminating the situated context problem of offline authoring.

Can agents learn reusable sub-task routines from past experience?

Agent Workflow Memory induces sub-task routines at finer granularity than full tasks, abstracts example-specific values, and compounds them hierarchically. This produces 24.6% relative gain on Mind2Web and 51.1% on WebArena, with larger gains as train-test gaps widen.

Can agents compress their own memory without losing critical details?

DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.

Where does agent reliability actually come from?

Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.

Can a separate trained curator improve skill libraries better than frozen agents?

SkillOS shows that separating a trainable curator from a frozen executor, grouped by task streams, causes skill repositories to shift from generic verbose additions toward actionable execution logic and cross-task meta-strategies. The trained curator generalizes across different executor backbones and domains.

Can agents learn beyond what their training data shows?

Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about why agents underuse condensed experience in skill documents. The question remains open: what drives agents to prefer raw experience over curated summaries?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026. A library identified three mechanical reasons for underuse across 10 models and 9 environments (2026): summaries strip situated cues needed to decide *when* a skill applies; immediate context dominates over retrieved knowledge; pretrained model knowledge makes external advice feel redundant. Critically, agents DO use condensed skills when they are (a) generated in-loop rather than offline, grounded in runtime feedback (2026); (b) fine-grained sub-task routines, not narrative summaries (2024); (c) structured into episodic/working/tool schemas rather than flattened prose (2026). A learned curator decoupled from a frozen executor shifts skill repositories toward actionable logic (2026), whereas hand-authored skill docs incur the same hazard as frozen expert demonstrations — they lock agents to their curator's imagination (2026).

Anchor papers (verify; mind their dates):
- arXiv:2601.22436 *Large Language Model Agents Are Not Always Faithful Self-Evolvers* (2026-01)
- arXiv:2604.08224 *Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness E* (2026-04)
- arXiv:2605.06614 *SkillOS: Learning Skill Curation for Self-Evolving Agents* (2026-05)
- arXiv:2409.07429 *Agent Workflow Memory* (2024-09)

Your task:
(1) RE-TEST EACH CONSTRAINT. For the three mechanical barriers (cue-stripping, context bias, knowledge redundancy), check whether recent work on prompt engineering, multi-hop retrieval, or model scaling has relaxed them. Separately test whether in-loop generation, fine granularity, and learned curation remain necessary — or whether newer harness designs (memory systems, tool chaining, RAG improvements) have made offline skill documents viable again. State plainly what still holds and what may have shifted.
(2) Surface the strongest *disagreement* in the last 6 months: does any recent paper argue that raw experience *shouldn't* dominate, or that agents *do* reliably use summarized knowledge when it's packaged differently? Cite it.
(3) Propose 2 research questions that assume the regime may have moved: (a) If learned curation now reliably produces usable skill documents, does the agent still need access to raw experience, or is that a legacy safety valve? (b) Can a sub-task routine representation survive model switches (e.g., trained on Claude, deployed on o1), or does executability remain model-specific?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
