INQUIRING LINE

Can agents compress long trajectories without losing critical decision context?

This explores whether agents can shrink long histories of their own actions into something compact while keeping the parts that actually drove decisions — and the corpus suggests the answer depends less on compression ratio than on how the agent structures and triages what it keeps.


The corpus's clearest answer is that naive, uniform compression degrades performance, while structured, selective compression works; the gain comes from the structure rather than the shrinking. DeepAgent's "memory folding" consolidates raw interaction history into separate episodic, working, and tool schemas, which both cuts token overhead and lets the agent pause to reconsider strategy. The autonomy and the schema together are what avoid the degradation that sinks sloppier consolidation (Can agents compress their own memory without losing critical details?).
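To make the three-schema idea concrete, here is a minimal sketch of what folding a raw history into separate stores might look like. It is not DeepAgent's implementation: the `FoldedMemory` class, the step format, and the rule-based `fold` function are all hypothetical, and a real system would presumably have the model itself write the summaries.

```python
from dataclasses import dataclass, field


@dataclass
class FoldedMemory:
    """Hypothetical container with one store per schema."""
    episodic: list[str] = field(default_factory=list)    # decisions taken so far
    working: dict[str, str] = field(default_factory=dict)  # current short-term state
    tool: dict[str, dict[str, int]] = field(default_factory=dict)  # per-tool outcome counts


def fold(history: list[dict]) -> FoldedMemory:
    """Consolidate raw steps into the three stores.

    Assumed step format (illustrative only): a dict with 'kind' in
    {'decision', 'tool_call', 'note'}, plus 'text' for decisions and
    notes, or 'tool' and 'ok' for tool calls.
    """
    memory = FoldedMemory()
    for step in history:
        if step["kind"] == "decision":
            # Decisions are kept verbatim: they are the context worth replaying.
            memory.episodic.append(step["text"])
        elif step["kind"] == "tool_call":
            # Tool calls are reduced to success/failure counts per tool.
            stats = memory.tool.setdefault(step["tool"], {"ok": 0, "failed": 0})
            stats["ok" if step["ok"] else "failed"] += 1
        elif step["kind"] == "note":
            # Only the most recent note survives in working memory.
            memory.working["latest_note"] = step["text"]
    return memory


history = [
    {"kind": "decision", "text": "Chose API route over scraping"},
    {"kind": "tool_call", "tool": "search", "ok": True},
    {"kind": "tool_call", "tool": "search", "ok": False},
    {"kind": "note", "text": "Rate limit hit; retry later"},
]
folded = fold(history)
```

The point of the sketch is the separation: decisions, transient state, and tool experience are compressed by different rules instead of being summarized as one undifferentiated block.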

The most pointed insight is that not all trajectory content deserves the same treatment. SkillRL shows you get state-of-the-art results on hard tasks using far less context by processing successes and failures *differently*: successful episodes stay as concrete demonstrations, failures get abstracted into lessons (Should successful and failed episodes be processed differently?). That asymmetry is exactly what "critical decision context" means in practice: the decision points worth replaying verbatim aren't the same as the ones worth distilling into a rule. Uniform compression flattens that distinction and loses it.
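The asymmetry can be sketched in a few lines. This is an illustration of the idea only, not SkillRL's method: the `Episode` type, the `distill` function, and the string-template "lesson" are hypothetical stand-ins for what would in practice be a model-generated abstraction.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    task: str
    steps: list[str]
    succeeded: bool
    failure_reason: str = ""  # assumed to be filled in for failed episodes


def distill(episode: Episode) -> dict:
    """Keep successes concrete; reduce failures to a short lesson."""
    if episode.succeeded:
        # Successful episode: retain the full step sequence as a demonstration.
        return {"type": "demonstration", "task": episode.task, "steps": list(episode.steps)}
    # Failed episode: drop the steps and keep only an abstracted takeaway.
    return {
        "type": "lesson",
        "task": episode.task,
        "lesson": f"Avoid: {episode.failure_reason}",
    }


win = Episode("book flight", ["search", "compare", "pay"], succeeded=True)
loss = Episode("book flight", ["search", "pay", "error"], succeeded=False,
               failure_reason="paying before comparing fares")

kept_win = distill(win)
kept_loss = distill(loss)
```

Here `kept_win` still carries all three steps, while `kept_loss` carries none of them, only the lesson string.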

But there's a tension worth knowing about: other work argues trajectories resist compression for a reason. In-context learning of sequential decisions requires "trajectory burstiness": the model needs full or partial trajectories from the same environment, not isolated snippets, to generalize (Why do trajectories matter more than individual examples for in-context learning?). So compression that breaks a trajectory into disconnected examples destroys the very signal that made it useful. Taken together with the SkillRL result, the lesson is that you can compress *within* a coherent trajectory but should be wary of fragmenting *across* them.
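One way to respect that constraint when filling a limited context is to select whole trajectories from the matching environment and, when the budget runs out, truncate to a contiguous prefix rather than sampling scattered steps. The sketch below assumes a hypothetical `build_context` helper and a simple `(env_id, steps)` tuple format; neither comes from the cited work.

```python
def build_context(
    trajectories: list[tuple[str, list[str]]],
    env_id: str,
    max_steps: int,
) -> list[list[str]]:
    """Pick same-environment trajectories, keeping each one contiguous.

    Trajectories from other environments are skipped entirely. If the
    step budget runs out partway through a trajectory, its contiguous
    prefix is kept (a partial trajectory), never isolated steps.
    """
    context: list[list[str]] = []
    remaining = max_steps
    for traj_env, steps in trajectories:
        if traj_env != env_id or remaining <= 0:
            continue
        kept = steps[:remaining]  # contiguous prefix, order preserved
        context.append(kept)
        remaining -= len(kept)
    return context


trajectories = [
    ("maze", ["a1", "a2", "a3"]),
    ("kitchen", ["b1", "b2"]),
    ("maze", ["c1", "c2", "c3"]),
]
context = build_context(trajectories, "maze", max_steps=4)
# context == [["a1", "a2", "a3"], ["c1"]]
```

The first maze trajectory survives whole, the second is cut to a prefix, and the kitchen trajectory is excluded rather than mixed in.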

A different escape hatch is to not hold everything in the context window at all. The Thread Inference Model restructures reasoning as recursive subtask trees with rule-based KV-cache pruning, sustaining accurate reasoning beyond context limits even while manipulating 90% of the cache, in effect compressing by forgetting completed branches whose conclusions are already folded upward (Can recursive subtask trees overcome context window limits?). Relatedly, AgentFly treats the whole problem as memory operations over case, subtask, and tool modules, doing credit assignment through memory rather than weights and reaching 87.88% on GAIA validation, which is evidence that selective memory *is* the policy, not a lossy copy of it (Can agents learn continuously from experience without updating weights?).
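The "forget completed branches, keep their conclusions" idea can be illustrated on a plain tree. This is a data-structure analogy only, under the assumption that each finished subtask exposes a short conclusion; it does not touch a real KV cache, and the `Subtask` class and both helper functions are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Subtask:
    goal: str
    working_notes: list[str] = field(default_factory=list)
    children: list["Subtask"] = field(default_factory=list)
    conclusion: Optional[str] = None  # set once the subtask is finished


def prune_completed(node: Subtask) -> None:
    """Discard the internals of finished subtasks, keeping only conclusions."""
    if node.conclusion is not None:
        # Finished branch: its notes and children are no longer needed,
        # because the conclusion already summarizes them for the parent.
        node.working_notes.clear()
        node.children.clear()
        return
    for child in node.children:
        prune_completed(child)


def count_notes(node: Subtask) -> int:
    """Total working notes still held anywhere in the tree."""
    return len(node.working_notes) + sum(count_notes(c) for c in node.children)


root = Subtask(
    "plan trip",
    working_notes=["need flights and hotel"],
    children=[
        Subtask(
            "find flights",
            working_notes=["checked A", "checked B", "B is cheaper"],
            conclusion="book airline B",
        ),
        Subtask("find hotel", working_notes=["searching downtown"]),
    ],
)

before = count_notes(root)  # 5
prune_completed(root)
after = count_notes(root)   # 2
```

After pruning, the finished "find flights" branch holds no notes but still reports "book airline B", while the unfinished "find hotel" branch is left intact.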

The unifying frame, if you want one: reliability comes from externalizing memory, skills, and protocols into a harness rather than cramming them into the model's context (Where does agent reliability actually come from?). Compression done right isn't about losing less; it's about moving decision context into a structure where it survives. The less obvious takeaway: the failure mode isn't that agents forget too much, it's that they forget *uniformly*, treating a pivotal choice and a routine step as equally compressible.


Sources (6 notes)

Can agents compress their own memory without losing critical details?

DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Why do trajectories matter more than individual examples for in-context learning?

In-context learning for sequential decision-making requires full or partial trajectories from the same environment level, not just isolated examples. This structural property—trajectory burstiness—allows models to generalize across vastly different tasks without weight updates.

Can recursive subtask trees overcome context window limits?

The Thread Inference Model demonstrates that reasoning structured as recursive subtask trees with rule-based KV cache pruning sustains accurate reasoning beyond context limits, even when manipulating 90% of the cache. This enables single models to replace multi-agent systems by handling full recursive reasoning internally.

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.

Where does agent reliability actually come from?

Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains open: **Can agents compress long trajectories without losing critical decision context?** A curated library found — across papers from 2023–2026 — that the answer hinges on *structure*, not mere shrinking:

• Naive uniform compression degrades performance; DeepAgent's memory folding into episodic/working/tool schemas preserves decision context while cutting tokens (~2025).
• SkillRL shows asymmetric treatment works: successful episodes stay concrete, failures abstract into lessons — not all trajectory content deserves equal compression (~2026).
• Trajectory burstiness (in-context learning of sequential decisions) requires full or partial trajectories from the same environment; fragmenting breaks the signal (~2023).
• Thread Inference Model uses recursive subtask trees with rule-based KV-cache pruning, sustaining reasoning while manipulating 90% of the cache; AgentFly hits 87.88% on GAIA validation via memory-based operations (~2025–2026).
• The unifying insight: reliability comes from externalizing memory, skills, protocols into a harness; compression survives when decision context moves into structure, not stays in the model (~2026).

**Anchor papers (verify; mind their dates):**
- arXiv:2312.03801 (Generalization to New Sequential Decision Making Tasks with In-Context Learning, 2023)
- arXiv:2510.21618 (DeepAgent: A General Reasoning Agent with Scalable Toolsets, 2025)
- arXiv:2604.08224 (Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering, 2026)
- arXiv:2605.12978 (Useful Memories Become Faulty When Continuously Updated by LLMs, 2026)

**Your task:**
(1) **RE-TEST EACH CONSTRAINT.** For every finding above, judge whether newer models (e.g., o1, Claude 3.5), training methods (e.g., process reward models, on-policy RL), tooling (e.g., Anthropic Batch API, function calling harnesses), orchestration (memory managers, multi-agent state sharing), or evaluation protocols have since RELAXED or OVERTURNED it. Separate the durable question (likely still open: *what structure best preserves critical decisions?*) from the perishable limitation (possibly resolved by, e.g., longer context windows, better prompting, or smarter caching). Cite what resolved it.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months.** Has anyone shown uniform compression *does* work under specific conditions, or that externalizing memory introduces new failure modes?
(3) **Propose 2 research questions that ASSUME the regime may have moved:** e.g., "Do long-context models (e.g., Claude 200K) reduce the need for trajectory compression altogether?" or "Can learned compression (via LoRA or distillation) outperform rule-based memory externalization?"

**Guardrail:** Cite arXiv IDs; flag anything you cannot ground in a real paper.
