SYNTHESIS NOTE

Should successful and failed episodes be processed differently?

Explores whether asymmetric treatment of trajectories—preserving successes as full demonstrations while abstracting failures into lessons—could improve both the utility and efficiency of memory in reinforcement learning agents.

Synthesis note · 2026-05-18 · sourced from Reinforcement Learning

Existing memory-based RL methods primarily store raw trajectories. Raw trajectories are token-heavy and noise-saturated; storing them indiscriminately produces context pollution that degrades policy improvement. The alternative — uniform abstraction across all trajectories — destroys the specificity that makes the experience useful.

SkillRL (2602.08234) introduces differential processing as the load-bearing architectural choice. Successful episodes are preserved as full demonstrations — their specific action sequences are exactly what should be reused. Failed episodes are synthesized into concise failure lessons — the specifics of what went wrong don't transfer, but the abstracted lesson does. The asymmetry mirrors how human experts treat experience: remember concrete successes vividly, generalize failures into rules.
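The routing described above can be sketched as a single branch on episode outcome. This is a minimal illustration, not SkillRL's implementation: the `Trajectory`, `Demonstration`, and `FailureLesson` types, the `process_trajectory` function, and the `toy_summarizer` stand-in (which takes the place of whatever model writes the lesson) are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class Trajectory:
    task: str
    steps: List[str]  # ordered action strings for one episode
    success: bool


@dataclass
class Demonstration:
    task: str
    steps: List[str]  # kept verbatim so the action sequence can be reused


@dataclass
class FailureLesson:
    task: str
    lesson: str  # short abstracted text; the raw steps are discarded


MemoryEntry = Union[Demonstration, FailureLesson]


def process_trajectory(
    traj: Trajectory,
    summarize_failure: Callable[[Trajectory], str],
) -> MemoryEntry:
    """Route by outcome: keep successes whole, abstract failures."""
    if traj.success:
        return Demonstration(task=traj.task, steps=list(traj.steps))
    return FailureLesson(task=traj.task, lesson=summarize_failure(traj))


def toy_summarizer(traj: Trajectory) -> str:
    # Hypothetical stand-in for a model-written lesson (assumption).
    last = traj.steps[-1] if traj.steps else "no action"
    return f"On '{traj.task}', the attempt failed after {len(traj.steps)} steps; last action was '{last}'."


episodes = [
    Trajectory("heat egg", ["go to fridge", "take egg", "go to microwave", "heat egg"], True),
    Trajectory("heat egg", ["go to fridge", "take egg", "go to sink"], False),
]
memory = [process_trajectory(t, toy_summarizer) for t in episodes]
```

The point of the sketch is the asymmetry in what is stored: the success entry carries every step, while the failure entry carries one sentence regardless of how long the episode was.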

The two trajectory types feed a hierarchical SkillBank, partitioned into general skills (universal strategic guidance) and task-specific skills (task-level heuristics). The skill library co-evolves with the agent's policy through recursive failure analysis — each new RL iteration both refines the policy and updates the skill library based on what worked and what didn't.
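One way to picture the two-level partition is a container with a shared general tier and a per-task tier, where retrieval returns both. The sketch below is an assumption-laden illustration: the `SkillBank` class layout, its method names, and the `absorb_lessons` helper are hypothetical, and the policy update that runs alongside each library update is not modeled.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class SkillBank:
    # General skills apply to every task; task-specific skills are keyed by task type.
    general: List[str] = field(default_factory=list)
    task_specific: Dict[str, List[str]] = field(
        default_factory=lambda: defaultdict(list)
    )

    def add_general(self, skill: str) -> None:
        if skill not in self.general:
            self.general.append(skill)

    def add_task_skill(self, task_type: str, skill: str) -> None:
        if skill not in self.task_specific[task_type]:
            self.task_specific[task_type].append(skill)

    def retrieve(self, task_type: str) -> List[str]:
        # .get avoids creating an empty entry for an unseen task type.
        return self.general + list(self.task_specific.get(task_type, []))


def absorb_lessons(
    bank: SkillBank, lessons: List[Tuple[Optional[str], str]]
) -> None:
    """Fold one iteration's lessons into the bank.

    Each lesson is (task_type, text); task_type=None marks a general skill.
    """
    for task_type, text in lessons:
        if task_type is None:
            bank.add_general(text)
        else:
            bank.add_task_skill(task_type, text)


bank = SkillBank()
absorb_lessons(bank, [
    (None, "Verify preconditions before acting."),
    ("webshop", "Filter by price before opening product pages."),
])
prompt_skills = bank.retrieve("webshop")
```

Calling `absorb_lessons` once per RL iteration is the simplest reading of "co-evolves": the bank grows as the policy trains, and deduplication keeps repeated lessons from inflating the context.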

The differential-processing claim resolves a tension across the agent-memory literature. The related note "Does agent memory degrade when continuously consolidated?" shows that uniform consolidation regresses below baseline because the consolidation step strips applicability conditions. SkillRL's asymmetric treatment is the proposed fix: preserve raw episodes where the specifics matter (successes), abstract where they don't (failures-as-lessons). This is the third positive case for the condition-preservation hypothesis, alongside ReasoningBank (strategy-level distillation with conditions) and CLIN (causal abstractions preserving "may be necessary"). See ops/tensions/strategy-distillation helps when applicability conditions survive — and hurts when they are stripped.md.

The conceptual move is that abstraction is the right operation for some trajectory types and the wrong operation for others. Treating all experience the same — uniformly raw OR uniformly abstracted — is the failure mode. The right architecture differentiates by trajectory type, with the differentiation being driven by what each type actually contributes to future decision-making.

Empirically, SkillRL achieves state-of-the-art results on ALFWorld and WebShop while using substantially less context than raw-trajectory-based memory approaches. The compression comes from abstracting failures; the performance comes from preserving successes as demonstrations. Both halves of the asymmetry are doing work.

Update (2026-05-28): the topological expression of the success-side operation. FluxMem (2605.28773, "Rethinking Memory as Continuously Evolving Connectivity") implements the success-side step of the differential-processing principle as graph topology rather than as a skill library. Its Long-Term Consolidation stage clusters recurring successful trajectories and crystallizes them into stable procedural circuits: high-utility pathways that mature (monitored by a convergence metric) so that recurring tasks bypass redundant retrieval and directly activate the mature subgraph. This is SkillRL's "preserve successes as reusable demonstrations" claim recast on a heterogeneous memory graph: where SkillRL stores successful episodes as full demonstrations in a SkillBank, FluxMem stores them as crystallized connections between co-activated units. The agreement between the two systems is informative: two independently developed systems land on the same operation (durably encode recurring successes for direct reuse) through different data structures, which strengthens the case that the success/failure asymmetry is a structural requirement of self-evolving agent memory, not an artifact of one architecture.
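The graph-side version of "preserve recurring successes" can be illustrated with edge weights that strengthen each time a successful path recurs, plus a maturity check. This is a toy sketch and not FluxMem's algorithm: the `PathwayGraph` class, the saturating update rule, and the maturity test (last weight change below `eps`) are assumptions chosen for illustration, and the clustering of similar trajectories is not modeled (identical paths are assumed).

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


class PathwayGraph:
    """Toy memory graph: edges between consecutive units of successful paths."""

    def __init__(self, lr: float = 0.5, eps: float = 0.05) -> None:
        self.lr = lr    # how strongly one success reinforces an edge (assumed)
        self.eps = eps  # maturity threshold on the last weight change (assumed)
        self.weight: Dict[Edge, float] = {}
        self.last_delta: Dict[Edge, float] = {}

    def reinforce(self, path: List[str]) -> None:
        # Saturating update: each success moves the weight toward 1.0.
        for edge in zip(path, path[1:]):
            w = self.weight.get(edge, 0.0)
            delta = self.lr * (1.0 - w)
            self.weight[edge] = w + delta
            self.last_delta[edge] = delta

    def is_mature(self, path: List[str]) -> bool:
        # A path is mature when every edge's most recent change fell below eps.
        edges = list(zip(path, path[1:]))
        return bool(edges) and all(
            self.last_delta.get(e, float("inf")) < self.eps for e in edges
        )


graph = PathwayGraph()
path = ["search", "filter", "select", "buy"]
for _ in range(5):
    graph.reinforce(path)
use_direct_activation = graph.is_mature(path)
```

Under these assumed settings a path needs several repetitions before it counts as mature, which mirrors the idea that only recurring successes earn a durable shortcut past retrieval.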

Original note title

recursive skill-augmented RL applies differential processing to trajectories — successful episodes preserved as demonstrations while failures distilled into concise lessons