SYNTHESIS NOTE

Can agents learn continuously from experience without updating weights?

This explores whether LLM agents can adapt to new tasks and failures by retrieving past experiences from memory alone, rather than requiring expensive parameter fine-tuning or rigid hardcoded rules.

Synthesis note · 2026-02-23 · sourced from Memory

AgentFly addresses a central challenge: LLM agents either follow rigid hardcoded workflows (inflexible) or require parameter fine-tuning (expensive, impractical for continual adaptation). The alternative: learn continuously through memory, not weight updates.

The formalization is a Memory-augmented Markov Decision Process (M-MDP). The agent stores past trajectories as episodic traces — including both successes and failures — and retrieves similar past experiences to guide current decision-making. This aligns with case-based reasoning (CBR), a psychologically grounded learning strategy: humans often solve problems by recalling analogous past situations.
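One way to write this down, as a sketch consistent with the description above. The symbols are notation assumed here rather than quoted from the source: M is the current case bank, μ is the retrieval policy over cases, and p_LLM is the frozen language model conditioned on the state and the retrieved case.

```latex
\[
\text{M-MDP: } \langle \mathcal{S}, \mathcal{A}, P, R, \gamma, \mathcal{M} \rangle,
\qquad
\pi(a \mid s, M) \;=\; \sum_{c \in M} \mu(c \mid s, M)\, p_{\mathrm{LLM}}(a \mid s, c)
\]
```

Read this way, learning changes only μ and the contents of M; p_LLM stays fixed.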

Three memory modules serve distinct functions:

Case Memory — vectorized storage of prior task trajectories (task, plan, success/failure label). Supports retrieval via similarity-based search or an online-updating Q-function. This is the strategic memory: which approaches worked for which kinds of problems.
Subtask Memory — text-based storage of active subtasks and their execution results. Orchestrates the planner-executor interaction within a single task. This is the working memory: what's being done right now.
Tool Memory — text-based logs of tool interactions scoped per subtask. Records what tools were used, what they returned. This is the procedural memory: how specific operations were executed.
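The three modules can be sketched as plain data structures. This is a minimal illustration, not AgentFly's implementation: every class, field, and method name below is hypothetical, embeddings are assumed to come from some external text encoder, and retrieval uses simple cosine similarity.

```python
from dataclasses import dataclass, field
import math


@dataclass
class Case:
    """One Case Memory entry: a task, the plan tried, and its outcome."""
    task: str
    plan: str
    success: bool
    embedding: list[float]  # assumed to be produced by an external encoder


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class CaseMemory:
    """Strategic memory: vectorized cases, read by similarity search."""
    cases: list[Case] = field(default_factory=list)

    def write(self, case: Case) -> None:
        self.cases.append(case)

    def read(self, query: list[float], k: int = 2) -> list[Case]:
        ranked = sorted(
            self.cases, key=lambda c: cosine(query, c.embedding), reverse=True
        )
        return ranked[:k]


@dataclass
class SubtaskMemory:
    """Working memory: (subtask, result) text records for the current task."""
    entries: list[tuple[str, str]] = field(default_factory=list)

    def record(self, subtask: str, result: str) -> None:
        self.entries.append((subtask, result))


@dataclass
class ToolMemory:
    """Procedural memory: tool-call logs scoped per subtask."""
    logs: dict[str, list[str]] = field(default_factory=dict)

    def log(self, subtask: str, entry: str) -> None:
        self.logs.setdefault(subtask, []).append(entry)
```

Only Case Memory persists across tasks in this sketch; the other two would be reset at the start of each task, which matches their per-task scope in the description.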

The learning mechanism: credit assignment happens via memory rewriting (updating case labels and Q-values based on outcome), and policy improvement happens via memory reading (retrieving relevant cases that shift the planning distribution). No gradient updates to the LLM — the LLM is a fixed reasoning engine, and adaptation happens entirely through what's retrieved into its context.
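The read/write loop can be sketched as follows. This is a deliberately simplified illustration under stated assumptions: the note does not give the update rule, so the running-average update, the learning rate, and the epsilon-greedy read are all placeholders, and the Q-value here ignores the current state (a real retrieval policy would condition on it). All names are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class QCase:
    plan: str
    q: float = 0.0          # estimated usefulness of retrieving this case
    label: str = "unknown"  # outcome label, rewritten after each use


def rewrite(case: QCase, reward: float, lr: float = 0.5) -> None:
    """Memory rewriting: credit assignment updates the label and Q-value."""
    case.q += lr * (reward - case.q)
    case.label = "success" if reward > 0 else "failure"


def read(cases: list[QCase], k: int = 1, epsilon: float = 0.0, rng=random) -> list[QCase]:
    """Memory reading: choose the k cases placed in the frozen LLM's context."""
    if rng.random() < epsilon:
        return rng.sample(cases, k)  # occasional exploration
    return sorted(cases, key=lambda c: c.q, reverse=True)[:k]
```

No model parameters appear anywhere in this loop: the only state that changes is the case bank, which is the point of the design.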

The result: top-1 on GAIA validation (87.88% Pass@3) and 79.40% on the test set, in the deep research setting.

Relative to "Can agents learn from failure without updating their weights?", AgentFly provides the formal RL framework for that intuition: the M-MDP formalization shows how credit assignment and policy improvement can operate entirely through memory operations. The Q-function over cases provides a principled retrieval policy that improves with experience, rather than relying on static similarity-based retrieval.

Reweave 2026-05-18: memory versus fine-tuning is not binary; the right architecture is dual-timescale. AgentFly's original framing positioned memory-based adaptation as the alternative to fine-tuning: choose one. Late-2025 evidence reframes this as a false dichotomy. "Can agents adapt without pausing service to users?" shows that production systems can have both: memory-based adaptation on the fast timescale (zero downtime) and LoRA fine-tuning during user-inactive windows (no service interruption). MetaClaw's OMLS scheduler monitors sleep hours, keyboard inactivity, and calendar occupancy to identify safe windows for weight updates.

The implication for AgentFly's design: its case bank addresses the fast-timescale adaptation problem, but the underlying LLM policy weights remain static, so failures that require new capabilities (not just new cases) cannot be resolved by case-based retrieval alone. A dual-timescale architecture would extend AgentFly with idle-window fine-tuning that uses the accumulated case bank as training data. The case bank becomes both the retrieval store (fast adaptation) and the training dataset (slow weight updates). This is also what "Does agent memory degrade when continuously consolidated?" points toward: the right architecture preserves raw cases as first-class evidence but uses them deliberately for both retrieval and training, with explicit gating.
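A minimal sketch of the slow path, assuming a hypothetical gate. Nothing here is taken from AgentFly or MetaClaw: the field names, the rule "successful and retrieved at least `min_uses` times", and the idle flag are all illustrative placeholders for whatever gating a real system would use.

```python
from dataclasses import dataclass


@dataclass
class StoredCase:
    task: str
    plan: str
    success: bool
    uses: int  # how many times retrieval has surfaced this case


def training_set(bank: list[StoredCase], min_uses: int = 2) -> list[dict]:
    """Slow path: gate which raw cases become fine-tuning examples.

    The gate is a placeholder assumption. The bank itself is not modified,
    so raw cases stay available for fast retrieval.
    """
    return [
        {"prompt": c.task, "completion": c.plan}
        for c in bank
        if c.success and c.uses >= min_uses
    ]


def should_fine_tune(user_idle: bool, n_examples: int, min_examples: int = 2) -> bool:
    """Run the weight update only in an idle window with enough gated data."""
    return user_idle and n_examples >= min_examples
```

Keeping the export as a read-only view over the bank is the design choice that matters: retrieval and training draw on the same raw evidence, and the gate is the only place where selection happens.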

The corollary: when memory-based RL is presented as "no fine-tuning needed," that framing is correct for the deployment cost story but incomplete for the capability story. Fine-tuning during idle windows is essentially free in production cost terms, and addresses what memory-only systems cannot.

Related concepts in this collection 9

How does treating LLMs as multi-step agents change what we can optimize?
Instead of optimizing single prompt-response pairs, what happens when we model LLM agents as temporally-extended decision processes? The question matters because it shifts what becomes trainable.
AgentFly's M-MDP is one concrete instantiation of the broader POMDP paradigm the Agentic RL survey names; memory-as-RL-target generalizes beyond AgentFly's case-based formulation

Can agents learn better from their failures than successes?
Does storing reasoning strategies extracted from both successful and failed experiences improve agent learning compared to tracking only successes or raw trajectories? This matters because failures offer preventative lessons that successes alone cannot teach.
ReasoningBank generalizes AgentFly's case-based approach: AgentFly stores trajectories as cases, ReasoningBank abstracts trajectories into strategies; both reject parameter updates as the learning mechanism but disagree on what gets stored

Does agent memory degrade when continuously consolidated?
Can consolidating agent experiences into summaries actually harm long-term performance? Research on ARC-AGI tasks suggests continuous memory updates may reduce capability below the no-memory baseline.
warning relevant to AgentFly's case rewriting: when the rewriting mechanism is itself an LLM consolidation step, the inverted-U applies; AgentFly's similarity-based retrieval over raw cases may be partially safe because it skips abstraction

Can agents learn from failure without updating their weights?
Explores whether language models can improve through trial and error by storing reflections in episodic memory rather than fine-tuning. This matters because it suggests a fundamentally different path to agent adaptation.
AgentFly adds M-MDP formalization: credit assignment via memory rewriting, policy improvement via memory reading

Can agents learn new skills without forgetting old ones?
Explores whether externalized skill libraries (storing learned behaviors as retrievable code rather than parameter updates) can solve the catastrophic forgetting problem that plagues continual learning systems.
VOYAGER composes skills; AgentFly composes cases. Both achieve continual learning without parameter updates

Can careful selection of 78 demos outperform massive training datasets?
Does strategic curation of high-quality demonstrations unlock agentic capability more efficiently than scaling training data? LIMI achieved 73.5% on AgencyBench with 78 samples versus 10K+ samples for competing models, suggesting data quality may matter more than quantity.
AgentFly's case bank grows from experience; the efficiency principle suggests a small number of high-quality cases may suffice

How do agentic AI systems decompose into adaptation paradigms?
What are the core dimensions that distinguish different approaches to adapting agents and tools in agentic systems? Understanding this taxonomy could clarify which adaptation strategy fits which problem.
AgentFly is agent-optimized with execution-signaled feedback via memory rewriting

Can agents adapt without pausing service to users?
Can deployed LLM agents continuously improve their capabilities while serving users without interruption? This explores whether fast behavioral updates and slow policy learning can coexist across different timescales.
MetaClaw extends AgentFly's single-timescale memory-based adaptation with a second timescale (idle-window LoRA fine-tuning); it addresses what AgentFly cannot: improving the underlying policy weights, not just the retrievable case bank

Should agent memory adapt dynamically based on execution feedback?
Can agents improve performance by continuously reshaping memory connections in response to whether tasks succeed or fail, rather than relying on fixed retrieval pipelines? This matters because static memory degrades in changing environments.
exemplifies: FluxMem's execution-feedback link editing is the topological form of adapting memory from outcomes without parameter updates
