SYNTHESIS NOTE
Agentic Systems and Tool Use

Can agents learn continuously from experience without updating weights?

This note explores whether LLM agents can adapt to new tasks and failures by retrieving past experiences from memory alone, rather than relying on expensive parameter fine-tuning or rigid hardcoded rules.

Synthesis note · 2026-02-23 · sourced from Memory

AgentFly addresses a central challenge: LLM agents either follow rigid hardcoded workflows (inflexible) or require parameter fine-tuning (expensive, impractical for continual adaptation). The alternative: learn continuously through memory, not weight updates.

The formalization is a Memory-augmented Markov Decision Process (M-MDP). The agent stores past trajectories as episodic traces — including both successes and failures — and retrieves similar past experiences to guide current decision-making. This aligns with case-based reasoning (CBR), a psychologically grounded learning strategy: humans often solve problems by recalling analogous past situations.

Three memory modules serve distinct functions:

  1. Case Memory — vectorized storage of prior task trajectories (task, plan, success/failure label). Supports retrieval via similarity-based search or an online-updating Q-function. This is the strategic memory: which approaches worked for which kinds of problems.

  2. Subtask Memory — text-based storage of active subtasks and their execution results. Orchestrates the planner-executor interaction within a single task. This is the working memory: what's being done right now.

  3. Tool Memory — text-based logs of tool interactions scoped per subtask. Records what tools were used, what they returned. This is the procedural memory: how specific operations were executed.
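The three modules above can be pictured as three small data structures. The following is a minimal sketch, not AgentFly's actual code: every class, field, and method name here is a hypothetical choice made for illustration, and the embedding is assumed to be a plain list of floats produced elsewhere.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Case:
    """One prior task trajectory (hypothetical schema)."""
    task: str
    plan: str
    success: bool
    embedding: List[float]   # assumed to come from some external encoder
    q_value: float = 0.0     # used only if retrieval is Q-function based


@dataclass
class CaseMemory:
    """Strategic memory: which approaches worked for which kinds of problems."""
    cases: List[Case] = field(default_factory=list)

    def add(self, case: Case) -> None:
        self.cases.append(case)


@dataclass
class SubtaskMemory:
    """Working memory: active subtasks and their results within one task."""
    results: Dict[str, Optional[str]] = field(default_factory=dict)

    def open(self, subtask: str) -> None:
        self.results[subtask] = None          # None marks a pending subtask

    def close(self, subtask: str, result: str) -> None:
        self.results[subtask] = result

    def pending(self) -> List[str]:
        return [s for s, r in self.results.items() if r is None]


@dataclass
class ToolMemory:
    """Procedural memory: tool calls and outputs, scoped per subtask."""
    logs: Dict[str, List[Tuple[str, str]]] = field(default_factory=dict)

    def record(self, subtask: str, tool: str, output: str) -> None:
        self.logs.setdefault(subtask, []).append((tool, output))
```

Under this sketch, only CaseMemory persists across tasks; SubtaskMemory and ToolMemory would be recreated for each new task, which matches their description as within-task stores.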

The learning mechanism: credit assignment happens via memory rewriting (updating case labels and Q-values based on outcome), and policy improvement happens via memory reading (retrieving relevant cases that shift the planning distribution). No gradient updates to the LLM — the LLM is a fixed reasoning engine, and adaptation happens entirely through what's retrieved into its context.
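The read/write loop described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's algorithm: the note does not give the update rule, so the sketch assumes a simple incremental update that moves a case's Q-value toward the observed reward, and it assumes retrieval scores a case by cosine similarity plus a weighted Q-value bonus. The names `read`, `write`, `alpha`, and `beta` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def read(cases, query, k=1, beta=1.0):
    """Memory reading: return the k cases with the best similarity + Q score.

    The retrieved cases would be placed in the LLM's context; the LLM
    itself is never updated.
    """
    ranked = sorted(
        cases,
        key=lambda c: cosine(c["embedding"], query) + beta * c["q"],
        reverse=True,
    )
    return ranked[:k]


def write(case, reward, alpha=0.5):
    """Memory rewriting: update the case's label and Q-value from the outcome.

    Assumed rule: Q <- Q + alpha * (reward - Q), with reward in [0, 1].
    """
    case["q"] += alpha * (reward - case["q"])
    case["success"] = reward > 0
```

With two nearly identical cases, similarity alone always picks the slightly closer one. After `write` credits the other case with a successful outcome, `read` prefers it, which is the sense in which retrieval "improves with experience" while the model weights stay fixed.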

The result: top-1 on GAIA validation (87.88% Pass@3) and 79.40% on the test set, in the deep research setting.

The note "Can agents learn from failure without updating their weights?" raises this intuition, and AgentFly provides the formal RL framework for it: the M-MDP formalization shows how credit assignment and policy improvement can operate entirely through memory operations. The Q-function over cases provides a principled retrieval policy that improves with experience, rather than relying on static similarity-based retrieval.

Reweave 2026-05-18: memory versus fine-tuning is not binary; the right architecture is dual-timescale. AgentFly's original framing positioned memory-based adaptation as the alternative to fine-tuning, as if one had to choose. Late-2025 evidence reframes this as a false dichotomy. The note "Can agents adapt without pausing service to users?" shows that production systems can have both: memory-based adaptation on the fast timescale (zero downtime) and LoRA fine-tuning during user-inactive windows (no service interruption). MetaClaw's OMLS scheduler monitors sleep hours, keyboard inactivity, and calendar occupancy to identify safe windows for weight updates.

The implication for AgentFly's design: its case bank addresses the fast-timescale adaptation problem, but the underlying LLM policy weights remain static, so failures that require new capabilities (not just new cases) cannot be resolved by case-based retrieval alone. A dual-timescale architecture would extend AgentFly with idle-window fine-tuning that uses the accumulated case bank as training data. The case bank then serves as both the retrieval store (fast adaptation) and the training dataset (slow weight updates). The note "Does agent memory degrade when continuously consolidated?" points in the same direction: the right architecture preserves raw cases as first-class evidence but uses them deliberately for both retrieval and training, with explicit gating.
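The dual-timescale idea can be made concrete with a small gating sketch. This is a hypothetical illustration, not MetaClaw's scheduler or AgentFly's code: how the idle signals are combined, the thresholds, and all names (`IdleSignals`, `is_safe_window`, `select_training_cases`, `step`) are assumptions, and the fine-tuning call itself is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class IdleSignals:
    """Hypothetical snapshot of the three signals named in the text."""
    user_asleep: bool
    keyboard_idle_minutes: float
    calendar_busy: bool


def is_safe_window(s, min_idle_minutes=30.0):
    """Assumed rule: asleep, or idle long enough with a free calendar."""
    idle = s.keyboard_idle_minutes >= min_idle_minutes
    return s.user_asleep or (idle and not s.calendar_busy)


def select_training_cases(cases, min_q=0.5):
    """Explicit gate: copy out successful, high-Q cases for training.

    Copies are returned so the raw case bank stays intact as evidence.
    """
    return [dict(c) for c in cases if c["success"] and c["q"] >= min_q]


def step(case_bank, signals, trained_upto=0, min_new_cases=2):
    """Slow-timescale check, run alongside normal retrieval.

    Returns (training_batch, new_trained_upto). An empty batch means the
    system keeps adapting through retrieval only.
    """
    new_cases = case_bank[trained_upto:]
    if not is_safe_window(signals) or len(new_cases) < min_new_cases:
        return [], trained_upto
    return select_training_cases(new_cases), len(case_bank)
```

The design choice worth noting is that retrieval never waits on `step`: the case bank is read on every task, while a training batch is only produced when a safe window and enough new evidence coincide.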

The corollary: when memory-based RL is presented as "no fine-tuning needed," that framing is correct for the deployment cost story but incomplete for the capability story. Fine-tuning during idle windows is essentially free in production cost terms, and addresses what memory-only systems cannot.

Original note title

memory-based online reinforcement learning enables continual agent adaptation without fine-tuning through episodic case-based reasoning