Can episodic memory alone enable learning without parameter updates?
This note explores whether an agent can genuinely learn, meaning improve at tasks over time, using only an external memory store while its underlying model weights stay frozen. The corpus answer is a fairly emphatic yes: several systems show real, measurable improvement with zero parameter updates, and the more interesting finding is that *how* the memory is shaped matters more than the fact of having one.
The clearest existence proofs come from agent systems that route all learning through memory operations. AgentFly reframes learning itself as a problem of credit assignment over memory rather than over weights, and reaches 87.88% on GAIA validation without modifying the model's parameters (see "Can agents learn continuously from experience without updating weights?"). Reflexion shows the simplest version of the loop: an agent fails, writes itself a verbal note about why, and reads that note on the next attempt. It finds that an unambiguous success/failure signal is what keeps those reflections honest rather than rationalized (see "Can agents learn from failure without updating their weights?"). So far, episodic memory alone clearly *can* drive learning.
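The fail, reflect, retry loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not Reflexion's actual implementation: `EpisodicMemory`, `frozen_policy`, and `run_trials` are hypothetical names, the "model" is a fixed function standing in for a frozen LLM, and the task (find a hidden number) is invented purely to make the binary success signal concrete.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class EpisodicMemory:
    """Hypothetical external store: the only thing that changes across trials."""
    notes: List[str] = field(default_factory=list)

    def write(self, note: str) -> None:
        self.notes.append(note)

    def read(self) -> List[str]:
        return list(self.notes)


def frozen_policy(notes: List[str]) -> int:
    """Stand-in for a frozen model: its logic never changes.

    It only conditions on the notes it is given, picking the first
    action (0..9) that no earlier note has ruled out.
    """
    ruled_out = {int(note.split()[-1]) for note in notes}
    for candidate in range(10):
        if candidate not in ruled_out:
            return candidate
    return 0


def run_trials(target: int, max_trials: int = 10) -> Tuple[Optional[int], EpisodicMemory]:
    """Retry a task, writing a verbal note after every failure."""
    memory = EpisodicMemory()
    for trial in range(1, max_trials + 1):
        action = frozen_policy(memory.read())
        success = action == target  # unambiguous binary feedback
        if success:
            return trial, memory
        # The reflection is stored uncompressed and read back on the next trial.
        memory.write(f"avoid action {action}")
    return None, memory
```

Calling `run_trials(target=3)` succeeds on the fourth trial with three notes in memory. The policy function is identical on every call, so all of the improvement is attributable to what was written to the store.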
But the corpus immediately complicates 'alone.' The shape of the stored experience turns out to be decisive. Storing memories as causal abstractions, which record not just what happened but the conditions under which an action applies, beats generic reflection by 23 points on repeated trials and, crucially, gains 4–17 points when transferred to environments the agent never trained on (see "Can frozen language models continually improve through memory structure alone?"). SkillRL pushes the same idea: successes and failures shouldn't be stored the same way. Keep wins as concrete demonstrations and distill losses into abstract lessons (see "Should successful and failed episodes be processed differently?"). And VOYAGER shows memory can hold executable skills that compound into harder skills over a lifetime, sidestepping the catastrophic forgetting that weight updates cause (see "Can agents learn new skills without forgetting old ones?"). The lesson across these: 'episodic memory' is doing a lot of quiet work. Raw logs of past episodes don't produce learning; structured, abstracted, differentially processed memory does.
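To make "structured, differentially processed memory" concrete, here is a minimal sketch that combines two of the ideas above: entries carry applicability conditions, and successes and failures are stored in different forms. Everything in it is an assumption for illustration (`MemoryEntry`, `store_episode`, `retrieve`, and the exact-match condition check are hypothetical); none of it is taken from the SkillRL or causal-memory systems themselves.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MemoryEntry:
    kind: str                   # "demonstration" (from a success) or "lesson" (from a failure)
    conditions: Dict[str, str]  # when this entry applies
    content: str


def store_episode(
    memory: List[MemoryEntry],
    conditions: Dict[str, str],
    trajectory: List[str],
    success: bool,
    lesson: str = "",
) -> None:
    """Store wins and losses asymmetrically.

    Successes keep the concrete action sequence; failures keep only
    an abstracted lesson supplied by the caller.
    """
    if success:
        memory.append(MemoryEntry("demonstration", conditions, " -> ".join(trajectory)))
    else:
        memory.append(MemoryEntry("lesson", conditions, lesson))


def retrieve(memory: List[MemoryEntry], situation: Dict[str, str]) -> List[MemoryEntry]:
    """Return only entries whose applicability conditions all hold now."""
    return [
        entry
        for entry in memory
        if all(situation.get(key) == value for key, value in entry.conditions.items())
    ]
```

For example, after storing a successful "find key -> unlock door" episode under `{"door": "locked"}` and a failure lesson under `{"door": "open"}`, calling `retrieve` with a situation containing `"door": "locked"` returns only the demonstration. A real system would likely replace exact matching with something softer, such as embedding similarity; that choice is left open here.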
A less obvious consequence: this reframes forgetting itself as an *allocation* problem rather than an unavoidable cost. Fast-Slow Training splits adaptation into slow weights and fast textual context and reaches equivalent performance 1.4–3x faster with substantially less forgetting, arguing that forgetting is misallocation, not destiny (see "Can splitting adaptation into two channels reduce forgetting?"). That dovetails with evidence that even when you *do* update weights, RL touches only 5–30% of parameters, in a structured subnetwork that is nearly identical across random seeds (see "Does reinforcement learning update only a small fraction of parameters?"). This hints that much of what looks like 'learning' may be a small, externalizable adjustment that memory can stand in for.
Two deeper notes round out the territory. First, memory-as-learning can emerge without anyone designing it: RL agents provably offload information into their spatial environment, using the world as external memory just by optimizing reward (see "Do RL agents accidentally use environments as memory?"). Second, what counts as a useful memory unit matters: in-context learning of sequential decisions needs whole or partial trajectories from the same setting, not isolated examples, to generalize without weight updates (see "Why do trajectories matter more than individual examples for in-context learning?"). Together the corpus suggests episodic memory alone *is* enough to learn, but only once you stop treating memory as a transcript and start treating it as a structured, curated artifact.
Sources (9 notes)
AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.
Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.
Agents using causal-form memory (preserving applicability conditions) outperform generic reflection by 23 points on repeated trials and gain 4-17 points transferring to new environments, showing memory shape matters more than parameter updates.
SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.
VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.
Fast-Slow Training routes task-specific lessons into optimized prompts while keeping parameter updates minimal, reaching equivalent performance 1.4–3x faster with substantially less catastrophic forgetting and plasticity loss, demonstrating that forgetting is a misallocation problem rather than an inherent cost.
Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.
A mathematical proof shows that environmental artifacts reduce the information needed to represent history in RL agents. Path-following agents naturally develop memory-like behavior through standard reward optimization, satisfying situated cognition criteria without explicit memory objectives.
In-context learning for sequential decision-making requires full or partial trajectories from the same environment level, not just isolated examples. This structural property—trajectory burstiness—allows models to generalize across vastly different tasks without weight updates.