Can agents learn from failure without updating their weights?

Explores whether language models can improve through trial and error by storing reflections in episodic memory rather than fine-tuning. This matters because it suggests a fundamentally different path to agent adaptation.

Synthesis note · 2026-02-22 · sourced from Reasoning by Reflection

Reflexion demonstrates a specific version of the external-feedback principle at system scale: when an agent has access to unambiguous binary feedback from the environment (success = 1, failure = 0), it can write verbal reflections summarizing what went wrong and how to avoid it. These reflections persist in episodic memory across episodes. The agent improves not through gradient descent but through memory accumulation.

The binary reward design is deliberate and consequential. A richer reward model would allow the agent to rationalize partial performance — finding reasons why a partial failure was acceptable. The binary signal eliminates this: the environment says success or failure, with no room for self-serving gradations. The model must genuinely diagnose what went wrong to write a useful reflection.

Two hallucination types receive precise operational definitions: consecutive identical actions in an environment that responded identically (stuck loop) and trajectories exceeding 30 actions without reaching a successful state (inefficient planning). Both are detectable signatures that trigger termination and reflection, rather than indefinite continuation.

The method requires two components: a heuristic for when to terminate and trigger reflection, and a binary reward signal from the environment. This is a low-data-requirement architecture: no fine-tuning, no labeled training set, just a success/fail signal and the model's ability to generate natural language diagnoses.

The key distinction from internal self-revision: Reflexion's reflection is grounded in actual environmental outcomes, not the model's assessment of its own outputs. This is why it works where internal self-assessment does not. The environment provides an independent ground truth the model cannot rationalize away.

A second reason Reflexion works — visible only in 2025 hindsight. Reflexion writes reflections to episodic memory and retrieves them in subsequent episodes. It does not periodically recompress its reflections into more abstract lessons. Late-2025 evidence makes this design choice load-bearing: Does agent memory degrade when continuously consolidated? shows that LLM-driven consolidation regresses below the no-memory baseline on controlled benchmarks, and Why do LLM agents ignore condensed experience summaries? shows that agents systematically ignore abstracted memory even when it's the only memory provided. Reflexion sidesteps both failure modes because each reflection stays scoped to its triggering episode rather than being merged into a global summary, and because reflections retain enough textual specificity for the agent to use them as raw episodes rather than as condensed heuristics. The architectural simplicity that initially looked like a limitation — no consolidation step, no abstraction pass — turns out to be the property that makes it work.

AgentFly M-MDP formalization (2508.16153): AgentFly extends episodic memory-based learning into a formal RL framework — the Memory-augmented Markov Decision Process (M-MDP). The agent stores past trajectories (successes and failures) in three specialized memory modules: case memory (vectorized prior trajectories with Q-values for retrieval), subtask memory (active tasks and results), and tool memory (per-subtask tool interaction logs). Credit assignment occurs via memory rewriting (updating case labels and Q-values based on outcomes), and policy improvement occurs via memory reading (retrieving relevant cases shifts the planning distribution). The Q-function over cases provides a principled retrieval policy that improves with experience — moving beyond Reflexion's simpler similarity-based episodic retrieval toward learned case selection. AgentFly achieves top-1 on GAIA validation (87.88% Pass@3) in the deep research setting, demonstrating that memory-based RL can match or exceed fine-tuning-based approaches. See Can agents learn continuously from experience without updating weights?.

SDPO as the gradient-based analog (2601.20802): Reflexion converts environment feedback into stored verbal reflections used at the next rollout — a memory-update mechanism. Self-Distillation Policy Optimization (SDPO) converts environment feedback into gradient-distilled improvements to the policy weights — a parameter-update mechanism. Both reject the scalar reward as load-bearing; both treat rich environment signal as already containing the teaching; both leverage the model's in-context retrospection capability (Reflexion: explicit verbal reflection on what went wrong; SDPO: the policy conditioned on feedback as self-teacher). The pair frames a design choice: when environment feedback is rich enough to retrospect on, do you store it as episodic memory (Reflexion) or distill it into weights (SDPO)? Storage avoids parameter changes but accumulates context cost; distillation avoids context cost but commits the update to weights. See Can environment feedback replace scalar rewards in policy learning?.

Inquiring lines that use this note as a source 122

This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.

Related concepts in this collection 5

This note in its neighbourhood — explore the map, then jump to a related concept in the list below.

Concept map

19 direct connections · 158 in 2-hop network ·medium cluster Open in graph ↗

Can agents learn from failure without updating t… Does agent memory degrade when continuously consol… Why do LLM agents ignore condensed experience summ… Does revising your own reasoning actually help or … Do models fail worse when their own errors fill th… Does a model improve by arguing with itself?

Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph

your link semantically near linked from elsewhere

Does agent memory degrade when continuously consolidated? Can consolidating agent experiences into summaries actually harm long-term performance? Research on ARC-AGI tasks suggests continuous memory updates may reduce capability below the no-memory baseline.
late-2025 empirical case for why Reflexion's *non-consolidation* of reflections is the load-bearing design choice, not the reflection itself
Why do LLM agents ignore condensed experience summaries? LLM agents faithfully learn from raw experience but systematically disregard condensed summaries of the same experience. This study investigates whether the problem lies in how summaries are made, how models process them, or whether models simply don't need them.
convergent finding: Reflexion's raw-episodic reflections survive the faithfulness asymmetry that ignores abstracted lessons
Does revising your own reasoning actually help or hurt? Self-revision in reasoning models often degrades accuracy, while external critique improves it. Understanding what makes revision helpful or harmful could reshape how we design systems that need to correct themselves.
Reflexion is the working prototype of this principle: environment = external critic, binary reward = unambiguous signal
Do models fail worse when their own errors fill the context? As a model's prior mistakes accumulate in context, does subsequent accuracy degrade predictably? And can scaling or architectural changes prevent this self-contamination effect?
Reflexion works against this: episodic memory provides targeted failure analysis rather than accumulating raw error history that amplifies future errors
Does a model improve by arguing with itself? When models revise their own reasoning in response to self-generated criticism, do they converge on better answers or worse ones? And how does that compare to challenge from other models?
Reflexion is an architectural solution to degeneration-of-thought: by grounding reflection in binary environmental outcomes rather than self-assessment, it avoids the pattern where internal self-revision amplifies confidence in wrong answers

Can agents learn from failure without updating their weights?

Related concepts in this collection 5

Related papers in this collection 8

Search by related questions 4