SYNTHESIS NOTE

Can agents learn from failure without updating their weights?

Explores whether language models can improve through trial and error by storing reflections in episodic memory rather than fine-tuning. This matters because it suggests a fundamentally different path to agent adaptation.

Synthesis note · 2026-02-22 · sourced from Reasoning by Reflection

Reflexion demonstrates a specific version of the external-feedback principle at system scale: when an agent has access to unambiguous binary feedback from the environment (success = 1, failure = 0), it can write verbal reflections summarizing what went wrong and how to avoid it. These reflections persist in episodic memory across episodes. The agent improves not through gradient descent but through memory accumulation.

The binary reward design is deliberate and consequential. A richer reward model would allow the agent to rationalize partial performance — finding reasons why a partial failure was acceptable. The binary signal eliminates this: the environment says success or failure, with no room for self-serving gradations. The model must genuinely diagnose what went wrong to write a useful reflection.

Two failure modes receive precise operational definitions: hallucination, detected as consecutive identical actions that draw an identical response from the environment (a stuck loop), and inefficient planning, detected as a trajectory exceeding 30 actions without reaching a successful state. Both are detectable signatures that trigger termination and reflection rather than indefinite continuation.
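The two termination signatures described above can be checked mechanically. The following is a minimal sketch, not the Reflexion authors' code: the function name `termination_reason`, the `(action, observation)` step format, and the choice to flag a loop after just two identical consecutive steps are all assumptions made for illustration. Only the 30-action threshold comes from the note.

```python
from typing import List, Optional, Tuple

# (action, observation) pair; this step format is an assumption for the sketch.
Step = Tuple[str, str]

MAX_ACTIONS = 30  # trajectory-length threshold stated in the note


def termination_reason(
    trajectory: List[Step], max_actions: int = MAX_ACTIONS
) -> Optional[str]:
    """Return a label for the detected failure signature, or None to continue."""
    # Stuck loop: the same action produced the same observation twice in a row.
    # Flagging after two repeats is an assumed threshold.
    if len(trajectory) >= 2 and trajectory[-1] == trajectory[-2]:
        return "stuck_loop"
    # Inefficient planning: more than max_actions steps without success.
    if len(trajectory) > max_actions:
        return "inefficient_planning"
    return None
```

A caller would invoke this after every step of an unfinished episode and, on any non-None result, stop the episode and hand the trajectory to the reflection step.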

The method requires two components: a heuristic for when to terminate and trigger reflection, and a binary reward signal from the environment. This is a low-data-requirement architecture: no fine-tuning, no labeled training set, just a success/fail signal and the model's ability to generate natural language diagnoses.
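The trial loop implied by this description can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the published implementation: `run_trials`, `act`, `evaluate`, and `reflect` are hypothetical names, the LLM calls are replaced by deterministic toy functions so the example runs on its own, and the within-episode termination heuristic is assumed to be folded into `evaluate`.

```python
from typing import Callable, List, Optional, Tuple


def run_trials(
    act: Callable[[List[str]], str],   # hypothetical actor: reflections -> attempt
    evaluate: Callable[[str], int],    # environment: attempt -> 1 (success) or 0 (failure)
    reflect: Callable[[str], str],     # hypothetical reflector: failed attempt -> lesson
    max_trials: int = 5,
) -> Tuple[Optional[int], List[str]]:
    """Retry a task, carrying verbal reflections forward instead of updating weights."""
    memory: List[str] = []             # episodic memory, persists across trials
    for trial in range(1, max_trials + 1):
        attempt = act(memory)          # the actor conditions on all stored reflections
        if evaluate(attempt) == 1:     # binary signal only, no partial credit
            return trial, memory
        memory.append(reflect(attempt))
    return None, memory


# Toy stand-ins for the LLM calls (assumptions for the demo only).
CANDIDATES = ["drawer", "shelf", "cabinet"]


def toy_act(memory: List[str]) -> str:
    # Skip any location that a stored reflection has already ruled out.
    ruled_out = {m.split()[-1] for m in memory}
    return next(c for c in CANDIDATES if c not in ruled_out)


def toy_reflect(attempt: str) -> str:
    return f"The key was not in the {attempt}"


solved_at, memory = run_trials(toy_act, lambda a: int(a == "cabinet"), toy_reflect)
print(solved_at, memory)
```

The only state that changes between trials is the `memory` list, which is the sense in which the architecture needs no fine-tuning and no labeled training set.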

The key distinction from internal self-revision: Reflexion's reflection is grounded in actual environmental outcomes, not the model's assessment of its own outputs. This is why it works where internal self-assessment does not. The environment provides an independent ground truth the model cannot rationalize away.

A second reason Reflexion works is visible only in 2025 hindsight. Reflexion writes reflections to episodic memory and retrieves them in subsequent episodes; it does not periodically recompress them into more abstract lessons. Late-2025 evidence makes this design choice load-bearing: "Does agent memory degrade when continuously consolidated?" shows that LLM-driven consolidation regresses below the no-memory baseline on controlled benchmarks, and "Why do LLM agents ignore condensed experience summaries?" shows that agents systematically ignore abstracted memory even when it is the only memory provided. Reflexion sidesteps both failure modes because each reflection stays scoped to its triggering episode rather than being merged into a global summary, and because reflections retain enough textual specificity for the agent to use them as raw episodes rather than as condensed heuristics. The architectural simplicity that initially looked like a limitation (no consolidation step, no abstraction pass) turns out to be the property that makes it work.

AgentFly M-MDP formalization (2508.16153): AgentFly extends episodic memory-based learning into a formal RL framework, the Memory-augmented Markov Decision Process (M-MDP). The agent stores past trajectories (successes and failures) in three specialized memory modules: case memory (vectorized prior trajectories with Q-values for retrieval), subtask memory (active tasks and results), and tool memory (per-subtask tool interaction logs). Credit assignment occurs via memory rewriting (updating case labels and Q-values based on outcomes), and policy improvement occurs via memory reading (retrieving relevant cases shifts the planning distribution). The Q-function over cases provides a principled retrieval policy that improves with experience, moving beyond Reflexion's simpler similarity-based episodic retrieval toward learned case selection. AgentFly achieves top-1 on GAIA validation (87.88% Pass@3) in the deep research setting, demonstrating that memory-based RL can match or exceed fine-tuning-based approaches. See "Can agents learn continuously from experience without updating weights?"
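The read/rewrite cycle over case memory can be illustrated with a toy store. This is a sketch under loud assumptions, not AgentFly's implementation: the class and method names are hypothetical, the retrieval score (cosine similarity plus the stored Q-value) and the update rule (a step of size `alpha` toward the observed binary reward) are simplifications chosen for illustration, and the two-dimensional embeddings are made up.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Case:
    embedding: Sequence[float]  # vectorized task description (toy values here)
    trajectory: str             # stored prior trajectory
    q: float = 0.0              # learned usefulness of retrieving this case


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class CaseMemory:
    def __init__(self, alpha: float = 0.5) -> None:
        self.cases: List[Case] = []
        self.alpha = alpha  # assumed learning rate for Q updates

    def write(self, case: Case) -> None:
        self.cases.append(case)

    def read(self, query: Sequence[float], k: int = 1) -> List[Case]:
        # Memory reading: rank by similarity plus Q (assumed scoring rule).
        ranked = sorted(
            self.cases,
            key=lambda c: cosine(query, c.embedding) + c.q,
            reverse=True,
        )
        return ranked[:k]

    def rewrite(self, case: Case, reward: int) -> None:
        # Memory rewriting: move Q toward the binary outcome (assumed update rule).
        case.q += self.alpha * (reward - case.q)


store = CaseMemory()
plan_a = Case([1.0, 0.0], "plan A")
plan_b = Case([0.9, 0.1], "plan B")
store.write(plan_a)
store.write(plan_b)

query = [1.0, 0.0]
first = store.read(query)[0]   # before any feedback, similarity alone decides
store.rewrite(plan_a, 0)       # following plan A led to failure
store.rewrite(plan_b, 1)       # following plan B led to success
after = store.read(query)[0]   # Q-values now shift the retrieval choice
print(first.trajectory, "->", after.trajectory)
```

The point of the sketch is that the planner itself never changes: outcomes only edit the Q-values in memory, and that alone changes which case is retrieved for the same query.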

SDPO as the gradient-based analog (2601.20802): Reflexion converts environment feedback into stored verbal reflections used at the next rollout, a memory-update mechanism. Self-Distillation Policy Optimization (SDPO) converts environment feedback into gradient-distilled improvements to the policy weights, a parameter-update mechanism. Both reject the scalar reward as load-bearing; both treat rich environment signal as already containing the teaching; both leverage the model's in-context retrospection capability (Reflexion: explicit verbal reflection on what went wrong; SDPO: the policy conditioned on feedback as self-teacher). The pair frames a design choice: when environment feedback is rich enough to retrospect on, do you store it as episodic memory (Reflexion) or distill it into weights (SDPO)? Storage avoids parameter changes but accumulates context cost; distillation avoids context cost but commits the update to weights. See "Can environment feedback replace scalar rewards in policy learning?"

Original note title

verbal reflection stored as episodic memory lets agents learn from trial and error without parameter updates — the environment is the teacher