How do case memory and Q-function updates enable better retrieval decisions over time?
This explores how a retrieval system can improve over time by remembering past situations (case memory) and learning a value estimate for its choices (a Q-function), so that deciding *when* and *what* to retrieve becomes something it gets better at with experience. No single note in the corpus combines those two pieces by name, but several attack the same problem from different angles, and reading them side by side is where the idea comes alive.
The clearest match for the 'Q-function' half is the work that reframes retrieval as a Markov Decision Process. Instead of retrieving on a fixed schedule, the model treats each reasoning step as a choice (pull external knowledge, or trust what it already knows) and learns the value of each choice (see: When should language models retrieve external knowledge versus use internal knowledge?). That learned 'when to retrieve' policy is the kind of thing a Q-function buys you: fewer wasteful lookups, less noise from irrelevant documents, and a reported 21.99% improvement that the note attributes to better-targeted retrieval and to *not* retrieving when external knowledge would only add noise. A lighter-weight cousin reaches a similar conclusion without reinforcement learning at all: calibrated token-probability uncertainty lets the model decide when to retrieve using its own self-knowledge, beating more elaborate adaptive schemes on single-hop tasks and matching them on multi-hop at a fraction of the cost (see: Can simple uncertainty estimates beat complex adaptive retrieval?). The lateral lesson: the decision of whether to retrieve is often more valuable to learn than the retrieval itself, and one paper says the model already half-knows the answer.
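To make the 'learn the value of each choice' idea concrete, here is a minimal tabular Q-learning sketch. It is an illustration only, not the method of any note in the corpus: the two confidence states, the action names, the reward values, and the helper names `q_update` and `greedy_action` are all assumptions chosen for the example.

```python
from collections import defaultdict

# Hypothetical action set: fetch external documents, or answer from internal knowledge.
ACTIONS = ("retrieve", "use_internal")


def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step.

    Moves Q(s, a) toward reward + gamma * max_a' Q(s', a').
    A next_state of None marks the end of an episode.
    """
    best_next = 0.0 if next_state is None else max(q[(next_state, a)] for a in ACTIONS)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


def greedy_action(q, state):
    """Pick the action with the highest learned value in this state."""
    return max(ACTIONS, key=lambda a: q[(state, a)])


q = defaultdict(float)

# Made-up experience tuples: (state, action, reward, next_state).
# Rewards are assumptions: retrieval helps when the model is unsure,
# and costs a little (latency, noise) when it already knows the answer.
experience = [
    ("low_confidence", "retrieve", 1.0, None),
    ("low_confidence", "use_internal", -1.0, None),
    ("high_confidence", "retrieve", -0.2, None),
    ("high_confidence", "use_internal", 1.0, None),
]

for _ in range(20):
    for state, action, reward, next_state in experience:
        q_update(q, state, action, reward, next_state)

print(greedy_action(q, "low_confidence"))
print(greedy_action(q, "high_confidence"))
```

Under these assumed rewards the learned policy retrieves only in the low-confidence state, which is the 'not retrieving when it would hurt' behavior described above. A real system would replace the two hand-named states with features of the reasoning step and the rewards with a measured answer-quality signal.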
The 'case memory' half shows up as persistent state carried across retrieval cycles. A stateful memory workspace lets a system accumulate evidence over multiple passes, notice when newly retrieved material contradicts what it gathered earlier, and dig deeper to resolve the conflict, outperforming stateless multi-step retrieval that starts fresh each round (see: Can reasoning systems maintain memory across retrieval cycles?). That's the structural payoff of memory: each retrieval decision is informed by the trace of decisions before it, not made in isolation.
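A stateful workspace of this kind can be sketched in a few lines. The version below is a toy under stated assumptions, not the implementation behind the note: the `Workspace` class, its key/value representation of evidence, and the exact-mismatch test for contradictions are all hypothetical simplifications.

```python
from dataclasses import dataclass, field


@dataclass
class Workspace:
    """Evidence that persists across retrieval cycles (hypothetical structure)."""

    facts: dict = field(default_factory=dict)      # key -> (value, source)
    conflicts: list = field(default_factory=list)  # (key, old_entry, new_entry)

    def add(self, key, value, source):
        """Store new evidence; return False if it contradicts earlier evidence."""
        prior = self.facts.get(key)
        if prior is not None and prior[0] != value:
            # Keep the earlier entry and record the disagreement for follow-up.
            self.conflicts.append((key, prior, (value, source)))
            return False
        self.facts[key] = (value, source)
        return True

    def follow_up_queries(self):
        """Turn each recorded conflict into a deeper retrieval query."""
        return [
            f"resolve {key}: {old[0]!r} vs {new[0]!r}"
            for key, old, new in self.conflicts
        ]


ws = Workspace()
first = ws.add("capital_of_france", "Paris", "doc1")   # cycle 1
second = ws.add("capital_of_france", "Lyon", "doc2")   # cycle 2 contradicts cycle 1
print(ws.follow_up_queries())
```

The point of the sketch is the contrast with a stateless loop: because the first cycle's evidence is still in the workspace, the second cycle can detect the disagreement and schedule another retrieval instead of silently overwriting or ignoring it.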
But the corpus also plants a warning flag, and it's the thing you didn't know you wanted to know: accumulating memory does *not* monotonically improve things. Continuously reprocessing and compressing past interactions follows an inverted-U curve: helpful up to a point, then degrading *below* having no memory at all, as misgrouping, lost context, and overfitting compound (see: Can a single model replace retrieval for long-term conversation memory?). So 'better decisions over time' is not guaranteed by piling up cases; a memory that learns the wrong patterns gets confidently worse. This is the same structural fragility that shows up in retrieval failure analysis, where fixed-interval triggering wastes context and the real fixes are architectural rather than incremental tuning (see: Where do retrieval systems fail and why?).
Put together, the corpus sketches a loop the question is reaching for: store cases (memory workspace), learn the value of acting on them (MDP/Q-style policy, or cheap uncertainty as a proxy), and route accordingly, but only if the accumulation stays disciplined enough to avoid the inverted-U trap. Routing queries to the structure that actually fits the task (see: Can routing queries to task-matched structures improve RAG reasoning?) is the natural next door to walk through, since a learned retrieval policy is only as good as the choices it's allowed to make.
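The loop described above (store cases, learn their value, fall back to uncertainty, keep accumulation bounded) might look roughly like the sketch below. Nothing here comes from a specific note: the `CaseMemory` class, the cosine-similarity lookup, the thresholds, and the drop-oldest eviction rule are assumptions picked to keep the example small.

```python
import math

ACTIONS = ("retrieve", "use_internal")  # hypothetical action names


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class CaseMemory:
    """Past situations, each with a running value estimate per action."""

    def __init__(self, max_cases=100, min_similarity=0.8):
        self.max_cases = max_cases          # cap on stored cases (assumed discipline rule)
        self.min_similarity = min_similarity
        self.cases = []                     # each: {"features": [...], "q": {action: value}}

    def nearest(self, features):
        """Return (case, similarity) for the most similar stored case, or (None, 0.0)."""
        best, best_sim = None, 0.0
        for case in self.cases:
            sim = cosine(features, case["features"])
            if sim > best_sim:
                best, best_sim = case, sim
        return best, best_sim

    def decide(self, features, uncertainty, threshold=0.5):
        """Use learned values when a similar case exists, else an uncertainty gate."""
        case, sim = self.nearest(features)
        if case is None or sim < self.min_similarity:
            return "retrieve" if uncertainty > threshold else "use_internal"
        return max(case["q"], key=case["q"].get)

    def record(self, features, action, reward, alpha=0.5):
        """Update the matching case's value for this action, or store a new case."""
        case, sim = self.nearest(features)
        if case is None or sim < self.min_similarity:
            case = {"features": list(features), "q": {a: 0.0 for a in ACTIONS}}
            self.cases.append(case)
            if len(self.cases) > self.max_cases:
                self.cases.pop(0)  # drop the oldest case to keep memory bounded
        case["q"][action] += alpha * (reward - case["q"][action])


mem = CaseMemory()
cold_start = mem.decide([1.0, 0.0], uncertainty=0.9)   # no cases yet: uncertainty gate
mem.record([1.0, 0.0], "retrieve", reward=-1.0)        # retrieval hurt here
mem.record([1.0, 0.0], "use_internal", reward=1.0)     # internal knowledge worked
warm = mem.decide([0.99, 0.05], uncertainty=0.9)       # similar situation seen again
print(cold_start, warm)
```

On a cold start the uncertainty gate decides; once a similar case has accumulated outcomes, the stored values override it. The size cap is only a stand-in for the 'disciplined accumulation' the paragraph calls for, and a bounded list alone would not prevent the inverted-U degradation noted earlier.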
Sources (6 notes)
DeepRAG models each reasoning step as a Markov Decision Process where the model learns when to retrieve versus rely on parametric knowledge. The 21.99% improvement comes from better-targeted retrieval and elimination of noise from unnecessary external knowledge.
Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.
ComoRAG demonstrates that iterative evidence acquisition with a persistent memory workspace outperforms stateless multi-step retrieval by detecting and resolving contradictions through deeper exploration, achieving up to 11% gains on complex queries.
COMEDY merges memory generation, compression, and response into one operation, tracking event recaps, user portraits, and relationship dynamics without vector-DB retrieval. However, empirical work shows that continuous reprocessing follows an inverted-U curve, eventually degrading below the no-memory baseline due to misgrouping, context loss, and overfitting.
RAG systems fail at three structural levels: adaptive triggering (fixed intervals waste context), semantic-task mismatch (embeddings measure association, not relevance), and mathematical limits (embedding dimension constrains representable document sets). These require fundamentally different retrieval approaches, not tuning.
StructRAG demonstrates that selecting the knowledge structure type based on query demands, via a DPO-trained router choosing among tables, graphs, algorithms, catalogues, and chunks, improves knowledge-intensive reasoning over standard retrieval. The approach grounds this in cognitive load and cognitive fit theory from cognitive science.