Does supervising retrieval steps outperform final answer rewards?
Can intermediate feedback on retrieval decisions (which documents to fetch, when to stop) train agentic RAG systems more effectively than rewarding only the final answer? This matters because a poor retrieval path can reach a correct answer by accident, while a good path can be penalized by a noisy evaluation metric.
Agentic RAG systems must make sequences of retrieval decisions — which query to issue next, which documents to process, when to stop retrieving. Training these systems on final answer accuracy alone (outcome-only reward) evaluates the end result without supervising the path. Poor intermediate retrieval decisions can accidentally produce correct final answers; good decisions can be penalized by noisy evaluation metrics.
RAG-Gym demonstrates that fine-grained process supervision — providing reward signals for individual intermediate retrieval steps, not just the final answer — substantially boosts agentic RAG performance. The improvement comes from two directions: correct retrieval steps are explicitly rewarded, and incorrect steps (retrieving irrelevant documents, issuing redundant queries) are explicitly penalized.
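To make the contrast concrete, here is a minimal sketch of how the two reward schemes assign credit to the same trajectory. Everything in it is hypothetical and chosen for illustration: the `Step` fields, the +1/-1 scoring, and the toy trajectory are assumptions, not RAG-Gym's actual reward definition or annotation procedure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One retrieval action in an agent trajectory (hypothetical schema)."""
    query: str
    relevant: bool           # assumed step-level judgment: did it fetch useful evidence?
    redundant: bool = False  # assumed flag: does it repeat an earlier query?


def outcome_only_rewards(steps: List[Step], answer_correct: bool) -> List[float]:
    """Outcome-only reward: every step inherits the final-answer score."""
    reward = 1.0 if answer_correct else 0.0
    return [reward for _ in steps]


def process_rewards(steps: List[Step]) -> List[float]:
    """Process reward: each step is scored on its own merits.

    Assumed scoring: +1 for a useful, non-redundant step, -1 otherwise.
    """
    return [1.0 if (s.relevant and not s.redundant) else -1.0 for s in steps]


# A trajectory that happens to end in a correct answer despite two weak steps.
trajectory = [
    Step("capital of Australia", relevant=True),
    Step("capital of Australia", relevant=True, redundant=True),
    Step("Sydney opera house tickets", relevant=False),
]

print(outcome_only_rewards(trajectory, answer_correct=True))  # [1.0, 1.0, 1.0]
print(process_rewards(trajectory))                            # [1.0, -1.0, -1.0]
```

Under the outcome-only scheme the redundant and irrelevant steps receive the same credit as the useful one, which is the accidental-success problem described above. The step-level scheme separates them.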
Three post-training algorithms were compared: PPO, DPO, and online DPO. DPO with both positive and negative feedback significantly outperforms PPO and single-direction training. The mechanism: DPO trains the model to prefer good retrieval chains over bad ones by directly contrasting them. Providing negative examples (what a bad intermediate step looks like) gives the model a gradient direction that outcome-only reward cannot supply.
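The contrast mechanism can be sketched with the standard DPO objective applied to a single step-level preference pair. This is an illustrative sketch, not the RAG-Gym training code: the function name, the `beta` default, and the example log-probabilities are assumptions, and in practice the log-probabilities would come from the policy and a frozen reference model scoring a preferred and a dispreferred retrieval action for the same state.

```python
import math


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value; controls how strongly the margin is scaled
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * [(logp_chosen - ref_logp_chosen)
                                - (logp_rejected - ref_logp_rejected)]))
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(m)) == softplus(-m), computed in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical log-probabilities for two candidate retrieval steps.
# Policy already prefers the good step relative to the reference: lower loss.
good_policy = dpo_loss(-1.0, -3.0, ref_logp_chosen=-2.0, ref_logp_rejected=-2.0)
# Policy prefers the bad step relative to the reference: higher loss.
bad_policy = dpo_loss(-3.0, -1.0, ref_logp_chosen=-2.0, ref_logp_rejected=-2.0)

print(good_policy < bad_policy)  # True
```

Because the loss depends on the gap between the chosen and rejected step, the rejected example contributes its own gradient term: the policy is pushed away from the bad step as well as toward the good one.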
The parallel to reasoning: the note "Does failed-step fraction predict reasoning quality better?" shows that in reasoning chains, intermediate step quality predicts final quality better than global features. RAG-Gym shows the same at the agentic level: the quality of individual retrieval steps drives answer quality, and a final-answer reward alone cannot capture that.
Inquiring lines that use this note as a source 38
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- How does uncertainty-gated retrieval compare to continuous retrieval efficiency?
- What makes reranking during retrieval better than catching failures at plan time?
- How do outcome and process rewards differ in their treatment of intermediate steps?
- Why does retrieval quality sometimes conflict with final answer quality?
- How does process supervision relate to execution-signaled feedback approaches?
- What makes trajectory more actionable than absolute scores for human moderators?
- How do retrieval systems handle feedback expressed as negations rather than preferences?
- Should retrieval be triggered always or only for difficult questions?
- How much does agent performance depend on demonstration quantity versus curation quality?
- Could eliminating retrieval entirely work better than shifting the burden?
- Can step-level rewards improve training of agentic retrieval systems?
- What makes process-level supervision better than outcome-only reward signals?
- How does process-focused feedback compare to outcome-focused feedback in skill training?
- Why do more detailed rating systems sometimes improve learning from reviews?
- What makes process-level supervision better than outcome-only rewards for RAG training?
- Can other RAG hyperparameters like chunk size be learned through generator feedback?
- Can generator feedback backpropagate through the entire retrieval pipeline?
- How do outcome-based and process-based reward models differ in supervision cost?
- How do RAG and prompting techniques differ in supporting each granularity level?
- How does proactive information-gathering capability differ from passive knowledge retrieval?
- Can RAG systems game user preferences by adding irrelevant citations?
- How do task stream groupings provide long-horizon learning signals for curation decisions?
- How do composite rewards attribute curation outcomes to specific skill library changes?
- Can external retrieval signals outperform internal self-assessment during revision?
- How does tree-search topology convert outcome rewards into intermediate supervision?
- How should retrieval systems decide when to fetch new information?
- What threshold combinations for uncertainty and rarity signals maximize RAG performance?
- How do confidence thresholds compare to learned policies for triggering retrieval?
- What role does document reranking play alongside decisions about whether to retrieve?
- How do token-level rewards and rubric gates serve different statistical functions?
- What five requirements do enterprise RAG systems need beyond accuracy?
- Can adaptive retrieval triggered by model uncertainty improve RAG reliability?
- How do agents decide when to stop and reflect on failure?
- How does machine feedback enable discovery at test time?
- What role does retrieval mechanism design play in forecast accuracy?
- Why does externalizing bookkeeping raise effective feedback compute?
- Can retrieval systems decide when to retrieve instead of always querying?
- Do information gathering and task execution require different incentive structures?
Related concepts in this collection 7
- Does failed-step fraction predict reasoning quality better?
  Can we use the fraction of abandoned reasoning branches to forecast whether a model will solve a problem correctly? This matters because it could guide more efficient test-time scaling than simply adding more tokens.
  same principle at the reasoning level; intermediate step quality predicts outcome quality; the insight transfers from reasoning chains to retrieval chains
- Does RL improve domain reasoning by adding knowledge or removing it?
  When reinforcement learning improves reasoning in specialized domains like medicine, is it teaching models new facts or preventing them from using wrong ones? Understanding this distinction matters for how we design RL training.
  RL refines the path, not just the endpoint; process-level supervision is a more direct version of this principle
- Can RL agents learn to reason better, not just succeed?
  Standard outcome-only RL rewards agents for any successful trajectory, even flawed ones. Can we instead train agents to demonstrate genuine reasoning quality by rewarding the metacognitive process itself?
  parallel agentic process supervision: RLVMR provides programmatic meta-reasoning rewards (planning/exploration/reflection/monitoring) for agentic navigation; RAG-Gym provides step-level retrieval rewards for agentic search; both demonstrate that outcome-only RL reinforces flawed trajectories in agentic settings
- Can we reward reasoning steps without human annotation?
  Existing RL for reasoning uses only final-answer rewards, causing models to produce wastefully long chains. Can information theory provide dense, automatic feedback for individual reasoning steps?
  L2T provides the information-theoretic framework explaining why process rewards outperform outcome-only: per-episode information gain quantifies each step's contribution to correctness, which is exactly what outcome-only reward cannot supply; the theoretical grounding for RAG-Gym's empirical finding
- Can document count be learned instead of fixed in RAG?
  Standard RAG systems use a fixed number of documents regardless of query complexity. Can an RL agent learn to dynamically select both how many documents and their order based on what helps the generator produce correct answers?
  complementary RL in RAG: DynamicRAG learns what to include (document selection), RAG-Gym learns how to retrieve (step quality); both use generator output as reward signal
- When should language models retrieve external knowledge versus use internal knowledge?
  Can we model retrieval as a per-step decision problem rather than an always-on strategy? This matters because unnecessary retrieval adds noise and latency without improving accuracy.
  shared MDP framing: DeepRAG learns per-step retrieve-or-not decisions, RAG-Gym supervises the quality of retrieval steps; DeepRAG optimizes the when, RAG-Gym optimizes the how
- Why do outcome-based reward models fail at intermediate step evaluation?
  Outcome-based reward models (ORMs) evaluate only final results, creating a mismatch with the need to assess reasoning quality at intermediate steps. Understanding this failure mode matters for building better AI reasoning systems.
  RAG-Gym is a domain-specific validation of the ORM/PRM trade-off: outcome-only reward in retrieval creates the same false-negative problem (correct intermediate retrieval penalized by later errors) that ORMs exhibit in reasoning; process-level supervision provides the dense step-feedback that PRMs enable
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- RAG-Gym: Systematic Optimization of Language Agents for Retrieval-Augmented Generation
- Let’s Verify Step by Step
- Atom-Searcher: Enhancing Agentic Deep Research via Fine-Grained Atomic Thought Reward
- Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning
- UR2: Unify RAG and Reasoning through Reinforcement Learning
- Checklists Are Better Than Reward Models For Aligning Language Models
- Retrieval-augmented reasoning with lean language models
- Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs
Original note title
process-level supervision substantially outperforms outcome-only reward for training agentic rag systems