Why do outcome-based reward models fail at intermediate step evaluation?
Outcome-based reward models (ORMs) evaluate only final results, creating a mismatch with the need to assess reasoning quality at intermediate steps. Understanding this failure mode matters for building better AI reasoning systems.
The Reasoning Language Models Blueprint provides a precise taxonomy of the two primary reward model families and their failure modes:
Outcome-Based Reward Models (ORMs):
- Evaluate reasoning solely based on final outcome: P(correct(z_{T+1}) | z_0, ..., z_{T+1})
- Training objective is misaligned with intermediate step evaluation — they are trained on final outcomes only
- Systematically pessimistic for intermediate steps: a correct intermediate step can look "wrong" if a subsequent error occurs
- High false-negative rate: ORMs underestimate solvability of problems from intermediate states
- Cannot distinguish between "the chain got lucky" and "the chain reasoned correctly"
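The pessimism toward intermediate steps can be made concrete with a toy example. The sketch below assumes a hypothetical data layout (each chain is a list of per-step correctness flags) and a simplifying rule that the final answer is right only when every step is right. It labels every prefix the way outcome-only training would, by the result of the full chain, and counts how many genuinely correct prefixes inherit a negative label. It is an illustration, not the Blueprint's method.

```python
from typing import List, Tuple

# Hypothetical toy data: per-step correctness flags for three reasoning chains.
# An ORM never sees these flags; it only sees the final outcome.
chains: List[List[bool]] = [
    [True, True, True],    # fully correct chain
    [True, True, False],   # correct until the last step
    [True, False, False],  # early error
]

def outcome_label(chain: List[bool]) -> bool:
    """Outcome label under the toy assumption: right answer only if all steps are right."""
    return all(chain)

def prefix_labels(chain: List[bool]) -> List[Tuple[bool, bool]]:
    """For each prefix, return (prefix_is_actually_correct, label_inherited_from_outcome)."""
    final = outcome_label(chain)
    return [(all(chain[: t + 1]), final) for t in range(len(chain))]

def false_negative_rate(all_chains: List[List[bool]]) -> float:
    """Fraction of truly correct prefixes that inherit a negative outcome label."""
    correct_prefixes = 0
    mislabeled = 0
    for chain in all_chains:
        for is_correct, label in prefix_labels(chain):
            if is_correct:
                correct_prefixes += 1
                if not label:
                    mislabeled += 1
    return mislabeled / correct_prefixes

print(false_negative_rate(chains))  # 0.5: half of the correct prefixes look "wrong"
```

In this toy setup, three of the six correct prefixes belong to chains that later fail, so an outcome-only signal marks them as negative even though nothing was wrong with them at that point.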
Process-Based Reward Models (PRMs):
- Evaluate reasoning step-by-step: P(correct(z_t) | z_0, ..., z_t)
- Dense rewards enable error localization — can pinpoint which step went wrong
- Better alignment with MCTS, which requires per-action evaluation rather than per-trajectory evaluation
- Trade-off: require extensive step-level annotations, either written by skilled human annotators (expensive) or generated by LLMs (lower quality due to limited self-evaluation capability)
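Error localization with dense rewards can be sketched in a few lines. The example assumes hypothetical per-step scores standing in for a PRM's output and a hypothetical threshold of 0.5; neither value comes from the source. A single trajectory-level score, by contrast, carries no step index to report.

```python
from typing import List, Optional

def first_bad_step(step_scores: List[float], threshold: float = 0.5) -> Optional[int]:
    """Return the index of the first step scored below threshold, or None if all pass."""
    for i, score in enumerate(step_scores):
        if score < threshold:
            return i
    return None

# Hypothetical per-step scores, one per reasoning step (illustrative values only).
prm_scores = [0.95, 0.91, 0.32, 0.40]

# A hypothetical ORM-style score for the same chain: one number, no location.
orm_score = 0.35

print(first_bad_step(prm_scores))  # 2: the third step is the first to fall below 0.5
```

The threshold is a design choice: a lower value flags only steps the model is confident are wrong, while a higher value flags more steps at the risk of false alarms.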
Q-Value models (Q-VMs) vs V-Value models (V-VMs): A further split. Q-VMs evaluate Q(s, a) — expected cumulative reward for taking action a in state s — and are preferred for MCTS because they evaluate edges (actions), not just nodes (states). V-VMs evaluate V(s) — expected cumulative reward from state s — and provide a broader state-level view but less guidance for action selection.
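The difference in how the two value models guide action selection can be sketched as follows. The function names, the tuple-of-steps state encoding, and the lookup tables are all hypothetical. The sketch also assumes a deterministic transition with no intermediate reward, in which case Q(s, a) equals V of the resulting state, so both routes pick the same action and differ only in what they need to evaluate.

```python
from typing import Callable, Dict, List, Tuple

State = Tuple[str, ...]   # hypothetical encoding: the reasoning steps taken so far
Action = str              # hypothetical encoding: the next reasoning step

def select_with_q(state: State, actions: List[Action],
                  q: Callable[[State, Action], float]) -> Action:
    """A Q-VM scores each edge (state, action) directly."""
    return max(actions, key=lambda a: q(state, a))

def select_with_v(state: State, actions: List[Action],
                  transition: Callable[[State, Action], State],
                  v: Callable[[State], float]) -> Action:
    """A V-VM scores nodes, so each action must first be expanded into its child state."""
    return max(actions, key=lambda a: v(transition(state, a)))

def append_step(state: State, action: Action) -> State:
    """Toy deterministic transition: taking an action appends it to the chain."""
    return state + (action,)

# Illustrative values only.
q_table: Dict[Tuple[State, Action], float] = {((), "factor"): 0.8, ((), "expand"): 0.3}
v_table: Dict[State, float] = {("factor",): 0.8, ("expand",): 0.3}

root: State = ()
candidates = ["factor", "expand"]
best_q = select_with_q(root, candidates, lambda s, a: q_table[(s, a)])
best_v = select_with_v(root, candidates, append_step, lambda s: v_table[s])
print(best_q, best_v)  # factor factor
```

With a V-VM, ranking actions requires generating every child state before scoring it, which is the extra step the Q-VM avoids during tree search.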
Generative Reward Models (GRMs) as a third category: The RRM and DeepSeek-GRM papers introduce a third family alongside ORMs and PRMs. GRMs harness LLMs to produce interpretable, natural-language feedback rather than scalar scores. They can follow adaptive evaluation instructions, construct synthetic training data, and self-improve through iterative refinement. GRMs unify scoring of single, paired, and multiple responses within a pure language representation. However, concerns persist about evaluation reliability: LLMs may produce biased or hallucinated judgments that diverge from human standards. As the note "Can reward models benefit from reasoning before scoring?" argues, GRMs become most powerful when combined with extended reasoning before judgment.
This taxonomy explains why the note "Can self-supervised process rewards replace human annotation?" matters: annotation cost is the primary bottleneck for PRMs, and self-supervised approaches address precisely this.
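One widely used way to derive step-level labels from outcome labels alone is Monte Carlo rollout estimation: complete the chain several times from each prefix and record how often the final answer is correct. The sketch below illustrates that general idea and is not necessarily the method of the note referenced above. The `complete` and `is_correct` callables, the rollout count, and the toy completer are all hypothetical stand-ins for a real policy model and answer checker.

```python
import random
from typing import Callable, List

def rollout_step_values(
    steps: List[str],
    complete: Callable[[List[str], random.Random], str],
    is_correct: Callable[[str], bool],
    n_rollouts: int = 8,
    seed: int = 0,
) -> List[float]:
    """Estimate, for each prefix, the fraction of sampled completions that end correctly."""
    rng = random.Random(seed)
    values = []
    for t in range(1, len(steps) + 1):
        prefix = steps[:t]
        hits = sum(is_correct(complete(prefix, rng)) for _ in range(n_rollouts))
        values.append(hits / n_rollouts)
    return values

def toy_complete(prefix: List[str], rng: random.Random) -> str:
    """Toy stand-in for a policy model: any prefix containing 'bad' ends in a wrong answer.

    A real completer would sample from a model; rng is accepted but unused here.
    """
    return "0" if "bad" in prefix else "42"

values = rollout_step_values(["ok", "ok", "bad"], toy_complete, lambda ans: ans == "42")
print(values)  # [1.0, 1.0, 0.0]
```

Only the outcome check `is_correct` requires supervision here; the per-step values fall out of sampling, which is what removes the step-level annotation cost at the price of extra compute.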
The ORM/PRM split is also the reason "Can curriculum learning approximate expensive process supervision?" is significant: R3 uses outcome supervision only, yet achieves process-supervision-like step feedback through a reverse curriculum that slides the starting point backward from task completion.
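The reverse-curriculum idea can be sketched as a schedule over a single demonstration. This is a minimal illustration based on the description in this collection (a start point that slides backward from task completion), not R3's actual implementation; the function name and the list-of-strings step format are assumptions.

```python
from typing import List, Tuple

def reverse_curriculum(demo_steps: List[str]) -> List[Tuple[List[str], List[str]]]:
    """Build (given_prefix, steps_left_to_generate) stages, easiest stage first.

    The first stage hands the model all but the last step of the demonstration;
    the last stage hands it nothing, so it must solve the task from scratch.
    """
    stages = []
    for start in range(len(demo_steps) - 1, -1, -1):
        stages.append((demo_steps[:start], demo_steps[start:]))
    return stages

# Hypothetical three-step demonstration.
for given, remaining in reverse_curriculum(["s1", "s2", "s3"]):
    print(f"given={given} remaining={remaining}")
```

Each stage is still scored only on the final outcome. Because the early stages leave just one or two steps to generate, a failed outcome there points at those specific steps, which is how an outcome-only signal comes to behave like step-level feedback.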
Inquiring lines that use this note as a source (5)
This note is a source for these synthesized inquiries.
- How do outcome and process rewards differ in their treatment of intermediate steps?
- Do outcome-only reward signals miss step-level errors that compound later?
- How do outcome-based and process-based reward models differ in supervision cost?
- What failure modes do imitation and outcome methods each address?
- How does process-based reward differ from outcome-only reward in training?
Related concepts in this collection (8)
- Can self-supervised process rewards replace human annotation?
  Self-supervised PRMs learn from outcome labels alone, avoiding expensive step-level annotation. The key question is whether this approach generalizes beyond math and code to domains with ambiguous correctness.
  addresses the annotation cost problem this note identifies
- Can curriculum learning approximate expensive process supervision?
  Can a reverse curriculum that slides backward from task completion provide step-level insight comparable to human process annotations, but at outcome supervision cost?
  architectural workaround for the ORM/PRM trade-off
- Does supervising retrieval steps outperform final answer rewards?
  Can intermediate feedback on retrieval decisions (which documents to fetch, when to stop) train agentic RAG systems more effectively than rewarding only the final answer? This matters because poor retrieval paths can accidentally succeed or good ones can fail on noisy metrics.
  RAG-Gym extends the PRM advantage to agentic retrieval systems
- Does failed-step fraction predict reasoning quality better?
  Can we use the fraction of abandoned reasoning branches to forecast whether a model will solve a problem correctly? This matters because it could guide more efficient test-time scaling than simply adding more tokens.
  PRM-detectable signal: failed steps as quality predictor
- Can RL agents learn to reason better, not just succeed?
  Standard outcome-only RL rewards agents for any successful trajectory, even flawed ones. Can we instead train agents to demonstrate genuine reasoning quality by rewarding the metacognitive process itself?
  agentic process supervision: RLVMR's programmatic meta-reasoning rewards (planning/exploration/reflection/monitoring) are a domain-specific PRM variant for agentic tasks, providing dense intermediate feedback without human annotation
- Can judges that reason about reasoning outperform classifier rewards?
  Can process reward models generate explanations about why steps are correct rather than simply classifying them? This explores whether meta-reasoning about reasoning improves both accuracy and generalization in step-level evaluation.
  resolves the ORM/PRM trade-off differently: StepWiser makes process rewards self-supervised (no annotation cost) AND generative (interpretable reasoning about each step); self-segmentation into chunks-of-thought also addresses the step boundary problem that limits standard PRMs
- Can generative reasoning beat discriminative models with less training data?
  Do process reward models that generate reasoning before judging achieve better performance than traditional discriminative approaches when trained on dramatically smaller datasets? This tests whether generative verification can scale more efficiently.
  GenPRM/ThinkPRM collapse the ORM/PRM trade-off: generative PRMs achieve PRM-quality dense step evaluation with ORM-level annotation costs (1% of PRM800K data), because reasoning-before-judging extracts more signal per training example
- Can we reward reasoning steps without human annotation?
  Existing RL for reasoning uses only final-answer rewards, causing models to produce wastefully long chains. Can information theory provide dense, automatic feedback for individual reasoning steps?
  a third option in the ORM/PRM taxonomy: L2T provides dense information-theoretic process rewards via PAC-Bayes bounds and Fisher information, annotation-free like ORMs but dense like PRMs; also quantifies the cost of outcome-only training: more than double the needed tokens
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Reasoning Language Models: A Blueprint
- Test-Time Scaling with Reflective Generative Model
- Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
- Atom-Searcher: Enhancing Agentic Deep Research via Fine-Grained Atomic Thought Reward
- StepWiser: Stepwise Generative Judges for Wiser Reasoning
- Learning to Think: Information-Theoretic Reinforcement Fine-Tuning for LLMs
- Reward Reasoning Model
- Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning
Original note title
outcome-based reward models are systematically pessimistic for intermediate reasoning steps while process-based models provide dense rewards at high annotation cost