SYNTHESIS NOTE

Why do outcome-based reward models fail at intermediate step evaluation?

Outcome-based reward models (ORMs) evaluate only final results, which mismatches the need to assess reasoning quality at intermediate steps. This failure mode matters wherever a reward model has to score partial reasoning chains, for example to guide search over reasoning steps.

Synthesis note · 2026-02-22 · sourced from Reasoning Architectures

The Reasoning Language Models Blueprint provides a precise taxonomy of the two primary reward model families and their failure modes:

Outcome-Based Reward Models (ORMs):

Evaluate reasoning solely on the final outcome: P(correct(z_{T+1}) | z_0, ..., z_{T+1})
Training objective is misaligned with intermediate-step evaluation: ORMs are trained on final outcomes only
Systematically pessimistic for intermediate steps: a correct intermediate step can look "wrong" if a subsequent error occurs
High false-negative rate: ORMs underestimate solvability of problems from intermediate states
Cannot distinguish between "the chain got lucky" and "the chain reasoned correctly"
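
The pessimism described above can be illustrated with a small calculation. This is a minimal sketch, not taken from the Blueprint: it assumes a chain of num_steps steps where each step is independently correct with probability step_accuracy, and where the final answer is correct only if every step is. Both the function name and the independence assumption are illustrative.

```python
def outcome_target_for_correct_prefix(num_steps: int, step_accuracy: float) -> list[float]:
    """For each prefix length t, return P(final answer correct | first t steps correct).

    Assumption (illustrative only): steps are independent and the final answer
    is correct only if all steps are correct, so the value is
    step_accuracy ** (num_steps - t). A model trained only on final outcomes
    sees this as the target for a fully correct prefix.
    """
    return [step_accuracy ** (num_steps - t) for t in range(num_steps + 1)]


targets = outcome_target_for_correct_prefix(num_steps=10, step_accuracy=0.9)
for t in (0, 5, 10):
    print(f"correct prefix of length {t}: outcome target = {targets[t]:.3f}")
```

Under these assumptions a flawless five-step prefix gets an outcome target of about 0.590, and the empty prefix about 0.349, even though nothing in the prefix is wrong. The gap comes entirely from errors that may happen later.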

Process-Based Reward Models (PRMs):

Evaluate reasoning step-by-step: P(correct(z_t) | z_0, ..., z_t)
Dense rewards enable error localization — can pinpoint which step went wrong
Better alignment with MCTS, which requires per-action evaluation rather than per-trajectory evaluation
Trade-off: PRMs require extensive step-level annotations, either from skilled human annotators (expensive) or generated by LLMs (lower quality, because LLMs have limited self-evaluation capability)
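
The difference between dense and sparse feedback can be sketched as follows. The per-step scores here are made-up numbers standing in for PRM outputs, and the 0.5 threshold and function names are assumptions for illustration, not part of any cited system.

```python
from typing import Optional, Sequence


def first_failing_step(step_scores: Sequence[float], threshold: float = 0.5) -> Optional[int]:
    """Return the index of the first step scored below threshold, or None."""
    for index, score in enumerate(step_scores):
        if score < threshold:
            return index
    return None


def outcome_only_rewards(final_correct: bool, num_steps: int) -> list[float]:
    """Sparse signal: every step gets 0 except the last, which carries the outcome."""
    rewards = [0.0] * num_steps
    if num_steps > 0:
        rewards[-1] = 1.0 if final_correct else 0.0
    return rewards


# Hypothetical PRM scores for a four-step chain whose third step is wrong.
prm_scores = [0.95, 0.91, 0.32, 0.40]
failing_index = first_failing_step(prm_scores)
sparse = outcome_only_rewards(final_correct=False, num_steps=len(prm_scores))
print("first failing step (dense signal):", failing_index)
print("outcome-only rewards:", sparse)
```

With the dense scores the error is localized to index 2. The outcome-only vector for the same failed chain is all zeros, so it carries no information about which step caused the failure.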

Q-Value models (Q-VMs) vs V-Value models (V-VMs): A further split. Q-VMs evaluate Q(s, a) — expected cumulative reward for taking action a in state s — and are preferred for MCTS because they evaluate edges (actions), not just nodes (states). V-VMs evaluate V(s) — expected cumulative reward from state s — and provide a broader state-level view but less guidance for action selection.
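
The practical difference for action selection can be sketched in a few lines. This is an illustrative sketch, assuming deterministic transitions (appending a step to a reasoning prefix) and the standard relation Q(s, a) = r(s, a) + discount * V(s'). The toy states, values, and helper names are hypothetical.

```python
from typing import Callable, Dict, Sequence, Tuple


def select_with_q(state: str, actions: Sequence[str],
                  q_value: Dict[Tuple[str, str], float]) -> str:
    """Q-VM: score each edge (state, action) directly and take the best."""
    return max(actions, key=lambda a: q_value[(state, a)])


def select_with_v(state: str, actions: Sequence[str],
                  transition: Callable[[str, str], str],
                  reward: Callable[[str, str], float],
                  v_value: Dict[str, float],
                  discount: float = 1.0) -> str:
    """V-VM: needs a one-step lookahead through the transition to rank actions."""
    return max(
        actions,
        key=lambda a: reward(state, a) + discount * v_value[transition(state, a)],
    )


# Toy example: a state is a reasoning prefix, an action is the next step.
actions = ["a", "b"]
q_table = {("s0", "a"): 0.2, ("s0", "b"): 0.7}
v_table = {"s0/a": 0.2, "s0/b": 0.7}

best_by_q = select_with_q("s0", actions, q_table)
best_by_v = select_with_v(
    "s0", actions,
    transition=lambda s, a: f"{s}/{a}",   # appending a step is deterministic
    reward=lambda s, a: 0.0,              # no intermediate reward in this toy case
    v_value=v_table,
)
print(best_by_q, best_by_v)
```

Both selectors pick the same action here, but the V-based one has to expand every child state first, while the Q-based one ranks the edges from the current node directly. That is the sense in which Q-VMs fit per-action evaluation in tree search.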

Generative Reward Models (GRMs) as a third category: The RRM and DeepSeek-GRM papers introduce a third family alongside ORMs and PRMs. GRMs harness LLMs to produce interpretable, natural-language feedback rather than scalar scores. They can follow adaptive evaluation instructions, construct synthetic training data, and self-improve through iterative refinement. GRMs unify scoring of single, paired, and multiple responses within a pure language representation. However, concerns persist about evaluation reliability: LLMs may produce biased or hallucinated judgments that diverge from human standards. As the related note "Can reward models benefit from reasoning before scoring?" argues, GRMs become most powerful when combined with extended reasoning before judgment.

This taxonomy explains why the note "Can self-supervised process rewards replace human annotation?" matters: annotation cost is the primary bottleneck for PRMs, and self-supervised approaches address precisely this.

The ORM/PRM split is also the reason the note "Can curriculum learning approximate expensive process supervision?" is significant: R3 uses outcome supervision only but achieves process-supervision-like step feedback by decomposing the problem into a curriculum.
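
The curriculum idea can be sketched under one reading of "reverse curriculum": start the policy from prefixes of a known-correct demonstration, beginning close to the end and sliding the start point backward. This is an assumption-laden illustration, not R3's actual implementation, and the function name is hypothetical.

```python
def reverse_curriculum_starts(demonstration: list[str]) -> list[list[str]]:
    """Return start prefixes of a correct demonstration, longest first.

    Stage 0 leaves only the last step for the policy to produce; each later
    stage exposes one more step, ending with the empty prefix (full problem).
    """
    n = len(demonstration)
    return [demonstration[:k] for k in range(n - 1, -1, -1)]


demo = ["step 1", "step 2", "step 3"]
stages = reverse_curriculum_starts(demo)
for stage, prefix in enumerate(stages):
    print(f"stage {stage}: start from {prefix}")
```

Each stage is still rewarded on the final outcome only. Because consecutive stages differ by exactly one exposed step, a drop in success between stages points at that step, which is how outcome supervision can yield step-like feedback in this sketch.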


Related concepts in this collection 8


Can self-supervised process rewards replace human annotation? Self-supervised PRMs learn from outcome labels alone, avoiding expensive step-level annotation. The key question is whether this approach generalizes beyond math and code to domains with ambiguous correctness.
addresses the annotation cost problem this note identifies
Can curriculum learning approximate expensive process supervision? Can a reverse curriculum that slides backward from task completion provide step-level insight comparable to human process annotations, but at outcome supervision cost?
architectural workaround for the ORM/PRM trade-off
Does supervising retrieval steps outperform final answer rewards? Can intermediate feedback on retrieval decisions—which documents to fetch, when to stop—train agentic RAG systems more effectively than rewarding only the final answer? This matters because poor retrieval paths can accidentally succeed or good ones can fail on noisy metrics.
RAG-Gym extends the PRM advantage to agentic retrieval systems
Does failed-step fraction predict reasoning quality better? Can we use the fraction of abandoned reasoning branches to forecast whether a model will solve a problem correctly? This matters because it could guide more efficient test-time scaling than simply adding more tokens.
PRM-detectable signal: failed steps as quality predictor
Can RL agents learn to reason better, not just succeed? Standard outcome-only RL rewards agents for any successful trajectory, even flawed ones. Can we instead train agents to demonstrate genuine reasoning quality by rewarding the metacognitive process itself?
agentic process supervision: RLVMR's programmatic meta-reasoning rewards (planning/exploration/reflection/monitoring) are a domain-specific PRM variant for agentic tasks, providing dense intermediate feedback without human annotation
Can judges that reason about reasoning outperform classifier rewards? Can process reward models generate explanations about why steps are correct rather than simply classifying them? This explores whether meta-reasoning about reasoning improves both accuracy and generalization in step-level evaluation.
resolves the ORM/PRM trade-off differently: StepWiser makes process rewards self-supervised (no annotation cost) AND generative (interpretable reasoning about each step); self-segmentation into chunks-of-thought also addresses the step boundary problem that limits standard PRMs
Can generative reasoning beat discriminative models with less training data? Do process reward models that generate reasoning before judging achieve better performance than traditional discriminative approaches when trained on dramatically smaller datasets? This tests whether generative verification can scale more efficiently.
GenPRM/ThinkPRM collapse the ORM/PRM trade-off: generative PRMs achieve PRM-quality dense step evaluation with ORM-level annotation costs (1% of PRM800K data), because reasoning-before-judging extracts more signal per training example
Can we reward reasoning steps without human annotation? Existing RL for reasoning uses only final-answer rewards, causing models to produce wastefully long chains. Can information theory provide dense, automatic feedback for individual reasoning steps?
a third option in the ORM/PRM taxonomy: L2T provides dense information-theoretic process rewards via PAC-Bayes bounds and Fisher information, annotation-free like ORMs but dense like PRMs; also quantifies the cost of outcome-only training — more than double the needed tokens
