SYNTHESIS NOTE
Training, RL, and Test-Time Scaling

Can judges that reason about reasoning outperform classifier rewards?

Can process reward models generate explanations about why steps are correct rather than simply classifying them? This explores whether meta-reasoning about reasoning improves both accuracy and generalization in step-level evaluation.

Synthesis note · 2026-02-22 · sourced from Reinforcement Learning
How should we allocate compute budget at inference time?

Current process reward models (PRMs) have two major limitations: they function as black-box classifiers providing scores without explanations, and their reliance on SFT with static datasets limits generalization. StepWiser addresses both by reframing stepwise reward as a reasoning task rather than a classification task.

The architecture has three components. First, self-segmentation: the base policy model learns to segment its own chains-of-thought into coherent "chunks of thought" — each representing a complete logical leap rather than arbitrary step boundaries. This reduces total segments and produces more informative units. Second, chunk annotation: each chunk receives a binary label by comparing outcomes of rollouts starting before and after the chunk. Third, RL training: the judge model is trained via GRPO to produce judgment reasoning chains (reasoning about reasoning) before delivering a verdict.

The self-segmentation is critical. Current methods segment at "Step 1, Step 2" markers or double line breaks, producing fragments that are neither logically complete nor self-contained. StepWiser's segments each serve a single clear objective — setting up an equation, executing a calculation, stating a conclusion. This gives the judge model meaningful units to evaluate.

The meta-reasoning aspect — the judge reasoning about the policy model's reasoning — is what distinguishes this from traditional PRMs. The judge doesn't just classify steps as correct/incorrect; it articulates WHY a step is correct or flawed. Since Can self-supervised process rewards replace human annotation?, StepWiser advances this further by making the reward model generative and explainable.

The practical results: better judgment accuracy on intermediate steps, improved policy model training, and better inference-time search. The approach also connects to the emerging pattern that since Does chain of thought reasoning actually explain model decisions?, having a dedicated judge that explicitly reasons about reasoning quality may be more reliable than relying on the reasoning trace itself.

Dual confirmation from GenPRM and ThinkPRM: Two independent papers reinforce the generative-over-discriminative advantage with striking data efficiency results. GenPRM shows that a 1.5B generative PRM outperforms GPT-4o as a discriminative verifier — the generation objective forces the model to understand why a step is correct or flawed, not just classify it. ThinkPRM demonstrates even more extreme efficiency: using only 1% of the PRM800K dataset beats full-dataset discriminative PRMs, because the reasoning-before-judging approach extracts more signal per training example. Both confirm that process verification benefits from the same "think before judging" principle that makes generative approaches more data-efficient across domains. See Can generative reasoning beat discriminative models with less training data?.

Inquiring lines that use this note as a source 78

This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.

Related concepts in this collection 4

This note in its neighbourhood — explore the map, then jump to a related concept in the list below.

Concept map
14 direct connections · 98 in 2-hop network ·medium cluster Open in graph ↗

Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph

your link semantically near linked from elsewhere

Related papers in this collection 8

Papers most semantically related to this note, ranked by cosine similarity in the embedding space.

Original note title

generative stepwise judges that meta-reason about reasoning steps outperform classifier-based process reward models