SYNTHESIS NOTE
Training, RL, and Test-Time Scaling

Can generative reasoning beat discriminative models with less training data?

Do process reward models that generate reasoning before judging achieve better performance than traditional discriminative approaches when trained on dramatically smaller datasets? This tests whether generative verification can scale more efficiently.

Synthesis note · 2026-02-22 · sourced from RLVR

Process reward models (PRMs) are central to test-time scaling but face three limitations: limited generalization across models and tasks, dependence on scalar value prediction that ignores the generative abilities of LLMs, and an inability to scale verification compute at test time. Two converging approaches, GenPRM and ThinkPRM, address these limitations by reframing process supervision as a generative task.

GenPRM performs Chain-of-Thought reasoning and code verification before judging each reasoning step. It combines Relative Progress Estimation (RPE), a relative criterion for label estimation used in place of hard labels, with a rationale synthesis framework that includes code verification, and achieves strong results with only 23K training examples from MATH. A 1.5B GenPRM outperforms GPT-4o on ProcessBench; a 7B version surpasses Qwen2.5-Math-PRM-72B.
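To make the contrast between a relative criterion and a hard label concrete, here is a minimal sketch. It is not GenPRM's published procedure: it assumes each step has a Monte Carlo estimate of the chance of reaching a correct answer before and after the step, and the ratio form, the 0.8 threshold, and the hard-label rule are all illustrative choices that this note does not confirm.

```python
def hard_label(mc_after: float) -> int:
    """Hard label (assumed rule): a step counts as correct if any rollout
    from it still reaches the right answer, i.e. the estimate is above zero."""
    return int(mc_after > 0.0)


def relative_progress(mc_before: float, mc_after: float) -> float:
    """Ratio of estimated success probability after the step to before it.

    Both inputs are hypothetical Monte Carlo estimates in [0, 1].
    """
    return mc_after / mc_before


def rpe_label(mc_before: float, mc_after: float, threshold: float = 0.8) -> int:
    """Relative label (assumed form): the step is correct only if it keeps
    most of the success probability it inherited from the previous state.

    The threshold value is an assumption made for illustration.
    """
    if mc_before <= 0.0:
        # The state was already hopeless before this step; treating the
        # step as incorrect here is an illustrative choice.
        return 0
    return int(relative_progress(mc_before, mc_after) >= threshold)


# A step that drops the success estimate from 0.9 to 0.1 still passes the
# hard rule (some rollouts succeed) but fails the relative rule.
print(hard_label(0.1), rpe_label(0.9, 0.1))  # 1 0
# A step that holds a low estimate steady at 0.2 passes the relative rule.
print(rpe_label(0.2, 0.2))  # 1
```

Under these assumptions, the relative rule judges a step by what it did to the solution's prospects rather than by whether any path to the answer remains.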

ThinkPRM capitalizes on the inherent reasoning abilities of long CoT models, fine-tuning with as few as 8K synthetic verification chains. Using only 1% of the process labels in PRM800K, ThinkPRM outperforms LLM-as-a-Judge and discriminative verifiers across ProcessBench, MATH-500, and AIME '24. In out-of-domain evaluation, it surpasses discriminative PRMs trained on the full PRM800K by 8% on GPQA-Diamond and 4.5% on LiveCodeBench.

The key structural advantage: generative PRMs uniquely support simultaneous scaling of both generator and verifier compute. Discriminative PRMs output a fixed scalar; generative PRMs can be forced to think longer, producing more thorough verification. Under the same token budget, ThinkPRM scales verification compute more effectively than LLM-as-a-Judge, outperforming it by 7.2% on ProcessBench.
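One way to picture scaling verifier compute is to sample several independent verification chains per candidate and average their verdicts. The sketch below illustrates that idea only and is not the aggregation used by GenPRM or ThinkPRM. The `Verifier` type, `verification_score`, `best_of_n`, and the scripted stub are hypothetical names, and the stub stands in for a real generative PRM call.

```python
from itertools import cycle
from typing import Callable, Dict, Sequence

# Hypothetical interface: one call samples one verification chain for a
# candidate solution and returns its final verdict (True = judged correct).
Verifier = Callable[[str], bool]


def verification_score(solution: str, sample_verdict: Verifier, num_chains: int) -> float:
    """Average the verdicts of `num_chains` sampled verification chains."""
    votes = [sample_verdict(solution) for _ in range(num_chains)]
    return sum(votes) / num_chains


def best_of_n(candidates: Sequence[str], sample_verdict: Verifier, num_chains: int) -> str:
    """Return the candidate with the highest averaged verification score."""
    return max(candidates, key=lambda c: verification_score(c, sample_verdict, num_chains))


def make_scripted_verifier(script: Dict[str, Sequence[bool]]) -> Verifier:
    """Deterministic stub for illustration: replays scripted verdicts in order."""
    streams = {solution: cycle(verdicts) for solution, verdicts in script.items()}
    return lambda solution: next(streams[solution])


# Scripted noise: the first sampled verdict is misleading for both candidates.
script = {"A": [True, False, False, False], "B": [False, True, True, True]}

print(best_of_n(["A", "B"], make_scripted_verifier(script), num_chains=1))  # A
print(best_of_n(["A", "B"], make_scripted_verifier(script), num_chains=4))  # B
```

In this toy setup, a single chain picks candidate A on a noisy verdict, while four chains recover B. A discriminative PRM that emits one fixed scalar per step has no comparable sampling knob.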

For the related question "Can judges that reason about reasoning outperform classifier rewards?", GenPRM and ThinkPRM provide the strongest evidence and the most specific mechanisms. For the related question "Can reward models benefit from reasoning before scoring?", generative PRMs establish the paradigm: the verifier should think before judging, just as the generator should think before answering.

Original note title

generative process reward models that reason before judging outperform discriminative prms with orders of magnitude less data