INQUIRING LINE

How much data do generative process reward models actually need?

This explores how much labeled training data generative process reward models (PRMs that reason before judging each step) actually require — and the corpus answer is: dramatically less than the discriminative classifiers they replace, sometimes almost none.


This explores how much labeled training data generative process reward models actually need: reward models that write out reasoning about each step before scoring it, rather than emitting a bare classifier label. The short answer the corpus keeps returning is far less than you'd expect, and the gap with older methods is measured in orders of magnitude, not percentages. GenPRM and ThinkPRM reframe process supervision as a generative task where the model thinks before it judges, and the payoff is data efficiency: a 1.5B-parameter GenPRM beats GPT-4o, and ThinkPRM matches or surpasses full-dataset discriminative verifiers using just 1% of the PRM800K labels (see "Can generative reasoning beat discriminative models with less training data?"). StepWiser reaches the same conclusion from a different angle: training judges to produce reasoning chains about the policy's steps, rather than to classify them, improves both accuracy and data efficiency (see "Can judges that reason about reasoning outperform classifier rewards?").

Why would reasoning before judging need less data? A clue comes from the parallel finding that reasoning lets reward models scale their effort at evaluation time. Three independent teams (RRM, RM-R1, DeepSeek-GRM) found that adding chain-of-thought before scoring raises the capability ceiling of a reward model beyond what outcome-based scoring achieves (see "Can reward models benefit from reasoning before scoring?"). If the model can spend more compute reasoning at inference, it leans less on having memorized a vast labeled dataset during training: capability migrates from data to compute.
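To make the data-to-compute trade concrete, here is a minimal sketch of one simple way extra inference compute can buy accuracy: sampling several independent verdicts and taking a majority vote. This is an illustrative stand-in, not the mechanism used by RRM, RM-R1, or DeepSeek-GRM. The function name `majority_vote_accuracy` and the per-sample accuracy of 0.7 are assumptions chosen for the example, and the independence of samples is a simplification that real reasoning chains will not fully satisfy.

```python
from math import comb


def majority_vote_accuracy(p_correct: float, n_samples: int) -> float:
    """Probability that a strict majority of n sampled verdicts is correct.

    Assumption (hypothetical setup): each sampled verdict is correct with
    probability p_correct, independently of the others. n_samples must be
    odd so that ties cannot occur.
    """
    if n_samples < 1 or n_samples % 2 == 0:
        raise ValueError("n_samples must be a positive odd integer")
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be in [0, 1]")
    needed = n_samples // 2 + 1  # smallest count that forms a strict majority
    # Sum the binomial probabilities of getting at least `needed` correct verdicts.
    return sum(
        comb(n_samples, k) * p_correct**k * (1.0 - p_correct) ** (n_samples - k)
        for k in range(needed, n_samples + 1)
    )


if __name__ == "__main__":
    # Same judge, same training data; only the inference budget changes.
    for n in (1, 3, 9, 33):
        print(f"samples={n:>2}  accuracy={majority_vote_accuracy(0.7, n):.3f}")
```

Under these assumptions the aggregate accuracy rises with the number of samples for any judge that is better than chance, and stays at 0.5 for a judge that is exactly at chance. More compute amplifies evaluative skill that is already there, but it does not create it.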

The deeper reason may be that much of the 'judgment' ability is already latent in the base model. Work on RLVR dynamics shows reinforcement learning often activates pretrained strategies rather than teaching genuinely new ones: a single training example can suffice to switch on a capability, and even spurious rewards work nearly as well as correct ones when the pretraining is right (see "What does reward learning actually do to model reasoning?"). The companion finding that RLVR sharpens sampling efficiency without expanding the underlying reasoning boundary points the same way (see "Does RLVR actually expand what models can reason about?"). If a generative PRM is mostly eliciting evaluative skill the model already has, you don't need millions of labels to install it; you need a small, well-chosen set to surface it.
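The "sharpens sampling without expanding the boundary" claim is usually measured with pass@k, so a short sketch may help. The estimator below is the standard unbiased form, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The per-problem counts for the two models are made-up toy numbers, not results from any paper; they are chosen only to show how a model can win at k=1 and lose at large k.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct.

    Returns the probability that at least one of k samples, drawn without
    replacement from the n, is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, which gives pass@k = 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: Sequence[int], n: int, k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


if __name__ == "__main__":
    n = 100
    # Hypothetical toy data: correct samples out of 100 on four problems.
    base_counts = [5, 3, 2, 1]    # rarely right, but covers every problem
    rlvr_counts = [60, 50, 0, 0]  # often right on two, never on the others
    for k in (1, 10, 100):
        base = mean_pass_at_k(base_counts, n, k)
        rlvr = mean_pass_at_k(rlvr_counts, n, k)
        print(f"k={k:>3}  base={base:.3f}  rlvr={rlvr:.3f}")
```

In this toy setup the sharpened model is far ahead at k=1, while the broader model overtakes it once k is large enough to reach its rare correct samples. That crossover is the signature the pass@k analysis looks for.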

The most radical answer in the corpus is that you may need *zero* hand-annotated process labels. A cluster of methods derives step-level supervision from structure instead of annotation: Tree-GRPO, Supervised RL, and ToolPO convert sparse outcome rewards into dense step signals by exploiting tree topology, expert-aligned actions, or tool-call positions (see "Can trajectory structure replace hand-annotated process rewards?"). Tree-search rollouts in particular turn trajectory-level outcomes into step-wise preferences by comparing sibling subtrees, scaling with compute budget rather than labeling budget (see "Can tree structure alone convert outcome rewards into process supervision?"). And Post-Completion Learning shows a model can internalize self-evaluation during training using otherwise-unused sequence space, computing its own reward with no external labeled reward model at all (see "Can models learn to evaluate their own work during training?").
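The sibling-subtree idea can be sketched in a few lines. What follows is a simplified illustration of the general principle, not Tree-GRPO's actual algorithm: each step is scored by the mean outcome reward of the leaves beneath it, and any two siblings with different scores yield a step-level preference. The `Node` class, the function names, and the example tree are all hypothetical.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import List, Optional, Tuple


@dataclass
class Node:
    step: str                         # text of the reasoning step at this node
    outcome: Optional[float] = None   # outcome reward, set only on leaves
    children: List["Node"] = field(default_factory=list)


def leaf_outcomes(node: Node) -> List[float]:
    """Collect the outcome rewards of every leaf under this node."""
    if not node.children:
        if node.outcome is None:
            raise ValueError(f"leaf {node.step!r} has no outcome reward")
        return [node.outcome]
    outcomes: List[float] = []
    for child in node.children:
        outcomes.extend(leaf_outcomes(child))
    return outcomes


def subtree_value(node: Node) -> float:
    """Score a step by the mean outcome of the rollouts that pass through it."""
    outcomes = leaf_outcomes(node)
    return sum(outcomes) / len(outcomes)


def sibling_preferences(root: Node) -> List[Tuple[str, str, float]]:
    """Return (preferred_step, rejected_step, margin) for unequal sibling pairs."""
    prefs: List[Tuple[str, str, float]] = []
    stack = [root]
    while stack:
        node = stack.pop()
        for a, b in combinations(node.children, 2):
            va, vb = subtree_value(a), subtree_value(b)
            if va > vb:
                prefs.append((a.step, b.step, va - vb))
            elif vb > va:
                prefs.append((b.step, a.step, vb - va))
            # Equal values carry no preference signal and are skipped.
        stack.extend(node.children)
    return prefs


if __name__ == "__main__":
    # Hypothetical rollout tree; only the leaves carry outcome rewards.
    tree = Node("problem", children=[
        Node("factor", children=[Node("roots ok", 1.0), Node("roots ok again", 1.0)]),
        Node("guess", children=[Node("lucky guess", 1.0), Node("wrong guess", 0.0)]),
    ])
    for preferred, rejected, margin in sibling_preferences(tree):
        print(f"{preferred!r} > {rejected!r} (margin {margin:.2f})")
```

No step in the tree was labeled by hand. Every preference comes from final outcomes plus the branching structure, so the amount of supervision grows with the number of rollouts sampled rather than with annotation effort.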

So 'how much data?' splits into a sharper question: data for *what*? For generative PRMs that judge with reasoning, roughly 1% of the PRM800K labels can match or beat discriminative verifiers trained on the full set. Push further and structural or self-generated supervision can replace the annotation entirely, trading a labeling bill for a compute bill. The takeaway: the bottleneck in process supervision was never really the volume of labels; it was the format of the judge.


Sources 8 notes

Can generative reasoning beat discriminative models with less training data?

GenPRM and ThinkPRM reframe process supervision as generative tasks with CoT reasoning before judgment, achieving superior performance on far fewer labels. A 1.5B GenPRM beats GPT-4o; ThinkPRM uses only 1% of PRM800K labels to surpass full-dataset discriminative verifiers.

Can judges that reason about reasoning outperform classifier rewards?

StepWiser demonstrates that training judges to produce reasoning chains about policy reasoning—rather than classify steps—yields better judgment accuracy and data efficiency. Independent confirmation from GenPRM and ThinkPRM shows generative PRMs outperform discriminative ones with orders of magnitude less training data.

Can reward models benefit from reasoning before scoring?

Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

Can trajectory structure replace hand-annotated process rewards?

Tree-GRPO, Supervised RL, and ToolPO each convert sparse outcome rewards into dense step signals by exploiting different structural features—tree topology, expert-aligned actions, and tool-call positions—eliminating the need for annotated process reward models.

Can tree structure alone convert outcome rewards into process supervision?

Tree-GRPO uses branching structure to transform trajectory-level outcome rewards into step-level preference signals through sibling subtree comparison, eliminating the need for separate process reward models or step-level annotation while scaling with computational budget.

Can models learn to evaluate their own work during training?

Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. This question remains open: **How much labeled data do generative process reward models actually need, and what is the true bottleneck in process supervision?**

What a curated library found — and when (dated claims, not current truth):
These findings span 2024–2025; treat them as perishable constraints to be re-tested.

• GenPRM and ThinkPRM reframe process supervision as generative reasoning before judgment: a 1.5B GenPRM beats GPT-4o, and ThinkPRM surpasses full-dataset discriminative verifiers using only ~1% of PRM800K labels (2025-04, 2025-08).
• Reasoning at test time (chain-of-thought before reward scoring) raises the capability ceiling of reward models beyond outcome-based scoring, migrating capability from data to inference-time compute (RRM, RM-R1, DeepSeek-GRM, ~2025-05).
• RLVR dynamics show RL often activates latent pretrained strategies rather than teaching new ones; a single example can switch on a capability, suggesting most judgment ability is already in the base model (2025-07).
• Structural supervision methods (Tree-GRPO, Supervised RL, ToolPO, Post-Completion Learning) derive step-level signals from outcome rewards, tree topology, or self-generated reasoning—zero hand-annotated process labels needed (2025-06, 2025-07).
• The bottleneck is judge *format*, not label volume: the format (reasoning-first vs. bare classification) determines data efficiency far more than dataset size.

Anchor papers (verify; mind their dates):
• arXiv:2504.00891 (GenPRM, 2025-04)
• arXiv:2508.19229 (StepWiser, 2025-08)
• arXiv:2507.14843 (RLVR dynamics, 2025-07)
• arXiv:2507.20252 (Post-Completion Learning, 2025-07)

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For every claim above—1% label efficiency, test-time compute scaling, latent-activation dynamics, zero-annotation supervision—check whether newer models (GPT-4.5, o3, or later), training methods, evaluation harnesses, or orchestration (multi-agent loops, memory caching, synthetic data pipelines) have since RELAXED or OVERTURNED it. Separate the durable question (what is the true bottleneck?) from perishable artifacts (1% is the frontier; specific model gaps). Cite what resolved each constraint, plainly saying where it still holds.

(2) **Surface the strongest CONTRADICTING or SUPERSEDING work** from the last ~6 months. Look for papers showing that label volume *does* matter at scale, or that judge format gains saturate, or that latent base-model ability is not the limiting factor.

(3) **Propose 2 research questions** that assume the regime may have moved: (a) If reasoning-first judges are now standard, what is the next bottleneck—sample efficiency per training step, or generalization to out-of-distribution steps? (b) Can structural supervision + small-scale annotation hybrid methods match or beat outcome-only RL on truly open-ended tasks?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
