Can self-supervised process rewards replace human annotation?

Self-supervised PRMs learn from outcome labels alone, avoiding expensive step-level annotation. The key question is whether this approach generalizes beyond math and code to domains with ambiguous correctness.

Synthesis note · 2026-02-20 · sourced from Test Time Compute

Process Reward Models (PRMs) provide step-level feedback that outperforms outcome-level evaluation for test-time scaling. But training them requires expensive step-level human annotations — a bottleneck that limits scale.
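As a rough illustration of how step-level scores feed test-time scaling, the sketch below picks the best of N sampled solutions by aggregating each solution's per-step PRM scores. The aggregation rule (geometric mean) and the names `aggregate_geometric_mean` and `best_of_n` are assumptions made for this example, not something the note specifies.

```python
import math

def aggregate_geometric_mean(step_scores):
    """Collapse per-step scores in (0, 1] into one solution-level score.

    Assumption: geometric mean, so a single very weak step pulls the
    whole solution down more than an arithmetic mean would.
    """
    if not step_scores:
        raise ValueError("step_scores must not be empty")
    # Clamp to avoid log(0) when a step is scored exactly zero.
    log_sum = sum(math.log(max(s, 1e-12)) for s in step_scores)
    return math.exp(log_sum / len(step_scores))

def best_of_n(candidates):
    """candidates: list of (answer, step_scores) pairs.

    Returns the answer whose aggregated step score is highest.
    """
    if not candidates:
        raise ValueError("candidates must not be empty")
    best = max(candidates, key=lambda c: aggregate_geometric_mean(c[1]))
    return best[0]

# Hypothetical scores: "A" has one badly scored step, "B" is uniformly decent.
candidates = [("A", [0.9, 0.9, 0.1]), ("B", [0.7, 0.7, 0.7])]
chosen = best_of_n(candidates)
```

Here the step-level view prefers "B", because "A" contains a step the scorer considers likely wrong even though its other steps score higher.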

MetaStone-S1's Self-supervised PRM (SPRM) addresses this bottleneck: it learns process evaluation from outcome labels alone, using a dynamic weighting scheme that gives higher weight to steps whose pseudo-labels (the SPRM's own predictions) are consistent with the correctness of the final answer. No human annotation of intermediate steps is required.
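The weighting idea described above can be sketched as a loss function. This is a minimal sketch under stated assumptions, not MetaStone-S1's actual implementation: the 0.5 threshold, the binary weights, the `low_weight` parameter, and the function names are all hypothetical choices made for illustration.

```python
import math

def bce(score, label, eps=1e-7):
    """Binary cross-entropy for a single predicted probability."""
    p = min(max(score, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def sprm_style_loss(step_scores, outcome_label, low_weight=0.0, threshold=0.5):
    """Outcome-supervised step loss with consistency-based weights.

    step_scores:   the model's own per-step correctness probabilities.
    outcome_label: 1 if the final answer is correct, else 0.
    Each step is trained toward the outcome label, but steps whose
    pseudo-label (score thresholded at `threshold`, an assumption)
    disagrees with the outcome get `low_weight` instead of 1.0.
    """
    if not step_scores:
        raise ValueError("step_scores must not be empty")
    weights = []
    total = 0.0
    for s in step_scores:
        pseudo_label = 1 if s > threshold else 0
        w = 1.0 if pseudo_label == outcome_label else low_weight
        weights.append(w)
        total += w * bce(s, outcome_label)
    return total / len(step_scores), weights

# Hypothetical trajectory with a correct final answer; the middle step
# is scored as likely wrong, so it is down-weighted rather than forced
# toward the "correct" label.
loss, weights = sprm_style_loss([0.9, 0.2, 0.8], outcome_label=1)
```

The design intent is that a correct final answer does not force every intermediate step to be labeled correct: steps the model itself doubts contribute less, which limits the label noise that outcome-only supervision would otherwise inject.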

With this approach, a 32B-parameter model matches OpenAI o3-mini performance, which is evidence that self-supervised process supervision can work. The open question is breadth: math and code have clear, verifiable outcomes (right or wrong is unambiguous). Can the same approach work in domains where outcome correctness is fuzzy, such as reasoning about complex social situations, medical diagnosis, or open-ended writing?

The scale argument for SPRMs is strong: if you can eliminate step-level annotation, you can train PRMs on any domain where outcome labels exist. That's a massive expansion of the training data available for process supervision. The question is whether the quality holds.

Supporting evidence that AI evaluation can be reliable comes from domain summarization: persona-based summaries of healthcare documents (doctor, patient, and general-public personas), evaluated with GPT-4 as critic, showed good concordance with human critiques of the same summaries. The finding is domain-specific but points in the same direction: AI evaluation can match human judgment in structured evaluation tasks, at least when the evaluation criteria are sufficiently well-defined. This suggests that generalizing SPRMs beyond math and code may be more tractable than it first appears, provided outcome criteria can be made similarly explicit.

Trajectory-aware PRMs: ReasonFlux-PRM identifies a new requirement as reasoning models adopt the trajectory-response output format (a lengthy exploratory thinking trajectory followed by a polished final response). Standard PRMs, trained on final responses, fail to supervise intermediate thinking trajectories for two reasons: (1) thinking trajectories contain branching and self-revision that linear final responses do not; (2) thinking trajectories have weaker global coherence across steps. ReasonFlux-PRM adds trajectory-level supervision alongside step-level supervision to handle both components. The upshot: as R1-style models become standard, the PRM training problem bifurcates. A PRM must evaluate both the exploratory trace and the polished response, not just the latter. Self-supervised approaches must be extended to handle the trajectory-response format explicitly.
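One way to picture combining step-level and trajectory-level supervision is a single blended reward. The sketch below is an illustrative assumption, not ReasonFlux-PRM's published formulation: the `ScoredSample` type, the linear blend, and the `alpha` parameter are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ScoredSample:
    """Scores for one trajectory-response pair (all fields hypothetical)."""
    thinking_step_scores: List[float]   # per-step scores on the exploratory trace
    response_step_scores: List[float]   # per-step scores on the final response
    trajectory_score: float             # one score for the sample as a whole

def combined_reward(sample: ScoredSample, alpha: float = 0.5) -> float:
    """Blend mean step-level score with the trajectory-level score.

    alpha = 0 uses step-level signal only; alpha = 1 uses the
    trajectory-level signal only. The linear blend is an assumption.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    steps = sample.thinking_step_scores + sample.response_step_scores
    if not steps:
        raise ValueError("sample has no step scores")
    step_level = sum(steps) / len(steps)
    return (1.0 - alpha) * step_level + alpha * sample.trajectory_score

# Hypothetical sample: individually plausible steps, weak overall coherence.
sample = ScoredSample([0.5, 0.7], [0.9], trajectory_score=0.4)
reward = combined_reward(sample, alpha=0.5)
```

A trajectory-level term like this lets a scorer penalize a trace whose individual steps look fine but which wanders or contradicts itself, the kind of weak global coherence that per-step scores alone would miss.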

Related concepts in this collection

How do internal and external test-time scaling compare? Explores whether test-time scaling approaches fundamentally differ in where compute is spent: during training (internal) versus at inference (external). Understanding this split clarifies the trade-offs in deployment strategy and reasoning capability.
SPRMs are a component of external TTS
Why do outcome-based reward models fail at intermediate step evaluation? Outcome-based reward models (ORMs) evaluate only final results, creating a mismatch with the need to assess reasoning quality at intermediate steps. Understanding this failure mode matters for building better AI reasoning systems.
the foundational trade-off this note resolves: PRMs are superior to ORMs for step evaluation but require expensive annotation; self-supervised PRMs eliminate the annotation bottleneck while preserving the dense reward advantage
Can AI systems improve their own learning strategies? Current self-improvement relies on fixed human-designed loops that break when tasks change. The question is whether agents can develop their own adaptive metacognitive processes instead of depending on human intervention.
self-supervised PRMs advance the metacognitive evaluation component: by learning to evaluate reasoning steps from outcome signals without human annotation, SPRMs move process supervision from extrinsic (human-designed labels) toward intrinsic (model-learned evaluation)
Can judges that reason about reasoning outperform classifier rewards? Can process reward models generate explanations about why steps are correct rather than simply classifying them? This explores whether meta-reasoning about reasoning improves both accuracy and generalization in step-level evaluation.
extends: StepWiser adds generative explanation and self-segmentation on top of self-supervised labeling; the judge reasons about why a step is correct rather than just classifying it, making process rewards both annotation-free and explainable
