INQUIRING LINE

Do process reward models need different supervision strategies by domain?

This explores whether process reward models (PRMs), the systems that score a model's intermediate reasoning steps rather than just its final answer, need to be built differently depending on the field they're judging. The corpus suggests the answer splits in two: domain-specialized knowledge matters, but the *supervision signal itself* can often be harvested structurally rather than hand-tuned per domain.


The corpus gives a sharper answer than a flat yes or no. There's a clear case that domain *knowledge* has to be baked in: a finance-specific PRM that integrates expert knowledge bases catches factual and regulatory errors that a general PRM misses, because in finance a step can be logically coherent and still wrong (see "Can general process reward models catch factual errors in finance?"). So when the failure modes are domain-specific, such as a misquoted regulation or a fabricated citation, the supervisor needs to know the domain.

But a second thread argues the *mechanism* for generating step-level signal is surprisingly domain-agnostic. Several methods skip annotated PRMs entirely by mining supervision from the structure of the reasoning itself. Tree-search rollouts compare sibling branches to turn a single outcome reward into step-wise preferences (see "Can tree structure alone convert outcome rewards into process supervision?"). Reverse-curriculum training slides the start state backward to expose where reasoning breaks (see "Can curriculum learning approximate expensive process supervision?"). A broader family exploits trajectory topology, expert-aligned actions, or tool-call positions (see "Can trajectory structure replace hand-annotated process rewards?"). These methods don't ask "what domain is this?"; they ask "what structure does the trajectory have?"
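The sibling-comparison idea can be made concrete with a small sketch. This is an illustration under stated assumptions, not the Tree-GRPO implementation: the `Node` class, the use of mean leaf reward as a branch's value, and the `margin` parameter are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field
from itertools import combinations
from statistics import mean
from typing import List, Optional, Tuple


@dataclass
class Node:
    step: str                       # text of the reasoning step at this node
    reward: Optional[float] = None  # outcome reward; set on leaves only
    children: List["Node"] = field(default_factory=list)


def leaf_rewards(node: Node) -> List[float]:
    """Collect the outcome rewards of every leaf under this node."""
    if not node.children:
        return [node.reward]
    return [r for child in node.children for r in leaf_rewards(child)]


def value(node: Node) -> float:
    """Assumed branch value: mean outcome reward over the leaves it leads to."""
    return mean(leaf_rewards(node))


def sibling_preferences(node: Node, margin: float = 0.0) -> List[Tuple[str, str]]:
    """Return (preferred_step, rejected_step) pairs from sibling comparisons."""
    prefs = []
    for a, b in combinations(node.children, 2):
        va, vb = value(a), value(b)
        if va - vb > margin:
            prefs.append((a.step, b.step))
        elif vb - va > margin:
            prefs.append((b.step, a.step))
    for child in node.children:  # recurse so deeper steps get signal too
        prefs.extend(sibling_preferences(child, margin))
    return prefs


# Toy tree: only the leaves carry outcome rewards.
tree = Node("root", children=[
    Node("A", children=[Node("A1", reward=1.0), Node("A2", reward=1.0)]),
    Node("B", children=[Node("B1", reward=0.0), Node("B2", reward=1.0)]),
])
prefs = sibling_preferences(tree)
print(prefs)
```

Nothing in the sketch refers to the task domain: the only inputs are the branching structure and the leaf rewards, which is the sense in which this kind of supervision is structural.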

Where the two threads meet is interesting: even the structural methods quietly tailor themselves to the domain's *shape*. Search agents yield process rewards by treating the hardest distractors (what they read but didn't cite) as supervision, a trick that only works because search has that read/cite structure (see "Can search agent behavior yield reliable process rewards for reasoning?"). Self-supervised PRMs reach o3-mini-level results with no human step labels, but the note flags that generalization to "fuzzy-outcome" domains is unproven (see "Can self-supervised process rewards replace human annotation?"). That caveat is the whole question in miniature: the cheaper, more universal supervision strategies lean on having a crisp notion of what a correct step looks like, which some domains supply and others don't.

A third move sidesteps discriminative scoring entirely. Instead of a classifier that labels steps good or bad, generative judges that *reason about* the reasoning outperform classifier rewards with far less training data (see "Can judges that reason about reasoning outperform classifier rewards?"), and reward models that spend test-time compute thinking before they score raise the ceiling further (see "Can reward models benefit from reasoning before scoring?"). A reasoning judge can in principle adapt its scrutiny to the domain on the fly rather than being retrained per field. That may be why "different strategy by domain" is becoming less about building N specialized PRMs and more about building one supervisor flexible enough to reason about any domain.

Worth knowing for the curious: not all supervision is about scoring quality. Agent feedback decomposes into *evaluative* signal (how good was that?) and *directive* signal (how should it change?), and scalar rewards throw the directive half away (see "Can scalar rewards capture all the information in agent feedback?"). So the deeper domain question may not be "how do we grade steps in finance vs. math" but "what kind of signal does this domain's feedback naturally carry, and are we discarding the useful part?"


Sources (9 notes)

Can general process reward models catch factual errors in finance?

Fin-PRM, a finance-specific process reward model integrating expert-derived knowledge bases with step and trajectory supervision, outperforms general PRMs on financial tasks by penalizing factual and regulatory errors, not just logical incoherence.

Can tree structure alone convert outcome rewards into process supervision?

Tree-GRPO uses branching structure to transform trajectory-level outcome rewards into step-level preference signals through sibling subtree comparison, eliminating the need for separate process reward models or step-level annotation while scaling with computational budget.

Can curriculum learning approximate expensive process supervision?

R3 progressively slides the reasoning start state backward from near-completion, creating a curriculum that reveals step-level failure modes using only outcome feedback. This achieves process supervision granularity without expensive human step annotations.
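
A minimal sketch of the backward-sliding start state, assuming a correct demonstration is available as a list of steps. The function names, the `solve_rate` callable, and the 0.5 threshold are hypothetical and stand in for whatever rollout and evaluation machinery a real setup would use; this is not R3's actual code.

```python
from typing import Callable, List, Optional


def reverse_curriculum(demo_steps: List[str]) -> List[List[str]]:
    """Prefixes of a correct demonstration, ordered from near-completion
    (all but the last step given) back to an empty start."""
    n = len(demo_steps)
    return [demo_steps[:k] for k in range(n - 1, -1, -1)]


def first_failing_stage(
    stages: List[List[str]],
    solve_rate: Callable[[List[str]], float],
    threshold: float = 0.5,
) -> Optional[int]:
    """Index of the first stage whose outcome-only solve rate drops below
    the threshold, or None if every stage is solved often enough."""
    for i, prefix in enumerate(stages):
        if solve_rate(prefix) < threshold:
            return i
    return None


demo = ["s1", "s2", "s3"]
stages = reverse_curriculum(demo)


def toy_solve_rate(prefix: List[str]) -> float:
    # Stand-in for sampling completions from the policy and checking the
    # final answer: this toy policy only succeeds when step s2 is given.
    return 0.9 if "s2" in prefix else 0.2


failing = first_failing_stage(stages, toy_solve_rate)
print(stages, failing)
```

In the toy run the first failure appears at the stage where the policy must produce `s2` itself, so outcome feedback alone points at a specific step.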

Can trajectory structure replace hand-annotated process rewards?

Tree-GRPO, Supervised RL, and ToolPO each convert sparse outcome rewards into dense step signals by exploiting different structural features—tree topology, expert-aligned actions, and tool-call positions—eliminating the need for annotated process reward models.

Can search agent behavior yield reliable process rewards for reasoning?

LongTraceRL mines entity-level reasoning signals from what search agents read but don't cite—the hardest distractors—and applies rubric rewards only to correct answers, structurally blocking reward fabrication while capturing intermediate reasoning quality.
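
The gating described here (rubric credit only when the final answer is correct) can be illustrated with a short sketch. The function name, the rubric-fraction formula, and the 0.5 weight are assumptions made for the example; the note does not specify LongTraceRL's actual reward formula.

```python
def gated_reward(
    answer_correct: bool,
    rubric_hits: int,
    rubric_total: int,
    rubric_weight: float = 0.5,
) -> float:
    """Outcome reward plus a rubric bonus, where the bonus is only
    reachable when the final answer is correct (assumed form)."""
    if not answer_correct:
        # A wrong answer earns nothing, however many rubric items it
        # appears to satisfy, so rubric credit cannot be farmed.
        return 0.0
    rubric_score = rubric_hits / rubric_total if rubric_total else 0.0
    return 1.0 + rubric_weight * rubric_score


wrong_but_fluent = gated_reward(False, rubric_hits=3, rubric_total=3)
right_half_rubric = gated_reward(True, rubric_hits=1, rubric_total=2)
print(wrong_but_fluent, right_half_rubric)
```

The design choice worth noticing is that the gate sits in the reward structure rather than in a learned scorer, which is what makes the blocking structural.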

Can self-supervised process rewards replace human annotation?

MetaStone-S1's SPRM achieves o3-mini-level results using dynamic weighting of pseudo-labels instead of human-annotated steps. This eliminates the annotation bottleneck for process supervision, though generalization to fuzzy-outcome domains remains unproven.

Can judges that reason about reasoning outperform classifier rewards?

StepWiser demonstrates that training judges to produce reasoning chains about policy reasoning—rather than classify steps—yields better judgment accuracy and data efficiency. Independent confirmation from GenPRM and ThinkPRM shows generative PRMs outperform discriminative ones with orders of magnitude less training data.

Can reward models benefit from reasoning before scoring?

Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
