Why do generative reward models produce more interpretable evaluations than scalar scores?
This explores why reward models that *reason in words before judging* (generating critiques, step-by-step verdicts, chains of thought) produce evaluations you can read, inspect, and act on, while a scalar score hands you a number with no account of itself. The corpus suggests the answer isn't mainly about transparency as a nice-to-have: a scalar reward is *information-lossy in a specific way*, and generating language recovers what was lost.
The sharpest framing comes from the observation that feedback naturally splits into two kinds of information: *evaluative* (how good was this?) and *directive* (what should change?), as laid out in "Can scalar rewards capture all the information in agent feedback?". A scalar captures the first and throws away the second. So the interpretability gap is really a *content* gap: the number tells you the answer scored 0.3, but not that step four divided by an unverified quantity. Generative judges keep the directive channel alive, and that's why they read as explanations rather than verdicts. The same point shows up from the optimization side: models stuck on a plateau under numerical rewards start improving the moment they're given natural-language critiques, because the numbers 'lack critical information about why failures occur and how to improve' (see "Can natural language feedback overcome numerical reward plateaus?").
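The content gap can be made concrete with a minimal sketch. The class names `ScalarFeedback` and `CritiqueFeedback` and the helper `to_scalar` are hypothetical, introduced only for illustration; they are not taken from any of the cited notes. The point is that two judgments with different directive content become indistinguishable once reduced to a number.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ScalarFeedback:
    """Evaluative channel only: how good was the answer?"""
    score: float


@dataclass
class CritiqueFeedback:
    """Evaluative channel plus a directive channel (what should change)."""
    score: float
    critique: List[str] = field(default_factory=list)


def to_scalar(fb: CritiqueFeedback) -> ScalarFeedback:
    # Collapsing to a number keeps the score and silently drops the critique.
    return ScalarFeedback(score=fb.score)


# Two different failures that happen to receive the same score (toy values).
a = CritiqueFeedback(0.3, ["step 4 divides by an unverified quantity"])
b = CritiqueFeedback(0.3, ["final answer omits units"])

same_as_scalars = to_scalar(a) == to_scalar(b)   # the scalar view cannot tell them apart
same_as_critiques = a == b                       # the critique view can
```

Under these assumptions, anything downstream that sees only `to_scalar(...)` has no way to recover which of the two fixes is needed.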
What's surprising is that this interpretability doesn't cost performance; it seems to *cause* it. Reframing process supervision as a generative task lets a 1.5B model beat GPT-4o, and lets a verifier trained on 1% of the labels surpass full-dataset discriminative models (see "Can generative reasoning beat discriminative models with less training data?"). Judges trained to reason *about* the reasoning steps, rather than classify them, are both more accurate and more data-efficient (see "Can judges that reason about reasoning outperform classifier rewards?"). And three independent teams found that letting a reward model think before it scores raises its capability ceiling and unlocks test-time compute scaling for evaluation itself (see "Can reward models benefit from reasoning before scoring?"). The reasoning trace and the interpretability are the same artifact: you read the judge's work for the same reason the judge is better at the job.
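One simple way to picture test-time compute scaling for evaluation is to sample several reasoned judgments and take a majority vote, keeping the traces that support the winning verdict. The sketch below assumes that setup; it is not a description of how RRM, RM-R1, or DeepSeek-GRM actually work. `vote_over_judgments` and `toy_judge` are hypothetical names, and the toy judge is a random stand-in for a real generative reward model.

```python
import random
from collections import Counter
from typing import Callable, List, Tuple

Judgment = Tuple[str, bool]  # (reasoning trace, verdict)


def vote_over_judgments(
    judge: Callable[[str, random.Random], Judgment],
    answer: str,
    k: int,
    seed: int = 0,
) -> Tuple[bool, List[str]]:
    """Sample k reasoned judgments and return the majority verdict
    together with the traces that argued for it."""
    rng = random.Random(seed)
    samples = [judge(answer, rng) for _ in range(k)]
    counts = Counter(verdict for _, verdict in samples)
    majority = counts.most_common(1)[0][0]
    supporting = [trace for trace, verdict in samples if verdict == majority]
    return majority, supporting


def toy_judge(answer: str, rng: random.Random) -> Judgment:
    # Hypothetical stand-in: agrees with the ground truth about 70% of the time.
    truly_correct = answer.strip() == "4"
    verdict = truly_correct if rng.random() < 0.7 else not truly_correct
    trace = f"Checked '{answer}' against 2 + 2; judged {'correct' if verdict else 'incorrect'}."
    return trace, verdict


# More samples (larger k) spend more compute on the same evaluation.
verdict, traces = vote_over_judgments(toy_judge, "4", k=9)
```

The returned traces are the readable part: the same samples that drive the vote are the ones a person can inspect. Use an odd `k` to avoid ties between the two verdicts.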
The library also pushes on the boundary of *where* interpretable signal helps versus where it's better kept categorical. One note shows that rubric scores converted into dense rewards invite reward hacking, but the same rubrics used as accept/reject *gates* don't: the gate preserves their crisp, legible meaning while a separate dense signal optimizes inside the valid answers (see "Can rubrics and dense rewards work together without hacking?"). On the personalization side, text-based preference summaries condition reward models better than embedding vectors *and* stay readable to the users they describe (see "Can text summaries beat embeddings for personalized reward models?"). Interpretability and effectiveness move together again, with the lesson that language carries dimensions a compressed vector silently drops.
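The gate-versus-dense-reward separation can be sketched as follows. This is an illustration under stated assumptions, not the DRO implementation: `gate_then_score`, `accept_group`, and `dense_reward` are hypothetical names, and the toy rubric (every rollout must carry a `[src]` marker) and toy dense reward (shorter is better) are invented for the example. The only property the sketch is meant to show is that the rubric never contributes a gradable number; it only decides which rollout groups are eligible for scoring.

```python
from typing import Callable, List, Tuple

Group = List[str]  # one group of rollouts for the same prompt


def gate_then_score(
    groups: List[Group],
    accept_group: Callable[[Group], bool],
    dense_reward: Callable[[str], float],
) -> List[List[Tuple[str, float]]]:
    """Use the rubric as a categorical accept/reject gate over groups,
    then apply the dense reward only inside accepted groups."""
    scored = []
    for group in groups:
        if not accept_group(group):
            continue  # rejected groups receive no reward signal at all
        scored.append([(rollout, dense_reward(rollout)) for rollout in group])
    return scored


# Toy rubric (assumption): every rollout in the group must cite a source.
def cites_source(group: Group) -> bool:
    return all("[src]" in rollout for rollout in group)


# Toy dense signal (assumption): prefer shorter rollouts.
def brevity(rollout: str) -> float:
    return 1.0 / len(rollout)


groups = [
    ["4 [src]", "four, see [src]"],  # passes the gate
    ["4", "4 [src]"],                # rejected: one rollout has no citation
]
scored = gate_then_score(groups, cites_source, brevity)
```

Because the rubric verdict stays a yes/no decision, there is no rubric score for the policy to inflate; the dense signal can only rank rollouts that already satisfy it.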
The quietly radical thread: once feedback is rich enough to read, the external reward model starts to dissolve. A policy given retrospective evidence of its own mistakes in-context acts as its own process reward model, making a separate scalar scorer unnecessary (see "Can environment feedback replace scalar rewards in policy learning?"), and models can internalize self-evaluation in the unused space after their own output (see "Can models learn to evaluate their own work during training?"). So the deeper reason generative evaluations are more interpretable may be that interpretability and *usefulness* are the same property viewed from two sides: a number can only rank, but a reason can teach, and a signal that can teach is one a model can eventually learn to produce for itself.
Sources (9 notes)
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.
GenPRM and ThinkPRM reframe process supervision as generative tasks with CoT reasoning before judgment, achieving superior performance on far fewer labels. A 1.5B GenPRM beats GPT-4o; ThinkPRM uses only 1% of PRM800K labels to surpass full-dataset discriminative verifiers.
StepWiser demonstrates that training judges to produce reasoning chains about policy reasoning—rather than classify steps—yields better judgment accuracy and data efficiency. Independent confirmation from GenPRM and ThinkPRM shows generative PRMs outperform discriminative ones with orders of magnitude less training data.
Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.
DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.
PLUS trains summarizers and reward models jointly, learning that text-based preference summaries capture dimensions zero-shot summaries miss. These summaries transfer to GPT-4 for zero-shot personalization and remain interpretable to users.
SDPO converts tokenized environment feedback into dense gradient signals by using the feedback-conditioned policy as a self-teacher. The policy, when given retrospective evidence of its mistakes in-context, implicitly acts as its own process reward model, making external reward signals unnecessary.
Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.