What makes reward models fundamentally different from policy discriminators?
This explores the POLAR idea that reward models can be rebuilt as policy discriminators — scoring how close an output is to a target policy rather than judging it against fixed preference labels — and what that reframing reveals about what reward signals actually are.
The underlying question is whether reward models and policy discriminators are really two different things, or whether the usual reward model is just one narrow way of doing policy discrimination. The corpus leans hard toward the second answer. POLAR reframes reward modeling as measuring distance from a target policy: instead of learning absolute 'good vs. bad' preference labels, it scores an output higher the more it resembles a chosen reference policy, and pre-trained discriminators of this kind substantially outperform non-pre-trained reward models and transfer across task formulations (Can reward models learn by comparing policies instead of judging them?). So the 'fundamental difference' partly dissolves: a reward model is a discriminator that happens to have absolute judgment baked in, and stripping that out is what buys generalization.
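To make "score relative to a target policy" concrete, here is a toy sketch. It is not POLAR's method: a plain token-overlap (Jaccard) similarity stands in for the learned discriminator, and the function names, candidates, and reference strings are all hypothetical. The only point is that the same candidates rank differently depending on which target policy supplies the reference, with no absolute notion of "good" anywhere in the scorer.

```python
def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase whitespace tokens (a toy similarity, assumed here)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty strings are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def rank_by_target(candidates: list[str], reference: str) -> list[str]:
    """Order candidates from most to least similar to the target policy's reference."""
    return sorted(candidates, key=lambda c: token_overlap(c, reference), reverse=True)


# Hypothetical outputs for one prompt, plus references from two different target policies.
candidates = [
    "paris",
    "the capital of france is paris a city on the seine",
]
terse_reference = "paris"
verbose_reference = "the capital of france is paris"

# The preferred candidate depends on the target, not on a fixed quality label.
print(rank_by_target(candidates, terse_reference)[0])
print(rank_by_target(candidates, verbose_reference)[0])
```

A learned discriminator would replace `token_overlap`, but the interface stays the same: candidate plus reference in, relative score out.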
What makes the discriminator framing interesting is how much it lines up with a broader corpus finding: reward signals carry less new information than their scalar form suggests. RLVR studies show that reward-driven training mostly activates strategies already latent in the base model rather than teaching genuinely new reasoning: a single training example can trigger the effect, and even spurious rewards work nearly as well as correct ones for a well-pretrained model (What does reward learning actually do to model reasoning?). Pass@k analysis sharpens this: base models overtake RLVR-trained models at high k, so RLVR narrows sampling toward solutions the base model could already reach, while genuinely new capability comes from distillation, not reward (Does RLVR actually expand what models can reason about?). If reward is largely repositioning the policy inside its own distribution, then 'distance from a target policy' is arguably a more honest description of the job than 'judge of quality.'
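The pass@k argument is easier to follow with the metric written out. The sketch below uses the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The per-problem counts are invented purely for illustration and are not results from any of the cited notes: the "base" model solves both problems rarely, while the "tuned" model solves one problem often and the other never.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over problems, given the correct-sample count per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Made-up counts out of n = 100 samples per problem (illustrative only).
n = 100
base_counts = [5, 2]    # broad but unreliable coverage
tuned_counts = [60, 0]  # sharp on one problem, never solves the other

for k in (1, 50):
    base = mean_pass_at_k(base_counts, n, k)
    tuned = mean_pass_at_k(tuned_counts, n, k)
    print(f"k={k}: base={base:.3f} tuned={tuned:.3f}")
```

With these invented counts the tuned model wins at k=1 and the base model wins at k=50, which is the shape of crossover the narrowing argument describes.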
The limits of the scalar reward are where the distinction reappears, though. One line of work shows that a single number throws away information a discriminator could keep: natural feedback splits into evaluative signal (how well an action did) and directive signal (how it should change), and scalars capture the first while discarding the second (Can scalar rewards capture all the information in agent feedback?). Related approaches turn rich tokenized environment feedback into dense per-token credit, letting the policy act as its own process reward model and making an external scalar reward unnecessary (Can environment feedback replace scalar rewards in policy learning?). Binary correctness rewards can even actively damage the model: because they never penalize a confident wrong answer, they incentivize high-confidence guessing and wreck calibration unless you bolt on a proper scoring rule such as the Brier score (Does binary reward training hurt model calibration?). So the contrast isn't reward-model-vs-discriminator so much as scalar-judgment-vs-richer-comparison.
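The calibration point can be checked with a few lines of arithmetic. The sketch below assumes one common way of combining the two terms, correctness minus the squared gap between stated confidence and correctness (the Brier score is a loss, so it is subtracted). The function names are hypothetical and the exact weighting used in the cited work may differ. Binary reward ignores confidence entirely, while the combined reward is maximized in expectation by reporting the true probability of being correct.

```python
def binary_reward(correct: bool) -> float:
    """Correctness-only reward: stated confidence has no effect at all."""
    return 1.0 if correct else 0.0


def calibrated_reward(correct: bool, confidence: float) -> float:
    """Correctness minus the Brier penalty (confidence - outcome)^2. Assumed form."""
    outcome = 1.0 if correct else 0.0
    return outcome - (confidence - outcome) ** 2


def expected_reward(p_correct: float, confidence: float) -> float:
    """Expected calibrated reward when the answer is right with probability p_correct."""
    return (p_correct * calibrated_reward(True, confidence)
            + (1.0 - p_correct) * calibrated_reward(False, confidence))


def best_confidence(p_correct: float) -> float:
    """Grid-search the stated confidence that maximizes expected reward."""
    grid = (i / 100 for i in range(101))
    return max(grid, key=lambda q: expected_reward(p_correct, q))


# A wrong answer costs more when it was stated with high confidence.
print(calibrated_reward(False, 0.9), calibrated_reward(False, 0.1))
# The reward-maximizing confidence matches the true success probability.
print(best_confidence(0.3))
```

Because the squared-error penalty is a proper scoring rule, overstating or understating confidence both lower the expected reward.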
The most surprising adjacent thread is that the reward signal can come from the policy itself rather than from any separate model. An agent's own shifting beliefs toward a solution provide dense intrinsic reward with no critic or reward model at all (Can an agent's own beliefs guide credit assignment without critics?); majority voting across samples on unlabeled data manufactures a usable reward with no trained evaluator (Can models improve themselves using only majority voting?); and agents can treat the consequences of their own actions as supervision, sidestepping external reward entirely (Can agents learn from their own actions without external rewards?). Read together, the corpus suggests the real distinction isn't between two kinds of model but along a spectrum of how much structure the signal carries and where it lives: from an absolute scalar judge, to a relative policy discriminator, to feedback the policy generates about itself. If you want a doorway into the other end of that spectrum, reasoning-based reward models that think before scoring (Can reward models benefit from reasoning before scoring?) show the judgment side getting richer just as the discriminator side gets leaner.
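Two of these self-generated signals are simple enough to sketch. The code below is a minimal illustration under stated assumptions, not the cited methods' implementations: `majority_vote_rewards` assumes answers are already extracted as comparable strings, and `belief_shift_rewards` assumes some external estimate of the agent's probability of the correct solution after each turn. Both function names and the sample values are hypothetical.

```python
from collections import Counter
from math import log


def majority_vote_rewards(answers: list[str]) -> tuple[str, list[float]]:
    """Use the most common sampled answer as a pseudo-label and reward agreement with it."""
    # Ties resolve to the answer that appeared first among the tied ones.
    majority, _ = Counter(answers).most_common(1)[0]
    return majority, [1.0 if a == majority else 0.0 for a in answers]


def belief_shift_rewards(beliefs: list[float]) -> list[float]:
    """Per-turn credit as the log-ratio of consecutive belief estimates (each must be > 0)."""
    return [log(after / before) for before, after in zip(beliefs, beliefs[1:])]


# Four sampled answers to one unlabeled question (made-up values).
label, rewards = majority_vote_rewards(["42", "41", "42", "7"])
print(label, rewards)

# Belief in the correct solution after each of three turns (made-up values).
# A turn that raises the belief earns positive credit, one that lowers it earns negative.
print(belief_shift_rewards([0.1, 0.4, 0.2, 0.8]))
```

The per-turn log-ratios telescope: their sum equals the log-ratio of the final belief to the initial one, so the dense signal redistributes the overall progress across turns rather than adding to it.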
Sources (10 notes)
POLAR reframes reward modeling as policy discrimination: RMs assign higher scores to policies similar to a chosen target, eliminating absolute preference labels. Pre-trained 1.8B-7B parameter POLAR RMs substantially outperform non-pre-trained methods and transfer across task formulations.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
SDPO converts tokenized environment feedback into dense gradient signals by using the feedback-conditioned policy as a self-teacher. The policy, when given retrospective evidence of its mistakes in-context, implicitly acts as its own process reward model, making external reward signals unnecessary.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
ΔBelief-RL uses log-ratios of sequential probability estimates to assign per-turn credit without critic networks or process reward models. Tested on 20 Questions, smaller models trained this way matched or exceeded prior SOTA and larger baselines while generalizing beyond training.
Test-Time RL generates reward signals by majority voting across repeated samples, enabling policy improvement without ground-truth labels or trained reward models. This approach works surprisingly well because consensus answers tend to be correct, creating a bootstrapping loop where test-time compute enables training that improves the model.
Research across eight environments shows that agents can use future states from their own actions as supervision without external rewards, matching expert-dependent baselines with half the data and providing superior warm-starts for subsequent RL training.
Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.