What makes reward models fundamentally different from policy discriminators?

This explores the POLAR idea that reward models can be rebuilt as policy discriminators — scoring how close an output is to a target policy rather than judging it against fixed preference labels — and what that reframing reveals about what reward signals actually are.

Are reward models and policy discriminators really two different things, or is the usual reward model just one narrow way of doing policy discrimination? The corpus leans hard toward the second answer. POLAR reframes reward modeling as measuring distance from a target policy: instead of learning absolute 'good vs. bad' preference labels, it scores an output higher the more it resembles a chosen reference policy, and pre-trained discriminators of this kind transfer across tasks far better than label-trained ones (see: Can reward models learn by comparing policies instead of judging them?). So the 'fundamental difference' partly dissolves: a reward model is a discriminator that happens to have absolute judgment baked in, and stripping that out is what buys generalization.
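The 'distance from a target policy' idea can be sketched in a few lines, under loudly stated assumptions. `toy_score` is a hypothetical stand-in (token-overlap similarity) for a learned discriminator, and the Bradley-Terry style pairwise loss is assumed here for illustration rather than taken from the notes. The point is only that the score is defined relative to a reference sample from the target policy, with no absolute good/bad label anywhere.

```python
import math


def toy_score(reference: str, candidate: str) -> float:
    """Hypothetical stand-in for a learned discriminator.

    Returns the Jaccard overlap of the two token sets, so the score
    depends on the reference and not on any absolute quality label.
    """
    ref = set(reference.lower().split())
    cand = set(candidate.lower().split())
    if not ref and not cand:
        return 1.0
    return len(ref & cand) / len(ref | cand)


def pairwise_loss(reference: str, same_policy: str, other_policy: str) -> float:
    """Assumed Bradley-Terry style loss.

    Low when the sample drawn from the same policy as the reference
    outscores the sample drawn from a different policy.
    """
    margin = toy_score(reference, same_policy) - toy_score(reference, other_policy)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


# Made-up strings standing in for policy outputs.
reference = "the answer is 42 because six times seven equals 42"
near = "six times seven equals 42 so the answer is 42"
far = "i am not sure but maybe it is around fifty"

score_near = toy_score(reference, near)
score_far = toy_score(reference, far)
loss_good = pairwise_loss(reference, near, far)  # ranking agrees with the policies
loss_bad = pairwise_loss(reference, far, near)   # ranking is inverted
```

Swapping in a different `reference` changes which candidate scores higher, which is the sense in which the signal is relative rather than absolute.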

What makes the discriminator framing interesting is how well it lines up with a broader corpus finding: reward signals carry less new information than their scalar form suggests. Studies of reinforcement learning with verifiable rewards (RLVR) show that reward-driven training mostly activates strategies already latent in the base model rather than teaching genuinely new reasoning: a single training example can trigger the effect, and even spurious rewards work nearly as well as correct ones for a well-pretrained model (see: What does reward learning actually do to model reasoning?). Pass@k analysis sharpens this: base models overtake RLVR-trained models at high k, so RLVR narrows sampling toward solutions the base model could already reach, while genuinely new capability comes from distillation, not reward (see: Does RLVR actually expand what models can reason about?). If reward is largely repositioning the policy inside its own distribution, then 'distance from a target policy' is arguably a more honest description of the job than 'judge of quality.'
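The pass@k argument is easier to follow with the arithmetic written out. The sketch below uses the standard unbiased estimator, pass@k = 1 - C(n-c, k) / C(n, k) for n samples of which c are correct; the notes do not say which estimator the cited studies used, so that choice is an assumption. The per-problem counts are made-up illustrative numbers, not results from any paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct ones."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


N = 100  # samples drawn per problem (hypothetical)

# Hypothetical correct-sample counts for four problems.
# "Base" reaches every problem, but rarely; "tuned" is reliable on two
# problems and never solves the other two.
base_counts = [5, 5, 5, 5]
tuned_counts = [60, 60, 0, 0]

results = {
    k: (mean_pass_at_k(base_counts, N, k), mean_pass_at_k(tuned_counts, N, k))
    for k in (1, 8, 64)
}
for k, (base, tuned) in results.items():
    print(f"k={k:>2}  base={base:.3f}  tuned={tuned:.3f}")
```

With these toy counts the tuned model wins at k=1 and the base model wins at k=64, which is the crossover pattern the pass@k analysis describes: sharper sampling, but no growth in the set of problems that can be solved at all.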

The limits of the scalar reward are where the distinction reappears, though. One line of work shows that a single number throws away information a discriminator could keep: natural feedback splits into an evaluative signal (how well an action did) and a directive signal (how it should change), and scalars capture the first while discarding the second (see: Can scalar rewards capture all the information in agent feedback?). A related approach, SDPO, turns tokenized environment feedback into dense per-token credit by conditioning the policy on that feedback, so the policy acts as its own process reward model and an external scalar reward becomes unnecessary (see: Can environment feedback replace scalar rewards in policy learning?). Binary correctness rewards can even degrade the model: because they do not penalize confident wrong answers, they incentivize high-confidence guessing and wreck calibration unless a proper scoring rule such as the Brier score is added as a second reward term (see: Does binary reward training hurt model calibration?). So the contrast isn't reward-model-vs-discriminator so much as scalar-judgment-vs-richer-comparison.
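The calibration point can be checked with a small expected-value calculation. This is a sketch under an assumed reward form: correctness minus the squared gap between stated confidence and correctness (a Brier-style penalty). The notes only say that the Brier score is added as a second reward term, so the exact combination and the names below are assumptions. Here `p` is the true probability that an answer is correct and `q` is the confidence the model states.

```python
def expected_binary_reward(p: float, q: float) -> float:
    """Expected reward when only correctness is rewarded.

    The stated confidence q does not appear in the formula, so
    overclaiming costs nothing.
    """
    return p


def expected_brier_reward(p: float, q: float) -> float:
    """Expected reward for the assumed form: correct - (q - correct)**2.

    With probability p the answer is correct (penalty (1 - q)**2);
    otherwise it is wrong (penalty q**2).
    """
    expected_penalty = p * (1.0 - q) ** 2 + (1.0 - p) * q ** 2
    return p - expected_penalty


def best_confidence(p: float, steps: int = 100) -> float:
    """Grid-search the stated confidence that maximizes the Brier-style reward."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: expected_brier_reward(p, q))


p_true = 0.3  # hypothetical: the model is right 30% of the time

honest_binary = expected_binary_reward(p_true, q=0.3)
bluff_binary = expected_binary_reward(p_true, q=1.0)
honest_brier = expected_brier_reward(p_true, q=0.3)
bluff_brier = expected_brier_reward(p_true, q=1.0)
best_q = best_confidence(p_true)
```

Under the binary reward, bluffing at full confidence earns the same expected reward as honest reporting. Under the Brier-style reward, the expected reward peaks when the stated confidence matches the true accuracy, which follows from setting the derivative 2p - 2q to zero.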

The most surprising adjacent thread is that the reward signal can come from the policy itself rather than from any separate model. An agent's own shifting beliefs toward a solution provide dense intrinsic reward with no critic or reward model at all (see: Can an agent's own beliefs guide credit assignment without critics?); majority voting across samples on unlabeled data manufactures a usable reward with no trained evaluator (see: Can models improve themselves using only majority voting?); and agents can treat the consequences of their own actions as supervision, sidestepping external reward entirely (see: Can agents learn from their own actions without external rewards?). Read together, the corpus suggests the real distinction isn't between two kinds of model but along a spectrum of how much structure the signal carries and where it lives: from an absolute scalar judge, to a relative policy discriminator, to feedback the policy generates about itself. If you want a doorway into the other end of that spectrum, reasoning-based reward models that think before scoring (see: Can reward models benefit from reasoning before scoring?) show the judgment side getting richer just as the discriminator side gets leaner.
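Two of the self-generated signals above are simple enough to sketch directly. The function names, the tie-breaking rule, and the example numbers are all hypothetical; the notes describe the belief-shift signal only as "log-ratios of sequential probability estimates" and the voting signal only as majority agreement across repeated samples, so treat this as one plausible reading rather than either method's actual implementation.

```python
import math
from collections import Counter


def belief_shift_rewards(target_probs: list[float]) -> list[float]:
    """Per-turn credit as log(p_t / p_{t-1}).

    target_probs[t] is the agent's own probability estimate for the
    solution after turn t. No critic or reward model is involved.
    """
    if any(p <= 0.0 or p > 1.0 for p in target_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    return [
        math.log(curr / prev)
        for prev, curr in zip(target_probs, target_probs[1:])
    ]


def majority_vote_rewards(answers: list[str]) -> list[float]:
    """Reward 1.0 for samples that match the most common answer, else 0.0.

    Ties go to the answer seen first (Counter.most_common keeps
    first-seen order among equal counts); that choice is an assumption.
    """
    if not answers:
        return []
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if answer == majority else 0.0 for answer in answers]


# Hypothetical belief trajectory: turn 2 is a misstep, turn 3 is decisive.
probs = [0.1, 0.2, 0.1, 0.8]
turn_rewards = belief_shift_rewards(probs)

# Hypothetical sampled answers to one unlabeled question.
samples = ["12", "12", "15", "12", "9"]
vote_rewards = majority_vote_rewards(samples)
```

The per-turn rewards telescope: their sum equals the log-ratio of the final belief to the initial belief, so each turn's credit is a share of the total belief gain. The voting reward needs no labels, but it is only as good as the assumption that the consensus answer tends to be correct.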

Sources (10 notes)

Can reward models learn by comparing policies instead of judging them?

POLAR reframes reward modeling as policy discrimination: RMs assign higher scores to policies similar to a chosen target, eliminating absolute preference labels. Pre-trained 1.8B-7B parameter POLAR RMs substantially outperform non-pre-trained methods and transfer across task formulations.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Can environment feedback replace scalar rewards in policy learning?

SDPO converts tokenized environment feedback into dense gradient signals by using the feedback-conditioned policy as a self-teacher. The policy, when given retrospective evidence of its mistakes in-context, implicitly acts as its own process reward model, making external reward signals unnecessary.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Can an agent's own beliefs guide credit assignment without critics?

ΔBelief-RL uses log-ratios of sequential probability estimates to assign per-turn credit without critic networks or process reward models. Tested on 20 Questions, smaller models trained this way matched or exceeded prior SOTA and larger baselines while generalizing beyond training.

Can models improve themselves using only majority voting?

Test-Time RL generates reward signals by majority voting across repeated samples, enabling policy improvement without ground-truth labels or trained reward models. This approach works surprisingly well because consensus answers tend to be correct, creating a bootstrapping loop where test-time compute enables training that improves the model.

Can agents learn from their own actions without external rewards?

Research across eight environments shows that agents can use future states from their own actions as supervision without external rewards, matching expert-dependent baselines with half the data and providing superior warm-starts for subsequent RL training.

Can reward models benefit from reasoning before scoring?

Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-examining whether reward models and policy discriminators are truly distinct entities, given a curated library of LLM/RL papers (2024–2026). The question remains open: does the distinction hold under newer capability regimes?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; treat these as perishable constraints to re-test:

• Reward models are best reframed as policy discriminators measuring distance from a target policy rather than absolute judges; pre-trained discriminators transfer across tasks far better than label-trained ones (2025-07, arXiv:2507.05197).
• Reward-driven training activates strategies already latent in the base model rather than teaching new reasoning; RLVR largely repositions policy within its own distribution, not beyond it (2025-04, arXiv:2504.13837).
• Scalar rewards discard directive signal; richer tokenized feedback enables per-token credit assignment, making external scalar reward unnecessary (2025-06 era).
• Binary correctness rewards actively degrade calibration; proper scoring rules are required to prevent confident wrong answers (2024-09, arXiv:2409.15360).
• Reward signals can originate from the policy itself (belief-shift, majority-vote on unlabeled data, self-distillation) rather than external models (2025–2026 cluster).

Anchor papers (verify; mind their dates):
- arXiv:2507.05197 (2025-07): Pre-Trained Policy Discriminators are General Reward Models
- arXiv:2504.13837 (2025-04): Does RLVR Incentivize Reasoning Beyond Base Model
- arXiv:2505.14674 (2025-05): Reward Reasoning Model
- arXiv:2510.08558 (2025-10): Agent Learning via Early Experience

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above, probe whether newer models (o1-scale reasoning, multimodal agents), training methods (reinforced self-training, process rewards at scale), tooling (reward caching, ensemble harnesses), or recent evals have relaxed or overturned it. Separate the durable question—"Is the reward model / policy discriminator boundary real or is it a false dichotomy?"—from perishable limitations (e.g., scalar vs. rich signal). Where does the constraint still hold; where has it dissolved?

(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Does any recent paper restore a fundamental distinction, or push the synthesis further toward self-generated reward?

(3) Propose 2 research questions that ASSUME the regime may have shifted: e.g., "If policy discriminators have become the default, what makes a reward *model* preferable in any regime?" or "Can self-reward systems match external rewards at scale, or is there an irreducible role for an out-of-distribution critic?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
