SYNTHESIS NOTE

Can language models replace reward models with internal signals?

Recent RL research shows three independent patterns—self-judgment, belief-shift, and rich feedback—that each eliminate a component of the traditional RLHF stack. Are these patterns converging on a fundamentally different architecture for training without external verifiers?

Synthesis note · 2026-05-18 · sourced from Reinforcement Learning

The RLHF-RLVR stack rests on three load-bearing components: a reward signal (preference labels for RLHF, verifiers for RLVR), a reward model trained on that signal, and a policy optimizer (PPO, GRPO) that consumes the RM's output. Each component has scaling problems. Preference labels are expensive and culturally biased. Verifiers exist only for domains whose outputs can be checked automatically, such as math and code. Reward models suffer from prompt-context blindness, reward hacking, and generalization failure.

Late-2025 RL papers are independently converging on three substitutable patterns that each replace one component without touching the others. Together they suggest the reward-model-as-separate-module is no longer load-bearing — it can be replaced by mechanisms internal to the policy itself.

Pattern one: pairwise self-judgment. Can models learn to judge themselves without external rewards? The model plays Actor and Judge alternately. Copeland-style ranking of self-generated responses produces the training signal for the Actor; self-consistency on those rankings produces the signal for the Judge. The two channels co-evolve with no external supervision. Replaces: the reward model.
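A minimal sketch of the Copeland-style ranking step in pattern one, assuming the common definition of a Copeland score as pairwise wins minus pairwise losses. The function `copeland_scores` and the `prefers` callback are hypothetical names for illustration; in the pattern described above the pairwise call would be the model acting as Judge, and the underlying paper may score or normalize differently.

```python
from itertools import combinations

def copeland_scores(n, prefers):
    """Return one Copeland score per candidate: pairwise wins minus losses.

    `prefers(i, j)` is a hypothetical judge call that returns True when
    response i is preferred over response j.
    """
    scores = [0] * n
    for i, j in combinations(range(n), 2):
        if prefers(i, j):
            scores[i] += 1
            scores[j] -= 1
        else:
            scores[j] += 1
            scores[i] -= 1
    return scores

# Toy stand-in for the model-as-Judge: prefer the longer response.
responses = ["ok", "a fuller answer", "mid one"]
judge = lambda i, j: len(responses[i]) > len(responses[j])

scores = copeland_scores(len(responses), judge)
print(scores)  # one score per response, usable as an Actor training signal
```

The scores only order the self-generated responses relative to each other, which is why no absolute reward model is needed for this step.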

Pattern two: internal belief-shift. Can an agent's own beliefs guide credit assignment without critics? The change in the probability the agent itself assigns to the target solution serves as the dense intrinsic reward. The log-ratio of successive beliefs is computed from a single forward pass. Replaces: the critic / PRM.
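The belief-shift reward in pattern two can be sketched as follows, assuming the per-step reward is the log-ratio of the probability assigned to the target solution after and before each step. The function name `belief_shift_rewards`, the `eps` smoothing term, and the toy belief values are illustrative assumptions, not taken from the source paper.

```python
import math

def belief_shift_rewards(beliefs, eps=1e-12):
    """Per-step reward r_t = log(b_t / b_{t-1}) for beliefs b_0..b_T.

    `beliefs[t]` is the probability the agent assigns to the target
    solution after step t; `eps` (an assumption) guards against log(0).
    """
    return [
        math.log((after + eps) / (before + eps))
        for before, after in zip(beliefs, beliefs[1:])
    ]

# Toy trajectory: belief rises, dips after an unhelpful step, then jumps.
beliefs = [0.1, 0.2, 0.1, 0.8]
rewards = belief_shift_rewards(beliefs)
total = sum(rewards)
```

Because the log-ratios telescope, the per-step rewards sum to log(b_T / b_0), so the dense signal redistributes credit across steps without changing the total for the trajectory.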

Pattern three: rich-feedback self-distillation. Can environment feedback replace scalar rewards in policy learning? Environment feedback (runtime errors, judge text, compile traces) becomes the supervision. The current policy conditioned on feedback serves as the self-teacher. The feedback-informed next-token distribution is then distilled back into the policy. Replaces: the explicit reward signal.
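A toy sketch of the distillation objective in pattern three, assuming a mean per-token KL(teacher || student) loss where the teacher is the policy's next-token distribution given prompt plus feedback and the student is the same policy given the prompt alone. The KL direction, the function names, and the hand-written distributions are assumptions for illustration; a real setup would take both distributions from model logits and treat the teacher as fixed.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def distillation_loss(teacher_dists, student_dists):
    """Mean per-token KL(teacher || student) across sequence positions."""
    kls = [kl_divergence(t, s) for t, s in zip(teacher_dists, student_dists)]
    return sum(kls) / len(kls)

# Toy next-token distributions over a 3-token vocabulary at 2 positions.
# teacher: policy conditioned on prompt + environment feedback
# student: same policy conditioned on the prompt only
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
student = [[0.4, 0.4, 0.2], [0.1, 0.8, 0.1]]

loss = distillation_loss(teacher, student)
```

The loss is nonzero only at positions where the feedback changes what the policy would predict, so the supervision concentrates on the tokens the feedback actually bears on, which a single scalar reward cannot do.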

Each pattern can in principle compose with the others. SERL (pattern one) + ΔBelief (pattern two) gives you self-judgment and a dense intrinsic signal. SDPO (pattern three) + SERL gives you rich feedback and self-evaluation. The substrate is the same (the language model with appropriate in-context conditioning), and each mechanism plays a role that the others cannot.

The structural claim: RL is being decomposed into substitutable parts. Pretraining + verifier was one architecture. Pretraining + intrinsic signal is another. Pretraining + self-judgment is a third. None of these requires the reward-model-as-trained-classifier component that defined classical RLHF. The reward model was load-bearing for absolute-preference RLHF; for the verifier-free patterns, it is replaced by mechanisms that emerge from the policy's own computations.

The writing angle worth tracking: if the reward model goes away, what changes about alignment? RLHF was inseparable from a specific architectural commitment — train a reward model to encode human preferences, then optimize against it. Verifier-free RL leaves the preference-encoding question open. Where does alignment come from when the RM is not the locus? Some answers: rich feedback (the environment carries it), self-judgment (the model encodes it), community feedback (citations encode it). The substitutability of mechanisms is also a fragmentation of where alignment lives.

A second angle worth tracking: this is what learning without supervision looks like when the model is already capable enough to retrospect, judge, and assess its own beliefs. Each pattern leverages an in-context capability of the model. The patterns work because the model is good enough at the relevant in-context task to bootstrap supervision from itself.

Related concepts in this collection 6

Can models learn to judge themselves without external rewards? Can a language model train itself by alternating between generating responses and evaluating them using only internal consistency signals? This explores whether evaluation itself can become a learnable skill without external supervision.
pattern one
Can an agent's own beliefs guide credit assignment without critics? Explore whether an agent's shifting probability estimates toward the correct answer could serve as a self-contained reward signal for long-horizon reinforcement learning, eliminating the need for separate process reward models or external verifiers.
pattern two
Can environment feedback replace scalar rewards in policy learning? Can rich tokenized feedback from environments serve as a direct learning signal for policies, without relying on compressed scalar rewards? This matters because scalar rewards discard information needed for credit assignment.
pattern three
Can reward models learn by comparing policies instead of judging them? What if reward models worked as policy discriminators—measuring distance to a target rather than encoding absolute preferences? Could this eliminate the need for manual preference labels and scale across domains?
fourth substitutable pattern (similarity-to-target) that fits the same decomposition
Can adversarial critics replace task-specific verifiers for reasoning? Explores whether an adversarial game between policy and critic can substitute for explicit verifiers in RL-based reasoning training. Matters because many domains lack the task-specific validators that make current reasoning RL possible.
fifth pattern (adversarial IRL against demonstrations); together the five paths form a substitution table
Can models learn what makes research worth doing? Can large language models be trained to recognize high-impact research directions by learning from citation patterns? This explores whether 'scientific taste'—the judgment of what work matters—is a learnable skill separate from execution.
sixth pattern (community signal as reward); demonstrates the substitution principle generalizes to non-individual feedback
