Can language models replace reward models with internal signals?
Recent RL research shows three independent patterns—self-judgment, belief-shift, and rich feedback—that each eliminate a component of the traditional RLHF stack. Are these patterns converging on a fundamentally different architecture for training without external verifiers?
The RLHF-RLVR stack rests on three load-bearing components: a reward signal (preference labels for RLHF, verifiers for RLVR), a reward model (RM) trained on that signal, and a policy optimizer (PPO, GRPO) that consumes the RM's output. Each component has scaling problems. Preference labels are expensive and culturally biased. Verifiers exist only for domains where correctness can be checked automatically. Reward models suffer from prompt-context blindness, reward hacking, and generalization failure.
Late-2025 RL papers are independently converging on three substitutable patterns that each replace one component without touching the others. Together they suggest the reward-model-as-separate-module is no longer load-bearing — it can be replaced by mechanisms internal to the policy itself.
Pattern one: pairwise self-judgment (see: Can models learn to judge themselves without external rewards?). The model alternates between the Actor and Judge roles. A Copeland-style ranking of its self-generated responses produces the training signal for the Actor; self-consistency on those rankings produces the signal for the Judge. The two channels co-evolve with no external supervision. Replaces: the reward model.
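The Copeland-style ranking in pattern one can be illustrated with a minimal sketch. It assumes one common convention for the Copeland score (pairwise wins minus pairwise losses). The names `copeland_scores` and `toy_judge` are hypothetical, and the length-based judge is only a stand-in for the model acting as Judge; none of this is taken from the underlying paper.

```python
from itertools import combinations
from typing import Callable, List, Sequence


def copeland_scores(
    responses: Sequence[str],
    judge: Callable[[str, str], int],
) -> List[int]:
    """Score each response by pairwise wins minus pairwise losses.

    judge(a, b) must return +1 if a is preferred, -1 if b is preferred,
    and 0 for a tie. In the self-judgment setting the judge would be the
    model itself; here it is passed in as a plain function (assumption).
    """
    scores = [0] * len(responses)
    for i, j in combinations(range(len(responses)), 2):
        verdict = judge(responses[i], responses[j])
        scores[i] += verdict
        scores[j] -= verdict
    return scores


def toy_judge(a: str, b: str) -> int:
    """Hypothetical stand-in judge: prefers the longer response."""
    return (len(a) > len(b)) - (len(a) < len(b))


responses = ["ok", "a fuller answer", "medium one"]
scores = copeland_scores(responses, toy_judge)
# Highest-scoring response would be reinforced for the Actor.
best = responses[max(range(len(responses)), key=scores.__getitem__)]
```

Because the score depends only on relative comparisons among the model's own samples, no absolute reward scale is needed, which is what lets this signal stand in for a trained reward model.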
Pattern two: internal belief-shift (see: Can an agent's own beliefs guide credit assignment without critics?). The dense intrinsic reward is the change in the probability the agent itself assigns to the target solution. It is measured as the log-ratio of sequential beliefs and computed from a single forward pass. Replaces: the critic / process reward model (PRM).
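The log-ratio of sequential beliefs in pattern two can be sketched as follows. This is an illustration under stated assumptions: the per-step probabilities of the target solution are supplied as plain numbers (how they are read off the model is not shown), and the function name `belief_shift_rewards` is hypothetical.

```python
import math
from typing import List, Sequence


def belief_shift_rewards(target_probs: Sequence[float]) -> List[float]:
    """Per-step reward = log(p_t / p_{t-1}).

    target_probs[t] is the probability the agent assigns to the target
    solution after step t (assumed to be given, and strictly positive).
    """
    return [
        math.log(curr / prev)
        for prev, curr in zip(target_probs, target_probs[1:])
    ]


# Hypothetical belief trajectory over three interaction steps.
probs = [0.1, 0.2, 0.1, 0.8]
rewards = belief_shift_rewards(probs)

# Steps that raise belief in the target get positive reward,
# steps that lower it get negative reward.
total = sum(rewards)  # telescopes to log(probs[-1] / probs[0])
```

The rewards telescope: their sum equals the log-ratio of the final belief to the initial belief, so the dense per-step signal redistributes the overall belief gain across steps rather than adding reward of its own.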
Pattern three: rich-feedback self-distillation (see: Can environment feedback replace scalar rewards in policy learning?). Environment feedback (runtime errors, judge text, compile traces) becomes the supervision. The current policy, conditioned on that feedback, serves as the self-teacher, and its feedback-informed next-token distribution is distilled back into the policy. Replaces: the explicit reward signal.
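The distillation step in pattern three can be sketched as a per-token divergence between two next-token distributions: one from the policy conditioned on feedback (the self-teacher) and one from the policy without it (the student). The note does not specify the divergence or its direction; the sketch below assumes KL(teacher || student) averaged over positions, and the function names and toy distributions are hypothetical.

```python
import math
from typing import Sequence


def kl_divergence(teacher: Sequence[float], student: Sequence[float]) -> float:
    """KL(teacher || student) for one next-token distribution.

    Terms with teacher probability 0 contribute nothing; student
    probabilities are assumed strictly positive.
    """
    return sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)


def distillation_loss(
    teacher_dists: Sequence[Sequence[float]],
    student_dists: Sequence[Sequence[float]],
) -> float:
    """Mean per-position KL between teacher and student distributions."""
    per_token = [kl_divergence(t, s) for t, s in zip(teacher_dists, student_dists)]
    return sum(per_token) / len(per_token)


# Toy two-token vocabulary, two positions (assumed values).
# At position 1 the feedback-conditioned teacher shifts mass to token 1.
student = [[0.5, 0.5], [0.9, 0.1]]
teacher = [[0.5, 0.5], [0.2, 0.8]]
loss = distillation_loss(teacher, student)
```

The loss is zero wherever the feedback leaves the distribution unchanged and positive only at positions where it changes what the policy would emit, which gives a token-level signal that a single scalar reward does not carry.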
Each pattern can in principle compose with the others. SERL (pattern one) combined with ΔBelief (pattern two) gives self-judgment and a dense intrinsic signal. SDPO (pattern three) combined with SERL gives rich feedback and self-evaluation. The substrate is the same in every case: the language model with appropriate in-context conditioning. Each component plays a role the others cannot.
The structural claim: RL is being decomposed into substitutable parts. Pretraining + verifier was one architecture. Pretraining + intrinsic signal is another. Pretraining + self-judgment is a third. None of these requires the reward-model-as-trained-classifier component that defined classical RLHF. The reward model was load-bearing for absolute-preference RLHF; for the verifier-free patterns, it is replaced by mechanisms that emerge from the policy's own computations.
The writing angle worth tracking: if the reward model goes away, what changes about alignment? RLHF was inseparable from a specific architectural commitment — train a reward model to encode human preferences, then optimize against it. Verifier-free RL leaves the preference-encoding question open. Where does alignment come from when the RM is not the locus? Some answers: rich feedback (the environment carries it), self-judgment (the model encodes it), community feedback (citations encode it). The substitutability of mechanisms is also a fragmentation of where alignment lives.
A second angle worth tracking: this is what learning without supervision looks like when the model is already capable enough to retrospect, judge, and assess its own beliefs. Each pattern leverages an in-context capability of the model. The patterns work because the model is good enough at the relevant in-context task to bootstrap supervision from itself.
Inquiring lines that use this note as a source 34
This note is a source for these synthesized inquiries.
- How does RLHF labeler identity shape the values AI systems learn?
- How does RLHF training encode values into AI systems?
- Does RLHF training create models that sound convincing without being more accurate?
- Does self-conditioning improve belief-behavior alignment better than external priors?
- How does RLHF reward structure incentivize agreement over accuracy?
- How does intersubjective validation differ from pattern recognition in training data?
- How does RLHF training incentivize confident guessing over grounding acts?
- How does self-consistency compare to confidence as a proxy reward signal?
- What distinguishes verifiable rewards from preference-based rewards in unified training?
- How does RLHF training for helpfulness create systematic misinterpretation patterns?
- How does implicit feedback structure differ from explicit ratings mathematically?
- Can model confidence signals replace explicit external reward functions?
- Why do RLHF training methods penalize the proactive responses that save turns?
- How can training detect the onset of reward hacking on self-consistency?
- How does temporal anchoring maintain the learning signal in self-rewarding loops?
- What alternatives to RLHF better preserve truth-seeking in AI outputs?
- What separates bootstrapping gains from sustained self-improvement gains?
- When does outcome reward signal become informative during model training?
- How does belief-shift reward compare to curiosity-driven and process reward approaches?
- Can log-probability ratios resist reward hacking better than learned PRM signals?
- Can an agent's internal probabilities serve as value signals across domains?
- How does 93% reward reliability compare to other RL noise sources?
- What is the difference between changing model outputs versus changing internal representations?
- How does early branch divergence differ from late branch divergence in supervision signals?
- Can verifier-free RL work without manual preference labels or task-specific training?
- How do relational reward signals compare to absolute preference encodings in RL?
- Are different reward signal sources substitutable in verifier-free RL?
- What makes self-consistency a sufficient training target for the judge role?
- How do verifier-free RL patterns differ from traditional RLHF approaches?
- Does RL training redirect self-doubt into productive gap analysis?
- How do internal model mechanisms escape token-level reinforcement signals?
- How do pairwise self-judgment and internal belief-shift replace verification differently?
- What makes reward signal sources substitutable across verifier-free RL patterns?
- Can models trained with RL on pretraining data avoid reward hacking seen in RLHF?
Related concepts in this collection 6
- Can models learn to judge themselves without external rewards?
  Can a language model train itself by alternating between generating responses and evaluating them using only internal consistency signals? This explores whether evaluation itself can become a learnable skill without external supervision.
  pattern one
- Can an agent's own beliefs guide credit assignment without critics?
  Explore whether an agent's shifting probability estimates toward the correct answer could serve as a self-contained reward signal for long-horizon reinforcement learning, eliminating the need for separate process reward models or external verifiers.
  pattern two
- Can environment feedback replace scalar rewards in policy learning?
  Can rich tokenized feedback from environments serve as a direct learning signal for policies, without relying on compressed scalar rewards? This matters because scalar rewards discard information needed for credit assignment.
  pattern three
- Can reward models learn by comparing policies instead of judging them?
  What if reward models worked as policy discriminators, measuring distance to a target rather than encoding absolute preferences? Could this eliminate the need for manual preference labels and scale across domains?
  fourth substitutable pattern (similarity-to-target) that fits the same decomposition
- Can adversarial critics replace task-specific verifiers for reasoning?
  Explores whether an adversarial game between policy and critic can substitute for explicit verifiers in RL-based reasoning training. Matters because many domains lack the task-specific validators that make current reasoning RL possible.
  fifth pattern (adversarial IRL against demonstrations); together the five paths form a substitution table
- Can models learn what makes research worth doing?
  Can large language models be trained to recognize high-impact research directions by learning from citation patterns? This explores whether 'scientific taste' (the judgment of what work matters) is a learnable skill separate from execution.
  sixth pattern (community signal as reward); demonstrates the substitution principle generalizes to non-individual feedback
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Learning to Reason without External Rewards
- Intrinsic Credit Assignment for Long Horizon Interaction
- Reward Reasoning Model
- PretrainZero: Reinforcement Active Pretraining
- A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?
- The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
- Post-Training Large Language Models via Reinforcement Learning from Self-Feedback
- Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains
Original note title
verifier-free RL is converging on three substitutable patterns — pairwise self-judgment, internal belief-shift, and rich-feedback self-distillation