How do pairwise self-judgment and internal belief-shift replace verification differently?
This explores a finding from late-2025 RL research: that two of the moving parts in the RLHF training stack — the reward model and the critic — can each be replaced by a different signal the model already computes internally, and that those two replacements do fundamentally different jobs.
Two recent techniques quietly remove different pieces of the verification machinery in reinforcement learning. The key is that they are not competing solutions to the same problem; they fill two different holes. The cleanest map comes from a synthesis showing that verifier-free RL is converging on three substitutable patterns, each of which swaps out a distinct RLHF component (see: Can language models replace reward models with internal signals?). The third pattern, rich-feedback self-distillation, replaces explicit reward signals and sits outside this comparison. Pairwise self-judgment takes the place of the reward model: instead of training a separate classifier to score outputs, you let the policy itself compare two of its own answers and say which is better. Internal belief-shift takes the place of the critic, which is a different job entirely: assigning credit to individual steps rather than scoring the finished answer.
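As a rough illustration of how pairwise comparisons could stand in for a reward model, the sketch below turns comparisons among a policy's own samples into a per-sample reward (its win rate). Everything here is an assumption made for illustration: the `judge` callable is a hypothetical stand-in for prompting the policy to compare two answers, and `toy_judge` only prefers the longer string so the example runs without a model. None of these names come from the cited work.

```python
from itertools import combinations
from typing import Callable, List


def pairwise_win_rates(
    candidates: List[str],
    judge: Callable[[str, str], bool],
) -> List[float]:
    """Return each candidate's win rate over all pairwise comparisons.

    judge(a, b) is assumed to return True when `a` is preferred to `b`.
    In a real setup this would be the policy answering a
    "which answer is better?" prompt; here it is a hypothetical callable.
    """
    n = len(candidates)
    if n < 2:
        raise ValueError("need at least two candidates to compare")
    wins = [0] * n
    for i, j in combinations(range(n), 2):
        if judge(candidates[i], candidates[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    # Each candidate takes part in exactly n - 1 comparisons.
    return [w / (n - 1) for w in wins]


def toy_judge(a: str, b: str) -> bool:
    """Toy stand-in: prefer the longer answer (NOT a real quality signal)."""
    return len(a) > len(b)


rewards = pairwise_win_rates(["4", "four", "2 + 2 = 4"], toy_judge)
print(rewards)
```

The point of the design is that the reward is a single number per finished answer, derived only from comparisons, with no separately trained scorer in the loop.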
To see why they're different, it helps to know what each replaced part actually did. A reward model answers 'how good is this finished answer?', which is an outcome verdict. A critic answers 'how much did this particular step help?', which is credit assignment along the way. Pairwise self-judgment is an outcome verdict the model issues about its own work. The same instinct sits behind post-completion learning, where a model is trained to evaluate its own output in the unused sequence space after it finishes, internalizing the reward function rather than calling an external one (see: Can models learn to evaluate their own work during training?). Belief-shift works on a different axis: ΔBelief-RL watches how the model's own confidence in the target solution moves turn by turn, and uses the log-ratio of successive probability estimates as a dense, per-turn reward, with no critic network and no process reward model (see: Can an agent's own beliefs guide credit assignment without critics?). One produces a single judgment about a whole answer; the other produces a continuous gradient of 'am I getting warmer' signals throughout the reasoning.
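A minimal sketch of the belief-shift idea, assuming the per-turn reward is simply the log-ratio of consecutive probability estimates, log(p_t / p_{t-1}). That exact form, the function name, and the example numbers are assumptions for illustration; the source only says the reward is built from log-ratios of sequential probability estimates.

```python
import math
from typing import List


def belief_shift_rewards(target_probs: List[float]) -> List[float]:
    """Per-turn rewards from consecutive probability estimates.

    target_probs[t] is assumed to be the model's probability of the
    target solution after turn t (index 0 = before any turn).
    The reward for turn t is log(p_t / p_{t-1}): positive when the
    belief in the target rose, negative when it fell.
    """
    if len(target_probs) < 2:
        raise ValueError("need at least two probability estimates")
    if any(p <= 0.0 or p > 1.0 for p in target_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    return [
        math.log(curr) - math.log(prev)
        for prev, curr in zip(target_probs, target_probs[1:])
    ]


# Illustrative (made-up) belief trajectory over three turns.
beliefs = [0.05, 0.10, 0.08, 0.40]
rewards = belief_shift_rewards(beliefs)
total = sum(rewards)
print(rewards, total)
```

Under this assumed form the per-turn rewards telescope: their sum equals log(p_final / p_initial), so the dense signal redistributes the overall gain in belief across the turns that produced it.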
That 'getting warmer' signal has a structural cousin worth knowing about. The deep-thinking ratio (DTR) measures genuine reasoning by tracking how much a token's prediction gets revised as it passes through the model's layers, and that revision rate correlates with accuracy (see: Can we measure how deeply a model actually reasons?). Belief-shift and deep-thinking ratio share a premise: the model's internal state is already moving in informative ways as it computes, and you can read reward or effort straight off that motion instead of bolting on an external scorer. Belief-shift reads it across turns of a dialogue; deep-thinking ratio reads it across layers of the network. Same bet, different axis.
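To make the "revision across layers" idea concrete, here is a toy proxy, not the published DTR definition. It assumes you already have, for each token, the top-1 predicted token id read out at every layer (how that readout is obtained is itself an assumption), and it counts a token as late-settling when its prediction only locks onto the final answer in the deeper part of the stack. The function names and the `depth_frac` threshold are hypothetical.

```python
from typing import List, Sequence


def settle_layer(layer_preds: Sequence[int]) -> int:
    """Earliest layer index from which the prediction equals the
    final-layer prediction and never changes again."""
    final = layer_preds[-1]
    idx = len(layer_preds) - 1
    while idx > 0 and layer_preds[idx - 1] == final:
        idx -= 1
    return idx


def late_settling_ratio(
    per_token_layer_preds: List[Sequence[int]],
    depth_frac: float = 0.5,
) -> float:
    """Fraction of tokens whose prediction settles at or beyond
    `depth_frac` of the network's depth (a toy revision measure)."""
    if not per_token_layer_preds:
        raise ValueError("need at least one token")
    late = 0
    for preds in per_token_layer_preds:
        if len(preds) < 2:
            raise ValueError("need predictions from at least two layers")
        relative_depth = settle_layer(preds) / (len(preds) - 1)
        if relative_depth >= depth_frac:
            late += 1
    return late / len(per_token_layer_preds)


# Made-up per-layer top-1 ids for four tokens in a 5-layer model.
example = [
    [7, 7, 7, 7, 7],  # settled from the first layer
    [3, 9, 9, 2, 2],  # settles at layer index 3
    [1, 5, 5, 5, 5],  # settles at layer index 1
    [4, 4, 8, 4, 6],  # settles only at the last layer
]
ratio = late_settling_ratio(example)
print(ratio)
```

The parallel with belief-shift is structural: both read a signal off how an internal estimate moves, here across layers rather than across turns.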
The deeper reason these two replacements feel different is that self-judgment and belief-tracking may not even be the same kind of internal act. Work on self-recognition shows that a model's explicit verbal report about its own output routes through a mechanistically separate channel from its implicit recognition of that output (see: Do explicit and implicit self-recognition use the same mechanism?). By analogy, pairwise self-judgment leans on the explicit, verdict-issuing channel, while belief-shift leans on the implicit, confidence-tracking one. There is also a standing caution worth carrying into both: a model's verbalized self-assessments often echo training-data patterns rather than genuine introspection, with reliable introspection appearing only when a real causal chain links the internal state to the report (see: Can language models actually introspect about their own states?). That is quietly an argument in belief-shift's favor: a probability estimate is causally wired to the model's actual computation in a way a free-text 'this answer is better' verdict may not be.
So the short version: pairwise self-judgment replaces verification by having the model grade outcomes, and internal belief-shift replaces it by having the model's own changing confidence assign credit step by step. The first is a verdict; the second is a gradient. Knowing that distinction tells you something non-obvious — that 'getting rid of the verifier' isn't one move but several, depending on which job the verifier was doing.
Sources (6 notes)
Late-2025 RL literature independently converges on three patterns that replace different RLHF components: pairwise self-judgment replaces the reward model, internal belief-shift replaces the critic, and rich-feedback self-distillation replaces explicit reward signals. Each emerges from the policy's own computations, making the trained reward classifier optional.
Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.
ΔBelief-RL uses log-ratios of sequential probability estimates to assign per-turn credit without critic networks or process reward models. Tested on 20 Questions, smaller models trained this way matched or exceeded prior SOTA and larger baselines while generalizing beyond training.
Deep-thinking ratio (DTR) measures the proportion of tokens whose predictions undergo significant revision across model layers, correlating robustly with accuracy across AIME, HMMT, and GPQA benchmarks. Think@n, a test-time strategy using DTR, matches self-consistency performance while reducing inference costs.
Models can implicitly recognize their own outputs via entropy collapse and explicitly report authorship when asked, but these abilities do not share a mechanistic substrate. The two channels are neurally independent.
LLM self-reports usually reflect human training distributions rather than actual internal processes. However, when a causal chain connects an internal state to accurate reporting—like inferring low temperature from output consistency—genuine lightweight introspection occurs without requiring consciousness.