SYNTHESIS NOTE

Can model confidence work as a reward signal for reasoning?

Explores whether using a language model's own confidence scores as training rewards can simultaneously improve reasoning accuracy and restore calibration that standard RLHF damages.

Synthesis note · 2026-02-22 · sourced from Self Refinement Self Consistency Feedback

Reinforcement Learning from Self-Feedback (RLSF) exploits a simple observation: in a well-calibrated model, answer confidence correlates with reasoning quality. By using confidence as the reward signal rather than human preference or external verification, RLSF achieves two things simultaneously that normally trade off:

(i) Restores calibration — confidence becomes predictive of correctness again, after RLHF had degraded it. RLHF optimizes for human preference and fluency, which rewards confident-sounding outputs regardless of accuracy. RLSF reverses this by making the reward explicitly tied to calibrated confidence.

(ii) Strengthens step-by-step reasoning — higher-confidence answer spans tend to come from traces with more coherent reasoning chains. Training to maximize confidence indirectly selects for better reasoning.

The mechanism: a frozen LLM generates multiple CoT solutions for each problem. Confidence is computed per final-answer span. Traces are ranked by this confidence to create a synthetic preference dataset (higher confidence = chosen, lower = rejected). A reward model is trained on these preferences and used for standard RL finetuning.

The key insight is that confidence-as-reward can be inserted as an additional post-training step after standard SFT and RLHF — patching the calibration damage that RLHF introduces without undoing its alignment benefits. This requires no human labels, gold answers, or externally curated rewards.

The human learning parallel is explicit: humans use confidence as an intrinsic reward signal when external feedback is unavailable. Metacognitive monitoring — the ability to track your own certainty — is how humans regulate their own learning without a teacher.

The connection to Does binary reward training hurt model calibration? is complementary: that work adds calibration as an explicit second reward term; RLSF uses calibration itself as the primary reward. Both address the same RLHF-induced calibration degradation from different angles.

The risk is the same as Does self-consistency reliably reward correct answers during training? — confidence and self-consistency are correlated proxies, both vulnerable to the model becoming confidently wrong. But RLSF's emphasis on calibration (making confidence track accuracy) is explicitly designed to resist this — the model is rewarded for being accurately confident, not just confident.

Extensions to general domains via RLPR and INTUITOR: Two RLVR papers extend intrinsic reward signals beyond math to general domains. RLPR (RL from LLM Intrinsic Probability) computes the model's token-level probability of generating a reference answer, using this as reward signal — the model's own knowledge about what constitutes a correct answer replaces external verifiers. INTUITOR goes further: it uses self-certainty as the sole reward signal, computed as the confidence gap between the model's top-choice answer and alternatives. Both extend verifiable-reward RL to domains without rule-based verifiers (medicine, law, open-ended reasoning) — precisely the domains where external verification infrastructure is hardest to build. The convergence with RLSF is notable: all three use the model's internal probability landscape as reward, but RLSF targets calibration restoration, RLPR targets domain extension, and INTUITOR targets complete verifier independence. See Can model confidence alone replace external answer verification?.

Inquiring lines that use this note as a source 194

This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.

Related concepts in this collection 6

This note in its neighbourhood — explore the map, then jump to a related concept in the list below.

Concept map

18 direct connections · 168 in 2-hop network ·dense cluster Open in graph ↗

Can model confidence work as a reward signal for… Does binary reward training hurt model calibration… Does self-consistency reliably reward correct answ… Do users worldwide trust confident AI outputs even… Can model confidence alone replace external answer… Does preference optimization harm conversational u… Can we detect when language models confabulate?

Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph

your link semantically near linked from elsewhere

Does binary reward training hurt model calibration? Explores whether the standard correctness-based reward in RL training creates incentives for overconfident predictions, and what structural problem causes calibration to degrade during optimization.
complementary approach: explicit calibration reward term vs calibration as primary reward
Does self-consistency reliably reward correct answers during training? Self-consistency initially correlates with correctness, but as models train on this signal, do they eventually learn to maximize consistency itself rather than accuracy? When does this proxy reward stop working?
RLSF shares the proxy reward structure but explicitly targets calibration to resist the hacking failure mode
Do users worldwide trust confident AI outputs even when wrong? Explores whether the tendency to over-rely on confident language model outputs transcends language and culture. Understanding this pattern is critical for designing safer human-AI interaction across diverse linguistic contexts.
RLSF addresses the upstream cause: if models are better calibrated, user overreliance on confidence signals becomes less dangerous
Can model confidence alone replace external answer verification? Can LLMs use their own certainty signals instead of external verifiers to improve reasoning? This matters for scaling beyond domains where correct answers can be automatically checked.
extends: RLPR/INTUITOR use intrinsic probability for domain extension; RLSF uses confidence for calibration restoration
Does preference optimization harm conversational understanding? Exploring whether RLHF training that rewards confident, complete responses undermines the grounding acts—clarifications, checks, acknowledgments—that actually build shared understanding in dialogue.
RLSF addresses one specific dimension of the alignment tax: RLHF degrades both calibration and conversational grounding; RLSF patches the calibration damage by using confidence as intrinsic reward, showing that some alignment costs are design choices that can be reversed without undoing alignment benefits
Can we detect when language models confabulate? Current uncertainty metrics fail to catch inconsistent outputs that look confident. Could measuring semantic divergence across samples reveal confabulation signals that token-level metrics miss?
RLSF's model confidence and semantic entropy are complementary self-referential uncertainty signals: RLSF uses internal token probabilities to restore calibration during training, while semantic entropy uses sampled output clustering to detect confabulations at inference; both bypass the need for external ground truth

Can model confidence work as a reward signal for reasoning?

Related concepts in this collection 6

Related papers in this collection 8

Search by related questions 4