Can model confidence work as a reward signal for reasoning?
Explores whether using a language model's own confidence scores as training rewards can simultaneously improve reasoning accuracy and restore calibration that standard RLHF damages.
Reinforcement Learning from Self-Feedback (RLSF) exploits a simple observation: in a well-calibrated model, answer confidence correlates with reasoning quality. By using confidence as the reward signal rather than human preference or external verification, RLSF achieves two things simultaneously that normally trade off:
(i) Restores calibration: confidence becomes predictive of correctness again after RLHF has degraded it. RLHF optimizes for human preference and fluency, which rewards confident-sounding outputs regardless of accuracy. RLSF reverses this by tying the reward explicitly to calibrated confidence.
(ii) Strengthens step-by-step reasoning — higher-confidence answer spans tend to come from traces with more coherent reasoning chains. Training to maximize confidence indirectly selects for better reasoning.
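One way to check the claim in (i), that confidence is "predictive of correctness", is to compute a calibration metric before and after training. The sketch below uses expected calibration error (ECE) with equal-width bins. The function name, the bin count, and the choice of ECE itself are illustrative assumptions; the note does not say which metric is used.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: the size-weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin rather than out of range.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece
```

A model that says 0.75 and is right three times out of four scores an ECE of 0. A model that says 0.95 and is right one time out of four scores about 0.7, which is the overconfident pattern the note attributes to RLHF.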
The mechanism: a frozen LLM generates multiple chain-of-thought (CoT) solutions for each problem. Confidence is computed over the final-answer span of each solution. Traces are then ranked by this confidence to create a synthetic preference dataset (higher confidence = chosen, lower = rejected). A reward model is trained on these preferences and used for standard RL finetuning.
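The ranking step can be sketched in a few lines. Everything named here is hypothetical: the `Trace` container, the `min_margin` filter, and in particular the confidence definition (the mean gap between the top two token probabilities over the answer span), which is one common choice and not something the note specifies. The `pairwise_loss` function is the standard Bradley-Terry objective often used for reward models, included on the assumption that the reward model is trained that way.

```python
import math
from dataclasses import dataclass
from itertools import combinations
from typing import List, Sequence, Tuple


@dataclass
class Trace:
    text: str
    # For each token of the final-answer span: candidate token probabilities (any order).
    answer_token_probs: Sequence[Sequence[float]]


def answer_confidence(trace: Trace) -> float:
    """Assumed confidence: mean (top-1 minus top-2) probability over the answer span."""
    if not trace.answer_token_probs:
        raise ValueError("answer span is empty")
    gaps = []
    for probs in trace.answer_token_probs:
        ranked = sorted(probs, reverse=True)
        runner_up = ranked[1] if len(ranked) > 1 else 0.0
        gaps.append(ranked[0] - runner_up)
    return sum(gaps) / len(gaps)


def build_preference_pairs(
    traces: Sequence[Trace], min_margin: float = 0.0
) -> List[Tuple[str, str]]:
    """Return (chosen, rejected) text pairs; the higher-confidence trace is chosen."""
    scored = [(answer_confidence(t), t.text) for t in traces]
    pairs = []
    for (conf_a, text_a), (conf_b, text_b) in combinations(scored, 2):
        if abs(conf_a - conf_b) <= min_margin:
            continue  # skip ties and near-ties: no usable preference signal
        pairs.append((text_a, text_b) if conf_a > conf_b else (text_b, text_a))
    return pairs


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry loss, -log(sigmoid(r_chosen - r_rejected)), in a numerically stable form."""
    d = reward_chosen - reward_rejected
    return math.log1p(math.exp(-d)) if d >= 0 else -d + math.log1p(math.exp(d))
```

Note that no gold answer appears anywhere in this pipeline: the only inputs are the sampled traces and the token probabilities of the model that produced them.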
The key insight is that confidence-as-reward can be inserted as an additional post-training step after standard SFT and RLHF — patching the calibration damage that RLHF introduces without undoing its alignment benefits. This requires no human labels, gold answers, or externally curated rewards.
The human learning parallel is explicit: humans use confidence as an intrinsic reward signal when external feedback is unavailable. Metacognitive monitoring — the ability to track your own certainty — is how humans regulate their own learning without a teacher.
The connection to "Does binary reward training hurt model calibration?" is complementary: that work adds calibration as an explicit second reward term, whereas RLSF uses calibration itself as the primary reward. Both address the same RLHF-induced calibration degradation from different angles.
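To make the "explicit second reward term" concrete, here is a minimal sketch of a correctness reward with an added calibration term. The Brier-style penalty and the function name are assumptions chosen for illustration; the note does not give the exact form.

```python
def calibration_augmented_reward(correct: bool, stated_confidence: float) -> float:
    """Binary correctness reward minus a Brier-style penalty on the stated confidence.

    Assumed form: R = y - (q - y)^2, with y in {0, 1} and q in [0, 1].
    """
    if not 0.0 <= stated_confidence <= 1.0:
        raise ValueError("stated_confidence must lie in [0, 1]")
    y = 1.0 if correct else 0.0
    return y - (stated_confidence - y) ** 2
```

Under this form a confident correct answer earns 1.0, a wrong answer given with zero confidence earns 0.0, and a confident wrong answer earns -1.0. The reward needs a correctness label for each answer, which is the practical difference from RLSF as described above: RLSF requires no gold answers.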
The risk is the same one examined in "Does self-consistency reliably reward correct answers during training?": confidence and self-consistency are correlated proxies, and both are vulnerable to the model becoming confidently wrong. RLSF's emphasis on calibration (making confidence track accuracy) is explicitly designed to resist this, since the model is rewarded for being accurately confident, not just confident.
Extensions to general domains via RLPR and INTUITOR: Two RLVR papers extend intrinsic reward signals beyond math to general domains. RLPR (Reinforcement Learning with Reference Probability Reward) computes the model's token-level probability of generating a reference answer and uses it as the reward signal, so the model's own knowledge about what constitutes a correct answer replaces external verifiers. INTUITOR goes further: it uses self-certainty as the sole reward signal, a measure of how sharply the model's next-token distributions favor their top choices over the alternatives (formulated in the INTUITOR paper as the average KL divergence from a uniform distribution). Both extend verifiable-reward RL to domains without rule-based verifiers (medicine, law, open-ended reasoning), precisely the domains where external verification infrastructure is hardest to build. The convergence with RLSF is notable: all three use the model's internal probability landscape as reward, but RLSF targets calibration restoration, RLPR targets domain extension, and INTUITOR targets complete verifier independence. See "Can model confidence alone replace external answer verification?".
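The two intrinsic rewards can be sketched side by side. Both functions are illustrative assumptions, not the papers' code: the reference-probability reward is aggregated here as an arithmetic mean over reference-answer tokens, and self-certainty is taken as the mean KL divergence from a uniform distribution to the model's next-token distribution, which is one published formulation. The inputs are plain lists of probabilities that a caller would extract from model logits.

```python
import math
from typing import Sequence


def reference_probability_reward(ref_token_probs: Sequence[float]) -> float:
    """Mean probability the model assigns to each token of a reference answer."""
    if not ref_token_probs:
        raise ValueError("reference answer has no tokens")
    return sum(ref_token_probs) / len(ref_token_probs)


def self_certainty(next_token_dists: Sequence[Sequence[float]], eps: float = 1e-12) -> float:
    """Mean over positions of KL(U || p), where U is uniform over the vocabulary.

    KL(U || p) = -log(V) - (1/V) * sum_j log(p_j); it is 0 for a uniform p
    and grows as p concentrates on a few tokens.
    """
    if not next_token_dists:
        raise ValueError("need at least one position")
    total = 0.0
    for dist in next_token_dists:
        vocab = len(dist)
        # eps guards against log(0) for tokens the model rules out entirely.
        mean_log_p = sum(math.log(max(p, eps)) for p in dist) / vocab
        total += -math.log(vocab) - mean_log_p
    return total / len(next_token_dists)
```

The first reward still needs a reference answer, though no verifier to compare against it. The second needs nothing beyond the model's own output distributions, which matches the "complete verifier independence" framing above.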
Inquiring lines that use this note as a source (194)
This note is a source for these synthesized inquiries.
- How do belief distributions help systems recover from speech recognition errors?
- Does the same uncertainty-driven logic appear in other conversation systems?
- How does RLHF labeler identity shape the values AI systems learn?
- Can a single LLM weight set be optimized for both stake-taking and conversational helpfulness?
- How does RLHF training encode values into AI systems?
- How should we redesign benchmarks to catch conservative bias in reasoning tasks?
- What happens when validation pressure triggers escalating persuasion in language models?
- Do language models share the same cooperative truth-seeking rules as humans?
- How do models integrate conflicting signals in reasoning tasks?
- What would it mean to assign explicit trust weights to synthetic data?
- Does RLHF training create models that sound convincing without being more accurate?
- Does uncertainty quantification in model responses reduce persuasive impact on audiences?
- Does self-conditioning improve belief-behavior alignment better than external priors?
- Do verbal uncertainty estimates calibrate better than confidence scores for personalization?
- Why does combining natural language with numerical scores improve prediction accuracy?
- What calibration corrections can reduce LLM judge bias in automated evaluation pipelines?
- Why does binary reward forcing degrade model calibration?
- How does RLHF reward structure incentivize agreement over accuracy?
- Does user preference for confirmation override model capability for disagreement?
- How do models signal knowledge gaps through token probability?
- Can models learn when to invoke search during reasoning tasks?
- How does step-level confidence filtering compare to global confidence averaging?
- Why do reward models trained for accuracy ignore important context about the input?
- Can single models correct their own beliefs without amplifying confidence in wrong answers?
- Can log-likelihood loss combined with binary rewards achieve calibration?
- Do models actually self-assess their confidence or just confirm answers?
- How do reward model ensembles improve robustness to miscalibration?
- What makes the Brier score mathematically better than log-likelihood here?
- How do we assign confidence and polarity scores to belief edges?
- Can synthetic self-play data teach models when to disagree?
- Can high-entropy tokens and step-level confidence identify the same critical reasoning forks?
- How can stochastic beam search operationalize step-level confidence into a decoding algorithm?
- Why do linguistic hedging markers correlate with internal confidence signals in reasoning traces?
- What mechanism causes confident false answers under high cognitive load?
- Why does reasoning fine-tuning reduce model abstention capacity by 24 percent?
- What happens when confident language masks uncertainty in AI outputs?
- How does optimizing model performance decouple from optimizing user interpretability?
- How do critique models prevent policy entropy collapse during reasoning training?
- Why does natural language feedback break performance plateaus that numerical rewards alone cannot?
- Are larger models and search access substitutes for factual accuracy?
- How reliable is the top-2 confidence gap as a stopping signal across tasks?
- Can verifier-guided search catch factual errors that reasoning training cannot?
- Does RLHF training suppress exploratory and qualifying language?
- What causes snowball errors to accumulate across reasoning steps in language models?
- Why does domain accuracy improve while reasoning quality degrades after supervised fine-tuning?
- Can users learn to discount fluency as a signal of their competence?
- Why does NLI fine-tuning amplify frequency bias instead of teaching inference?
- Does supervised fine-tuning improve accuracy while damaging the quality of reasoning?
- Does reasoning fine-tuning actually reduce a model's ability to abstain?
- Why do diffusion LLM answer tokens converge in confidence long before reasoning stabilizes?
- What causes gradient-based steering via natural language descriptions to work?
- Does RLHF training specifically teach models to prioritize user agreement over accuracy?
- Can preference optimization training make models worse at detecting false presuppositions?
- Does optimizing for model confidence actually improve both performance and calibration simultaneously?
- How do probability-based rewards compare to self-consistency as training signals for reasoning?
- What happens when confident wrong answers become more rewarded than uncertain correct ones?
- Can unsupervised confidence-based training scale to domains beyond human evaluation reach?
- How does reward model training permit spurious correlations in scoring?
- How does self-revision on wrong answers increase model confidence further?
- Can uncertainty estimates based on model self-assessment reliably signal errors?
- How does evaluation format change what we measure about model reasoning?
- How does fine-tuning on natural language inference affect fallacy susceptibility?
- How does RLHF training incentivize confident guessing over grounding acts?
- Can language models correct false assumptions or only reinforce them?
- Does model confidence actually correlate with robustness against prompt variations?
- How does self-revision in reasoning chains amplify confidence in wrong answers?
- Can critic model trios evaluate reasoning quality more reliably than outcome rewards alone?
- Why does RLHF degrade model calibration despite improving preference alignment?
- What makes accurate confidence different from confident-but-wrong predictions?
- Can reward models trained for engagement fix the informativeness problem?
- Why does single-model self-revision amplify confidence in incorrect answers?
- Why does single-agent self-revision amplify confidence in wrong answers over time?
- How much does confidence-guided cascading between SAS and MAS improve accuracy?
- What distinguishes verifiable rewards from preference-based rewards in unified training?
- How does RLHF training for helpfulness create systematic misinterpretation patterns?
- Why does RLHF training push language models toward overly cheerful personas?
- How do semantic reward shaping approaches compare to full critique models?
- Can textual gradients generalize natural language feedback across computation graphs?
- Can suppressing incorrect behavior alone solve the diversity bottleneck in reasoning RL?
- Can RLHF training push models away from human-like lexical patterns?
- How does RLHF helpfulness training drive premature assumptions in multi-turn dialogue?
- What makes reasoning-specific post-training different from standard parameter scaling?
- Does model confidence actually explain why paraphrases produce different outputs?
- Can inflection points in reasoning detect when models genuinely change their minds?
- Can model confidence signals replace explicit external reward functions?
- How does factoring perception from reasoning improve sparse-label learning?
- Can counterfactual data augmentation fully eliminate preference model miscalibration?
- Can RL with verifiable rewards improve dialogue quality better than preference optimization?
- Why do language models prefer accommodating false information over rejecting it?
- How can reward structures teach models when to speak and when to stay silent?
- How does model confidence relate to exemplar brittleness in chain-of-thought?
- Does high model confidence increase the risk of human overreliance?
- Why does prompt sensitivity vanish when model confidence is high?
- Can reward-guided decoding replace weight fine-tuning for personalized alignment?
- Why do reasoning models confidently generate wrong answers instead of abstaining?
- What happens when reasoning fine-tuning eliminates model refusal mechanisms entirely?
- Do base models and reasoning models fail in opposite directions on uncertainty?
- How can we measure whether process rewards actually align with reasoning quality?
- Can training on reasoning traces teach actual self-correction or only confident first answers?
- Why do reasoning models amplify confidence in incorrect answers during self-revision?
- How do task-type perceptions like chat versus reasoning guide different reward strategies?
- Why does supervised fine-tuning degrade reasoning quality despite raising accuracy?
- Can random rewards improve reasoning models if pretraining is suitable?
- Can structured natural language feedback outperform scalar rewards in RL?
- How do confidence signals differ between implicit feedback and explicit ratings?
- Can external classifiers reliably decide when a model should reason?
- What causes length bias in language model reward models?
- Does reasoning fine-tuning actually damage a model's ability to abstain?
- Can language models accurately evaluate the quality of their own reasoning?
- Can semantic entropy improve model calibration without external ground truth?
- How does semantic entropy compare to confidence scores from internal model probabilities?
- Can models maintain auditable reasoning while achieving high accuracy?
- Are reasoning models more vulnerable to persuasion than standard models?
- How should dialogue systems represent and update uncertainty from noisy ASR input?
- What filtering criteria best identify student-compatible refinements from teacher models?
- Does reasoning fine-tuning actually harm a model's ability to abstain?
- Can proper scoring rules restore model calibration without sacrificing accuracy?
- Can intrinsic confidence signals improve both calibration and reasoning performance?
- How does model confidence relate to accuracy in underfitted domains?
- Why does model self-revision increase confidence while degrading accuracy?
- Does training on self-play disagreement data improve multi-agent reasoning outcomes?
- Can models become more convincing without becoming more correct?
- What alternatives to RLHF better preserve truth-seeking in AI outputs?
- Does internal self-revision actually degrade reasoning accuracy in models?
- How does RLHF training push chatbots toward problem-solving over exploration?
- Why does probability of text completion not equal knowledge value?
- Can preference model training be redesigned to prioritize factual correction over user agreement?
- Does training for persuasiveness harm a model's factual accuracy?
- Can weaker models match stronger ones with sufficient search and reasoning budget?
- Can warmth training in language models actually reduce their reliability?
- Can we improve reasoning by amplifying information at mutual information peaks?
- Can reward design fix the conflict between reasoning accuracy and abstention calibration?
- Can confidence levels reliably detect when a model is overthinking?
- How do surface signals like confidence override actual quality in user judgment?
- Why is confidence a dangerous proxy for accuracy in human-AI interaction?
- How do linguistic norms for expressing certainty vary across languages and models?
- Can emotion-grounded rewards replace coarse bonus signals in hierarchical dialogue RL?
- Can runtime confidence signals detect when reasoning has crossed the overthinking threshold?
- Can layer-wise prediction stabilization identify when genuine reasoning has stopped?
- Can structural conversation analysis replace text-based reward signals for AI alignment?
- Does supervised fine-tuning improve reasoning or just response formatting?
- When does outcome reward signal become informative during model training?
- Why does belief-shift reward enable smaller models to match larger baselines?
- Does belief-shift credit assignment generalize to tasks without ground-truth outcomes?
- How do dense token-level rewards compare to sparse task-level verification signals?
- Can smaller judge models better capture human preferences than larger prompted models?
- Does RLHF training make explanations more deceptive than transparent?
- How do reasoning-invariant tokens dilute learning signals in uniform averaging?
- Can step-level confidence filtering work better than global confidence scoring?
- Why do reward models fail to recognize genuinely different valid answers?
- Can models maintain reasoning-output coupling while improving domain accuracy?
- Can thought quality alone be trusted to guide model training?
- Do larger language models overcome greediness in sequential decision-making?
- Why does naive randomness fail to improve stochastic latent reasoning models?
- Does model uncertainty overwhelm persona-specific signal in conditioned predictions?
- Why do reasoning models exhibit self-doubt about their own early assessments?
- Why does reasoning fine-tuning suppress the confidence signals that adaptive retrieval needs?
- What role should reasoning agents play in validating multi-LLM ensemble outputs?
- Why does reasoning catalyst data remain stable across multiple self-improvement iterations?
- Can the same variance signal work as both reward and query filter?
- Why do outcome-based rewards train language models to over-engage rather than abstain?
- How do confidence thresholds compare to learned policies for triggering retrieval?
- Can verifier-free RL work without manual preference labels or task-specific training?
- Why does prompting discover capabilities that need reward-driven refinement?
- Can language models function as implicit process reward models through retrospection?
- How does uncertainty verbalization change student robustness across domains?
- Can teachers trained under uncertainty constraints distill better generalizing students?
- Are different reward signal sources substitutable in verifier-free RL?
- Can models possess latent reasoning capability that training signals fail to unlock?
- How does self-distillation degrade reasoning by suppressing uncertainty signals?
- Why do shorter confident reasoning traces fail on out-of-distribution problems?
- Do reasoning models need to verbalize doubt to correct their own mistakes?
- How can structured reasoning templates serve as rewards for code agent training?
- How do frontier models maintain agreement scores above 90 percent across reasoning tasks?
- What makes step-wise rewards denser than final-answer correctness signals?
- How do miscalibrated confidence signals affect the success of SmartPause routing?
- How does structured self-dialogue improve uncertainty assessment over confidence scores?
- Does RL training redirect self-doubt into productive gap analysis?
- How does confidence filtering improve selection of reasoning traces?
- Why does reinforcement learning training degrade model calibration?
- Can approximate or noisy reference answers work for RL-based reasoning training?
- How can language models extract more value from fewer demonstrations?
- Can information-gain principles improve how we choose what to label?
- Can question-only features replace model uncertainty checks at scale?
- What makes uncertainty calibration harder than expanding knowledge?
- Does premature confidence signal flawed reasoning in language models?
- How does expressing uncertainty help models avoid the answer-or-abstain dilemma?
- How does linguistic calibration differ from token probability calibration?
- What makes user-decision rewards better than model-confidence rewards?
- When does reinforcement learning actually produce true reasoning gains in models?
- How can we turn reasoning model failures into useful training signals?
- Can calibrated confidence reduce misleading consensus in group deliberation?
- How does preference learning differ from supervised finetuning for reasoning?
- How does evaluation setting affect measured reasoning capabilities in language models?
Related concepts in this collection (6)
- Does binary reward training hurt model calibration?
  Explores whether the standard correctness-based reward in RL training creates incentives for overconfident predictions, and what structural problem causes calibration to degrade during optimization.
  complementary approach: explicit calibration reward term vs calibration as primary reward
- Does self-consistency reliably reward correct answers during training?
  Self-consistency initially correlates with correctness, but as models train on this signal, do they eventually learn to maximize consistency itself rather than accuracy? When does this proxy reward stop working?
  RLSF shares the proxy reward structure but explicitly targets calibration to resist the hacking failure mode
- Do users worldwide trust confident AI outputs even when wrong?
  Explores whether the tendency to over-rely on confident language model outputs transcends language and culture. Understanding this pattern is critical for designing safer human-AI interaction across diverse linguistic contexts.
  RLSF addresses the upstream cause: if models are better calibrated, user overreliance on confidence signals becomes less dangerous
- Can model confidence alone replace external answer verification?
  Can LLMs use their own certainty signals instead of external verifiers to improve reasoning? This matters for scaling beyond domains where correct answers can be automatically checked.
  extends: RLPR/INTUITOR use intrinsic probability for domain extension; RLSF uses confidence for calibration restoration
- Does preference optimization harm conversational understanding?
  Exploring whether RLHF training that rewards confident, complete responses undermines the grounding acts (clarifications, checks, acknowledgments) that actually build shared understanding in dialogue.
  RLSF addresses one specific dimension of the alignment tax: RLHF degrades both calibration and conversational grounding; RLSF patches the calibration damage by using confidence as intrinsic reward, showing that some alignment costs are design choices that can be reversed without undoing alignment benefits
- Can we detect when language models confabulate?
  Current uncertainty metrics fail to catch inconsistent outputs that look confident. Could measuring semantic divergence across samples reveal confabulation signals that token-level metrics miss?
  RLSF's model confidence and semantic entropy are complementary self-referential uncertainty signals: RLSF uses internal token probabilities to restore calibration during training, while semantic entropy uses sampled output clustering to detect confabulations at inference; both bypass the need for external ground truth
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Post-Training Large Language Models via Reinforcement Learning from Self-Feedback
- Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty
- Understanding and Mitigating Premature Confidence for Better LLM Reasoning
- A Survey on Post-training of Large Language Models
- Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains
- RLPR: Extrapolating RLVR to General Domains without Verifiers
- RM-R1: Reward Modeling as Reasoning
- Reward Reasoning Model
Original note title
model confidence as intrinsic reward simultaneously restores calibration and improves reasoning — unlike RLHF which optimizes preference at the cost of calibration