SYNTHESIS NOTE

Can confidence trajectories reveal when reasoning goes wrong?

Does the timing of when a model commits to an answer predict whether its reasoning will be flawed? And can we use this signal to train better reasoning without expensive annotations?

Synthesis note · 2026-06-03 · sourced from Reinforcement Learning

Long chains of thought often contain logical gaps and unjustified leaps, so the extra reasoning tokens fail to deliver the gains they should. Improving reasoning quality directly would require process reward models, but the step-level annotations to train them are expensive and scarce — which is why RL on reasoning mostly relies on outcome rewards that improve answers without examining how they were reached.

The paper finds the missing signal in the model's own confidence trajectory. Premature confidence, meaning the model commits to an answer early and spends the remaining tokens rationalizing it, strongly predicts flawed reasoning across tasks and model scales. It is a quantitative, annotation-free indicator of post-hoc rationalization, which makes it usable as a training signal: progressive confidence shaping is an RL objective that rewards gradual confidence growth and penalizes early commitment, with no external labels or reward models. The reported gains are large: on Countdown, accuracy improves 3.2× (+42pp) and flawed reasoning drops by 48pp; on AIME, Pass@64 improves by 6.6pp across model sizes from 1.5B to 8B.
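To make the idea concrete, here is a minimal sketch of how an early-commitment penalty could be folded into an outcome reward. It is not the paper's objective, which this note does not spell out. Everything in it is an assumption for illustration: the function names `commitment_point` and `shaped_reward`, the 0.9 threshold, the 0.5 penalty weight, and the premise that each trajectory entry is the probability the model assigns to its final answer when probed at an evenly spaced checkpoint in the chain.

```python
from typing import Sequence


def commitment_point(confidences: Sequence[float], threshold: float = 0.9) -> float:
    """Fraction of the chain elapsed when confidence first reaches `threshold`.

    Returns 0.0 if the model is already committed at the first checkpoint
    and 1.0 if it never reaches the threshold. (Hypothetical helper.)
    """
    n = len(confidences)
    if n == 0:
        raise ValueError("confidence trajectory must not be empty")
    for i, c in enumerate(confidences):
        if c >= threshold:
            return i / n
    return 1.0


def shaped_reward(
    confidences: Sequence[float],
    outcome_reward: float,
    threshold: float = 0.9,
    weight: float = 0.5,
) -> float:
    """Outcome reward minus a penalty that grows the earlier the model commits.

    Assumed form: reward = outcome - weight * (1 - commitment_point).
    No step-level labels or reward model are involved.
    """
    earliness = 1.0 - commitment_point(confidences, threshold)
    return outcome_reward - weight * earliness


# Two correct answers with different confidence trajectories.
early = [0.95, 0.96, 0.97, 0.99]    # committed from the first checkpoint
gradual = [0.20, 0.40, 0.70, 0.95]  # confidence builds through the chain

print(shaped_reward(early, outcome_reward=1.0))    # 0.5
print(shaped_reward(gradual, outcome_reward=1.0))  # 0.875
```

Both trajectories end in a correct answer, yet the gradual one earns the higher reward. That is the sense in which confidence dynamics, rather than the final answer alone, carry the training signal. A fuller version would presumably also reward smooth growth between checkpoints instead of looking only at the first threshold crossing.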

The contribution is a cheap proxy for process supervision: confidence dynamics stand in for the step-level annotations a PRM would need. It connects directly to the note "Does chain-of-thought reasoning reflect genuine thinking or performance?", which establishes early commitment as a measurable phenomenon; this one turns it into a trainable objective. It also rhymes with "Do reasoning models switch between ideas too frequently?": both treat a confidence/attention dynamic as the lever, not the final answer.

Original note title

premature confidence is an annotation-free signal of flawed reasoning — rewarding gradual confidence growth improves reasoning without process labels