Can runtime confidence signals detect when reasoning has crossed the overthinking threshold?
This explores whether a model's own confidence — read live as it generates — can flag the moment reasoning tips from helpful into wasteful overthinking, and whether that signal is reliable enough to act on.
This question is really two stacked claims: first, that overthinking is a real, locatable threshold, and second, that confidence signals can sense when a reasoning trace has crossed it. The corpus supports the first strongly. Accuracy doesn't climb forever with more thinking: it peaks and then falls, dropping from 87.3% to 70.3% (a 17-point loss) as thinking tokens scale from ~1,100 to ~16,000 (Does more thinking time always improve reasoning accuracy?; When does thinking too much actually hurt reasoning?). Extra tokens past the peak don't refine the answer; they inflate output variance and inject self-revision errors. So there genuinely is a line to detect.
On the second claim, the most direct evidence is ReBalance, which treats confidence not as a final score but as a running diagnostic. Confidence variance and overconfidence patterns reveal whether a model is spinning redundantly (overthinking) or quitting too early (underthinking), and ReBalance uses those readings to steer the trace mid-flight without any retraining (Can confidence patterns reveal overthinking versus underthinking?). The key move is granularity: confidence has to be read locally. Step-level confidence catches reasoning breakdowns that a single global average smooths over, and crucially it lets you stop a trace early, before it wanders past the productive zone (Does step-level confidence outperform global averaging for trace filtering?). A global confidence number is the wrong instrument; a per-step trajectory is the right one.
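To make the local-versus-global contrast concrete, here is a minimal Python sketch. It is not ReBalance or any published method: the function names, the window size, and the 0.5 threshold are illustrative assumptions, and per-token log-probabilities are assumed to be available from the decoder.

```python
import math
from typing import List, Optional


def step_confidences(token_logprobs: List[float], window: int) -> List[float]:
    """Mean token probability over each sliding window of `window` tokens."""
    probs = [math.exp(lp) for lp in token_logprobs]
    return [
        sum(probs[i:i + window]) / window
        for i in range(len(probs) - window + 1)
    ]


def global_confidence(token_logprobs: List[float]) -> float:
    """Single trace-wide average: the 'wrong instrument' in the text above."""
    probs = [math.exp(lp) for lp in token_logprobs]
    return sum(probs) / len(probs)


def first_low_confidence_step(
    token_logprobs: List[float],
    window: int = 4,          # assumed step size, in tokens
    threshold: float = 0.5,   # assumed cutoff, not taken from the source
) -> Optional[int]:
    """Index of the first window whose mean confidence falls below threshold.

    Returns None if no window dips below the threshold. A caller could halt
    or redirect generation at the returned position.
    """
    for i, conf in enumerate(step_confidences(token_logprobs, window)):
        if conf < threshold:
            return i
    return None


if __name__ == "__main__":
    # Hypothetical trace: confident, a short collapse, then confident again.
    trace = [math.log(p) for p in [0.9] * 10 + [0.2] * 4 + [0.9] * 10]
    print("global:", round(global_confidence(trace), 3))
    print("first low window:", first_low_confidence_step(trace))
```

In the hypothetical trace, the global average stays comfortably above the cutoff while the windowed reading flags the collapse as soon as it starts, which is the behavior the paragraph above describes.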
What makes this more than wishful thinking is that confidence turns out to carry real information about reasoning quality, not just self-assurance. Answer-span confidence is a strong enough signal to rank reasoning traces and even serve as a training reward that improves step-by-step reasoning while fixing the calibration that RLHF degrades (Can model confidence work as a reward signal for reasoning?). And there's a deeper structural signal beneath surface confidence: the deep-thinking ratio measures the share of tokens whose predictions get substantially revised across the model's layers, which correlates with accuracy and can match self-consistency at lower cost (Can we measure how deeply a model actually reasons?). That hints the overthinking threshold might be visible internally, as the point where the layers stop revising and the model just elaborates.
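The source does not give the exact formula for the deep-thinking ratio, so the following Python sketch is only one plausible reading. It assumes a hypothetical readout that yields, for every generated token, the top-1 prediction at each layer, and it counts a token as "deep" when its prediction only settles on the final answer in the last quarter of the network. The names `settle_layer`, `deep_thinking_ratio`, and the `late_fraction` cutoff are all assumptions.

```python
from typing import Sequence


def settle_layer(layer_preds: Sequence[int]) -> int:
    """Earliest layer from which the prediction equals the final one for good."""
    final = layer_preds[-1]
    idx = len(layer_preds) - 1
    while idx > 0 and layer_preds[idx - 1] == final:
        idx -= 1
    return idx


def deep_thinking_ratio(
    per_token_layer_preds: Sequence[Sequence[int]],
    late_fraction: float = 0.75,  # assumed cutoff for "settles late"
) -> float:
    """Fraction of tokens whose prediction is still being revised late in depth.

    Each inner sequence holds one token's top-1 prediction at every layer,
    ordered from the first layer to the last (a hypothetical readout).
    """
    if not per_token_layer_preds:
        return 0.0
    deep = 0
    for preds in per_token_layer_preds:
        if len(preds) < 2:
            continue  # a single layer cannot show any revision
        relative_depth = settle_layer(preds) / (len(preds) - 1)
        if relative_depth >= late_fraction:
            deep += 1
    return deep / len(per_token_layer_preds)
```

Under this reading, a trace whose ratio drifts toward zero is one where later layers mostly confirm what earlier layers already predicted, which is the "stops revising and just elaborates" pattern suggested above.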
The sharp caveat the corpus adds: confidence detects overthinking-as-redundancy, but not every overthinking failure shows up as low confidence. Reasoning models will confidently churn out long traces for ill-posed questions with missing premises, because they were trained to produce reasoning steps but never taught when to disengage (Why do reasoning models overthink ill-posed questions?). And the trace itself is an unreliable witness: intermediate reasoning tokens are partly learned formatting rather than verified computation, so confidence measured on a fluent-but-empty trace can look healthy when the reasoning is not (Do reasoning traces actually cause correct answers?).
The thing you didn't know you wanted to know: overthinking and the diminishing returns of search may be the same phenomenon. Deep-research agents that take more search steps follow a scaling pattern that mirrors the one for reasoning tokens: gains, then diminishing returns (Do search steps follow the same scaling rules as reasoning tokens?). If both inference-time efforts share one curve, then a confidence-based stopping rule isn't a trick for one task; it's a candidate general governor for knowing when more compute stops paying off.
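As a rough illustration of what such a governor could look like, the Python sketch below applies a patience-style rule, borrowed from early stopping in model training rather than from any of the cited notes. It treats each reasoning chunk or search step as one unit and stops once the best confidence reading has not improved by at least `min_gain` over the last `patience` units. The function names, the parameters, and the example readings are hypothetical.

```python
from typing import Iterable, List


def should_stop(history: List[float], patience: int = 3, min_gain: float = 0.01) -> bool:
    """True when the last `patience` readings add less than `min_gain`
    over the best reading seen before them."""
    if len(history) <= patience:
        return False  # not enough evidence yet
    best_before = max(history[:-patience])
    best_recent = max(history[-patience:])
    return best_recent - best_before < min_gain


def steps_until_stop(signal: Iterable[float], patience: int = 3, min_gain: float = 0.01) -> int:
    """Number of steps consumed before the governor halts.

    `signal` is any per-step confidence reading: one value per reasoning
    chunk or per search step. Returns the total length if it never halts.
    """
    history: List[float] = []
    for step, value in enumerate(signal, start=1):
        history.append(value)
        if should_stop(history, patience, min_gain):
            return step
    return len(history)


if __name__ == "__main__":
    # Hypothetical readings: early gains, then a plateau.
    readings = [0.40, 0.55, 0.65, 0.70, 0.70, 0.69, 0.70, 0.68]
    print("stopped after step", steps_until_stop(readings))
```

The rule is indifferent to whether a step is a block of thinking tokens or a search call, which is the sense in which it could serve both settings. It inherits the caveat raised earlier: a confidently wrong trace on an ill-posed question would not trip it.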
Sources (9 notes)
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
Empirical studies demonstrate non-monotonic scaling in test-time reasoning: accuracy peaks at a critical thinking-token count, then declines sharply (87.3% to 70.3% as tokens scale from 1,100 to 16,000). Extended thinking inflates output variance and introduces self-revision errors rather than improving solution quality.
ReBalance uses confidence variance and overconfidence as diagnostic signals to apply training-free steering vectors that reduce overthinking redundancy while promoting exploration during underthinking, improving accuracy across models from 0.5B to 32B parameters.
Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves comparable accuracy gains to naive majority voting with far fewer generated traces, proving trace quality matters more than quantity.
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
Deep-thinking ratio (DTR) measures the proportion of tokens whose predictions undergo significant revision across model layers, correlating robustly with accuracy across AIME, HMMT, and GPQA benchmarks. Think@n, a test-time strategy using DTR, matches self-consistency performance while reducing inference costs.
Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.
R1's intermediate tokens carry no special execution semantics and are generated identically to other LLM output. Invalid traces frequently produce correct answers, proving traces are not causally necessary—they correlate with answers via learned formatting, not functional reasoning.
Deep research agents improve with more search steps in a pattern mirroring the reasoning-token relationship, with both exhibiting diminishing returns. This reveals a new inference-compute axis beyond model capability alone.