Does step-level confidence outperform global averaging for trace filtering?

Explores whether measuring confidence at individual reasoning steps—rather than averaging across entire traces—better identifies and filters out low-quality reasoning. Matters because it could dramatically improve both accuracy and compute efficiency in multi-trace reasoning.

Synthesis note · 2026-02-20 · sourced from Test Time Compute

Standard majority voting treats all reasoning traces equally. DeepConf improves on this by filtering traces based on model-internal confidence signals — and the key finding is that local (step-level) confidence is more informative than global confidence averaged across the full trace.

Global confidence fails in two ways: (1) it averages over the entire trace, masking critical reasoning breakdowns at specific intermediate steps; (2) it requires the full trace to be generated before it can be computed, preventing early stopping.

Step-level confidence catches local failures as they occur. A single low-confidence step is a signal worth acting on immediately, before it compounds through subsequent reasoning. This enables early termination of low-quality traces, reducing unnecessary token generation while maintaining or improving accuracy.

The practical payoff: getting from 68% to 82% accuracy on AIME 2025 via standard majority voting requires 511 additional traces per question with Qwen3-8B. Confidence-aware filtering achieves similar accuracy gains with far fewer traces. The compute efficiency argument for trace filtering is strong.

The implication: trace quality is more relevant than trace quantity for aggregation, and local confidence is a better quality proxy than global confidence or trace length.

Self-Evaluation Guided Beam Search as decoding implementation: The Self-Evaluation approach (Xie et al., 2023) translates step-level confidence into a decoding algorithm. It defines a constraint function C(st, s1:t-1) ∈ [0,1] that outputs the LLM's confidence in the correctness of each reasoning step given prior context. This confidence guides a stochastic beam search: each "step" in beam search is a semantic reasoning unit (not a single token), and the self-evaluation score serves as a better-calibrated automatic criterion for pruning the search. Stochastic beam search balances exploitation (following high-confidence paths) and exploration (temperature-controlled randomness to avoid premature convergence). This operationalizes step-level confidence as a search mechanism rather than just a filter.

Inquiring lines that use this note as a source 175

This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.

Related concepts in this collection 6

This note in its neighbourhood — explore the map, then jump to a related concept in the list below.

Concept map

20 direct connections · 182 in 2-hop network ·dense cluster Open in graph ↗

Does step-level confidence outperform global ave… Why does majority voting outperform more complex i… Do hedging markers actually signal careful thinkin… Do high-entropy tokens drive reasoning model impro… Can we measure how deeply a model actually reasons… Can reasoning steps be dynamically pruned without … Do reflection tokens carry more information about …

Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph

your link semantically near linked from elsewhere

Why does majority voting outperform more complex inference methods? Simple majority voting across independent samples often matches or beats sophisticated alternatives like Best-of-N and sequential revision. What makes this basic approach so hard to beat for reasoning models?
confidence-aware filtering as an improvement on naive majority voting
Do hedging markers actually signal careful thinking in AI? Explores whether linguistic markers like "alternatively" and "however" in model outputs correlate with accuracy or uncertainty. This matters because users often interpret such language as a sign of trustworthy reasoning.
linguistic confidence signals and internal confidence signals may converge
Do high-entropy tokens drive reasoning model improvements? Explores whether only a small fraction of tokens—those with high entropy at decision points—actually matter for improving reasoning performance in language models, and whether training on them alone could work as well as full training.
extends: high-entropy tokens are forking decision points; step-level confidence at those forks is precisely where filtering signal concentrates, so step-level filtering targets the same minority tokens that carry RLVR's training signal
Can we measure how deeply a model actually reasons? What if reasoning quality isn't about length or confidence, but about how much a model's predictions shift across its internal layers? Can tracking these shifts reveal genuine thinking versus pattern-matching?
complements: DTR provides a stronger trace-quality signal than confidence alone (layer-wise stabilization); together with step-level confidence they define a two-channel filtering criterion (computational depth + step certainty)
Can reasoning steps be dynamically pruned without losing accuracy? This explores whether chain-of-thought reasoning contains redundant steps that can be identified and removed during inference. Understanding which steps matter could improve efficiency while maintaining correctness.
extends: PI categorizes step types and shows verification/backtracking steps receive minimal subsequent attention; this gives a structural complement to confidence-based filtering — drop steps that are both low-confidence AND attention-invisible
Do reflection tokens carry more information about correct answers? Explores whether tokens expressing reflection and transitions concentrate information about reasoning outcomes disproportionately compared to other tokens, and what role they play in reasoning performance.
grounds: MI peaks identify which tokens carry signal about correctness; step-level confidence converges on the same sparse tokens through a different measurement channel

Does step-level confidence outperform global averaging for trace filtering?

Related concepts in this collection 6

Related papers in this collection 8

Search by related questions 4