SYNTHESIS NOTE

Does step-level confidence outperform global averaging for trace filtering?

Explores whether measuring confidence at individual reasoning steps, rather than averaging it across entire traces, better identifies and filters out low-quality reasoning. This matters because it could improve both accuracy and compute efficiency in multi-trace reasoning.

Synthesis note · 2026-02-20 · sourced from Test Time Compute
How should we allocate compute budget at inference time?

Standard majority voting treats all reasoning traces equally. DeepConf improves on this by filtering traces based on model-internal confidence signals. Its key finding is that local (step-level) confidence is more informative than global confidence averaged across the full trace.

Global confidence fails in two ways: (1) it averages over the entire trace, masking critical reasoning breakdowns at specific intermediate steps; (2) it requires the full trace to be generated before it can be computed, preventing early stopping.
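The masking effect can be illustrated with a minimal sketch. The per-step confidence values below are invented for illustration, and the sliding-window minimum is only one assumed way to define a local confidence score; the window size of 3 is likewise an assumption, not a value taken from DeepConf.

```python
from statistics import mean

def global_confidence(conf):
    """Average confidence over the whole trace."""
    return mean(conf)

def lowest_window_confidence(conf, window=3):
    """Lowest mean confidence over any sliding window (assumed local metric)."""
    if len(conf) < window:
        return mean(conf)
    return min(mean(conf[i:i + window]) for i in range(len(conf) - window + 1))

# Hypothetical traces: one uniformly moderate, one mostly confident
# with a sharp breakdown at steps 5 and 6.
steady = [0.8] * 10
dipped = [0.95, 0.95, 0.95, 0.95, 0.2, 0.2, 0.95, 0.95, 0.95, 0.95]

for name, trace in [("steady", steady), ("dipped", dipped)]:
    print(name, global_confidence(trace), lowest_window_confidence(trace))
```

Both traces have essentially the same global average, so a global filter cannot tell them apart. Only the windowed score exposes the breakdown in the second trace.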

Step-level confidence catches local failures as they occur. A single low-confidence step is a signal worth acting on immediately, before it compounds through subsequent reasoning. This enables early termination of low-quality traces, reducing unnecessary token generation while maintaining or improving accuracy.
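A rough sketch of early termination, assuming step confidences arrive one at a time as a trace is generated. The function name `run_with_early_stop`, the threshold of 0.5, and the window of 3 steps are all hypothetical choices for illustration; a real system would read confidences from the model during decoding.

```python
from collections import deque

def run_with_early_stop(step_stream, threshold=0.5, window=3):
    """Consume step confidences until the recent-window mean drops below threshold.

    Returns (kept, steps_consumed). kept is False if the trace was abandoned.
    """
    recent = deque(maxlen=window)  # only the last `window` confidences
    consumed = 0
    for conf in step_stream:
        consumed += 1
        recent.append(conf)
        if len(recent) == window and sum(recent) / window < threshold:
            return False, consumed  # stop generating this trace
    return True, consumed

# Hypothetical confidence streams for two 10-step traces.
steady = [0.8] * 10
dipped = [0.95, 0.95, 0.95, 0.95, 0.2, 0.2, 0.95, 0.95, 0.95, 0.95]

print(run_with_early_stop(iter(steady)))
print(run_with_early_stop(iter(dipped)))
```

In this toy run the second trace is abandoned partway through, so its remaining steps are never generated. A global average could not trigger this, because it is only defined once the trace is complete.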

The practical payoff: getting from 68% to 82% accuracy on AIME 2025 via standard majority voting requires 511 additional traces per question with Qwen3-8B. Confidence-aware filtering achieves similar accuracy gains with far fewer traces. The compute efficiency argument for trace filtering is strong.
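To make the contrast with plain majority voting concrete, here is a minimal sketch of confidence-filtered voting. It assumes each trace has already been reduced to a final answer and a single confidence score; the `keep_fraction` parameter and the toy scores are illustrative and are not values reported by DeepConf.

```python
import math
from collections import Counter

def majority_vote(traces):
    """Plain majority vote over (answer, confidence) pairs, ignoring confidence."""
    return Counter(answer for answer, _ in traces).most_common(1)[0][0]

def filtered_vote(traces, keep_fraction=0.5):
    """Keep only the most confident traces, then take a majority vote."""
    keep_n = max(1, math.ceil(len(traces) * keep_fraction))
    kept = sorted(traces, key=lambda t: t[1], reverse=True)[:keep_n]
    return majority_vote(kept)

# Hypothetical traces: the wrong answer "17" is more frequent but low-confidence.
traces = [("42", 0.90), ("42", 0.85), ("17", 0.30), ("17", 0.35), ("17", 0.40)]

print(majority_vote(traces))
print(filtered_vote(traces, keep_fraction=0.5))
```

In this toy case the unfiltered vote follows the more numerous low-confidence traces, while the filtered vote follows the smaller set of high-confidence ones.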

The implication: trace quality is more relevant than trace quantity for aggregation, and local confidence is a better quality proxy than global confidence or trace length.

Self-Evaluation Guided Beam Search as decoding implementation: The Self-Evaluation approach (Xie et al., 2023) translates step-level confidence into a decoding algorithm. It defines a constraint function C(s^t, s^{1:t-1}) ∈ [0, 1] that outputs the LLM's confidence in the correctness of reasoning step s^t given the prior steps s^{1:t-1}. This confidence guides a stochastic beam search: each "step" in the beam search is a semantic reasoning unit (not a single token), and the self-evaluation score serves as a better-calibrated automatic criterion for pruning the search. Stochastic beam search balances exploitation (following high-confidence paths) and exploration (temperature-controlled randomness to avoid premature convergence). This operationalizes step-level confidence as a search mechanism rather than just a filter.
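The search loop can be sketched as follows. This is a simplified illustration, not the method as published: `propose` and `self_eval` are hypothetical stand-ins for LLM calls, beams are scored by cumulative log-confidence alone, and beams are sampled without replacement from a temperature-scaled softmax. All of these are assumptions made to keep the example self-contained.

```python
import math
import random

def stochastic_beam_search(propose, self_eval, n_steps,
                           beam_width=2, temperature=0.5, seed=0):
    """Step-level beam search guided by a self-evaluation confidence in [0, 1]."""
    rng = random.Random(seed)
    beams = [((), 0.0)]  # (steps so far, cumulative log-confidence)
    for _ in range(n_steps):
        pool = []
        for steps, score in beams:
            for step in propose(steps):
                c = self_eval(step, steps)  # plays the role of C(s^t, s^{1:t-1})
                pool.append((steps + (step,), score + math.log(max(c, 1e-9))))
        # Sample the next beams without replacement; lower temperature
        # means more exploitation, higher temperature more exploration.
        beams = []
        while pool and len(beams) < beam_width:
            top = max(s for _, s in pool)
            weights = [math.exp((s - top) / temperature) for _, s in pool]
            idx = rng.choices(range(len(pool)), weights=weights, k=1)[0]
            beams.append(pool.pop(idx))
    return sorted(beams, key=lambda b: b[1], reverse=True)

# Toy stand-ins for the model (assumptions): two candidate steps per position,
# with a fixed confidence for each.
def propose(steps):
    return ["good", "bad"]

def self_eval(step, steps):
    return 0.9 if step == "good" else 0.2

result = stochastic_beam_search(propose, self_eval, n_steps=3,
                                beam_width=2, temperature=0.01)
print(result[0][0])
```

With a very low temperature the sampling is nearly greedy and the all-high-confidence chain ranks first. Raising `temperature` lets lower-scoring chains survive pruning more often, which is the exploration side of the trade-off.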

Original note title

confidence-aware step-level filtering outperforms global confidence averaging for trace selection