SYNTHESIS NOTE

Can we identify which tokens actually matter for reasoning?

Most tokens in an answer are determined by language patterns rather than reasoning. Is there a way to distinguish the small fraction of tokens whose certainty genuinely depends on the chain of thought?

Synthesis note · 2026-05-18 · sourced from Reasoning Methods CoT ToT

DRO introduces a clean operational definition of "the tokens that depend on the reasoning." For each token in a reference answer, measure the model's self-certainty under different sampled chain-of-thought prefixes. Most tokens — articles, connectives, lexically expected words — barely change in certainty across rollouts. A small minority show high variance: their certainty depends on which reasoning path was taken. These are the reasoning-reflective tokens. They are not lexically distinctive — they cannot be identified by surface features — but they carry the answer's actual sensitivity to the reasoning chain.
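A minimal sketch of the variance measurement described above. It assumes the per-token certainties are already available as a table with one row per sampled chain of thought and one column per reference token. The function names, the toy numbers, and the threshold are illustrative assumptions, not taken from DRO.

```python
from statistics import pvariance

def cross_rollout_variance(certainty):
    """certainty[r][t]: certainty of reference token t given CoT rollout r.
    Returns one variance per token position, computed across rollouts."""
    num_tokens = len(certainty[0])
    return [pvariance([row[t] for row in certainty]) for t in range(num_tokens)]

def reasoning_reflective(variances, threshold):
    """Indices of tokens whose cross-rollout variance exceeds `threshold`.
    The cutoff is a hypothetical knob, not a value from the source."""
    return [t for t, v in enumerate(variances) if v > threshold]

# Toy data (made up): 3 rollouts, 4 reference tokens.
# Only token 2 changes noticeably depending on the reasoning path.
certainty = [
    [0.98, 0.95, 0.90, 0.97],
    [0.97, 0.96, 0.20, 0.98],
    [0.99, 0.94, 0.55, 0.96],
]
variances = cross_rollout_variance(certainty)
reflective = reasoning_reflective(variances, threshold=0.01)
print(reflective)
```

Nothing in the filter looks at the token's surface form. Selection depends only on how the certainty moves across rollouts, which matches the claim that these tokens are not lexically distinctive.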

The implication for reward design is that a uniform average across all reference tokens has a poor signal-to-noise ratio. The average is dominated by tokens whose certainty is set by language modeling rather than by reasoning, so whatever differential the reasoning chain produces is diluted by tokens that would have appeared regardless. The variance filter isolates the reasoning-bearing fraction of the answer.
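The dilution can be made explicit with a short worked derivation. The notation is an assumption introduced here for illustration, not the source's: T reference tokens, S the subset of reasoning-reflective tokens, and c_t^{(i)} the certainty of token t under rollout i. The approximation step assumes the reasoning-invariant tokens contribute a negligible difference between rollouts.

```latex
% Assumed notation (not from the source): T reference tokens, S the subset of
% reasoning-reflective tokens, c_t^{(i)} the certainty of token t under rollout i.
\begin{align*}
R^{(i)} &= \frac{1}{T}\sum_{t=1}^{T} c_t^{(i)} \\
R^{(i)} - R^{(j)}
  &= \frac{1}{T}\sum_{t \in S}\bigl(c_t^{(i)} - c_t^{(j)}\bigr)
   + \frac{1}{T}\sum_{t \notin S}\bigl(c_t^{(i)} - c_t^{(j)}\bigr) \\
  &\approx \frac{|S|}{T}\,\bar{\Delta}_S,
  \qquad
  \bar{\Delta}_S = \frac{1}{|S|}\sum_{t \in S}\bigl(c_t^{(i)} - c_t^{(j)}\bigr)
\end{align*}
% Illustrative numbers: T = 10, |S| = 1, \bar{\Delta}_S = 0.6 gives a gap of 0.06.
```

Under these assumptions the reward gap between two rollouts shrinks by the factor |S|/T. In the illustrative case, a certainty difference of 0.6 on the single reasoning-reflective token shows up as only 0.06 in the uniform average.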

Up-weighting these high-variance tokens produces a sharper reward contrast across rollouts in a group. The mechanism is purely statistical — no human annotation, no per-step rubric, no extra model. Cross-rollout variance is computed from the policy's own samples, which makes the method cheap relative to process reward models (PRMs) that require labeled intermediate steps.
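The effect on reward contrast can be sketched as follows. The weighting scheme here (weights proportional to cross-rollout variance, normalized to sum to 1) is one simple assumption about what "up-weighting" could look like. The source does not specify DRO's exact formula, and all names and numbers below are hypothetical.

```python
from statistics import pvariance

def variance_weights(certainty, eps=1e-8):
    """Weight each reference token by its cross-rollout variance, normalized to sum to 1.
    `eps` keeps the weights defined (and uniform) if every variance is zero."""
    num_tokens = len(certainty[0])
    variances = [pvariance([row[t] for row in certainty]) for t in range(num_tokens)]
    total = sum(variances) + eps * num_tokens
    return [(v + eps) / total for v in variances]

def rollout_rewards(certainty, weights):
    """One scalar reward per rollout: weighted sum of its token certainties."""
    return [sum(w * c for w, c in zip(weights, row)) for row in certainty]

def spread(values):
    """Reward contrast within the group: best rollout minus worst rollout."""
    return max(values) - min(values)

# Toy data (made up): 3 rollouts in a group, 4 reference tokens.
certainty = [
    [0.98, 0.95, 0.90, 0.97],
    [0.97, 0.96, 0.20, 0.98],
    [0.99, 0.94, 0.55, 0.96],
]
num_tokens = len(certainty[0])
uniform = [1.0 / num_tokens] * num_tokens
weighted = variance_weights(certainty)

uniform_spread = spread(rollout_rewards(certainty, uniform))
weighted_spread = spread(rollout_rewards(certainty, weighted))
print(uniform_spread, weighted_spread)
```

Everything is computed from the policy's own sampled rollouts, so the sketch needs no labels and no auxiliary model, which is the cost argument made above against PRMs.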

The deeper point is that reward density at the token level is not the issue; token-level dense rewards have been proposed before. The issue is which tokens to weight. Weighting tokens by the variance of their certainty under different reasoning prefixes acts as a self-supervised filter that recovers the reasoning-bearing dimension without labels.

This connects to L2T's information-theoretic dense process rewards as an alternative dense-signal strategy: L2T scores reasoning steps by their contribution to answer correctness; DRO scores tokens by their sensitivity to reasoning. Both replace uniform averaging with a structure-aware signal; both achieve sample efficiency by concentrating the gradient where it matters.

Original note title

reasoning-reflective tokens are identifiable by high cross-rollout variance under different CoT prefixes — most reference tokens are reasoning-invariant and dilute uniformly-averaged signals