INQUIRING LINE

Do high-entropy RLVR tokens correspond to MI-peak tokens during inference?

This asks whether the small set of high-entropy 'forking' tokens that RLVR training targets are the same tokens that carry the most information (mutual-information peaks) when a model is actually generating an answer — and the corpus has strong material on the first half but only circumstantial material on the correspondence itself.


This explores a two-part claim: (1) that RLVR concentrates its learning on a minority of high-entropy tokens, and (2) that those same tokens are the high-information decision points during inference. On the first half, the corpus is direct and emphatic. Only about 20% of tokens show high entropy, and those tokens act as pivotal reasoning decision points, the places where a reasoning trace could branch one way or another. Training on just that 20% matches or even beats full-gradient training, which means the minority is where the actual learning signal lives (see: Do high-entropy tokens drive reasoning model improvements?). So the 'forking point' framing is well supported.
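As a rough illustration of what "the high-entropy 20%" means operationally, the sketch below computes Shannon entropy per generated position and flags the top fraction. The function names, the toy distributions, and the rank-based cutoff are assumptions made for illustration; the source note does not say how its threshold was chosen.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def high_entropy_mask(entropies, fraction=0.2):
    """Flag the top `fraction` of positions by entropy (hypothetical cutoff rule)."""
    n = len(entropies)
    if n == 0:
        return []
    k = max(1, round(fraction * n))
    # sorted() is stable, so ties keep their original left-to-right order.
    ranked = sorted(range(n), key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(n)]

# Toy next-token distributions over a 4-token vocabulary, one per position.
dists = [
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],   # a genuine fork: the model is maximally unsure
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.005, 0.003, 0.002],
    [0.85, 0.05, 0.05, 0.05],
]
entropies = [token_entropy(d) for d in dists]
mask = high_entropy_mask(entropies, fraction=0.2)
print(mask)  # only position 1 (the uniform distribution) is flagged
```

In a training setup, a mask like this would presumably gate which token positions contribute to the policy-gradient loss, but that wiring is not shown here and is not described in the corpus.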

The second half, whether high entropy lines up with mutual-information peaks at inference, isn't measured head-on in this collection, but several notes circle the same territory under different vocabulary. The throughline across the RLVR work here is that the method doesn't teach new reasoning; it re-weights and sharpens behaviors already latent in the base model. RLVR improves sampling efficiency without expanding the reasoning boundary (see: Does RLVR actually expand what models can reason about?), activates pretraining strategies rather than installing them (see: What does reward learning actually do to model reasoning?), and tends to amplify one dominant pretraining format while collapsing the alternatives (see: Does RL training collapse format diversity in pretrained models?). If RLVR works by concentrating probability mass at exactly the branch points where the outcome is still uncertain, then high-entropy tokens and high-information tokens would be describing the same junctions from two angles: entropy is the model's uncertainty there, and mutual information is how much the eventual answer depends on which way it forks.
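To make the two quantities concrete, here is a minimal sketch that computes both from a toy joint distribution over (token choice, final answer). One fact worth keeping in mind is that the mutual information between a token choice and the answer can never exceed the entropy of that choice, so high entropy is necessary for high information but not sufficient. The joint tables below are invented for illustration and are not taken from any note in the corpus.

```python
import math

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def mutual_information(joint):
    """I(X; A) in bits, where joint[x][a] = P(token choice x, final answer a)."""
    px = [sum(row) for row in joint]
    pa = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for x, row in enumerate(joint):
        for a, p in enumerate(row):
            if p > 0.0:
                mi += p * math.log2(p / (px[x] * pa[a]))
    return mi

def choice_entropy(joint):
    """Entropy of the token choice alone (marginal over answers)."""
    return entropy([sum(row) for row in joint])

# A coin flip between two interchangeable phrasings: the answer ignores it.
stylistic = [[0.25, 0.25],
             [0.25, 0.25]]
# A skewed choice that fully determines the answer.
decisive = [[0.9, 0.0],
            [0.0, 0.1]]

h_sty, mi_sty = choice_entropy(stylistic), mutual_information(stylistic)
h_dec, mi_dec = choice_entropy(decisive), mutual_information(decisive)
print(h_sty, mi_sty)  # 1 bit of entropy, 0 bits of information about the answer
print(h_dec, mi_dec)  # about 0.47 bits of each: all of the entropy is informative
```

The stylistic position has the higher entropy but carries no information about the answer, which is the kind of mismatch a direct measurement would need to rule out. Estimating such a joint table for a real model would require sampling many completions per token choice, which is outside the scope of this sketch.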

That's a clean story, but the corpus also gives you reasons to be careful about assuming the correspondence is tight. RLVR can improve the local coherence of a trace, with fewer logical errors between adjacent steps, without making the global proof valid (see: Does RLVR actually improve mathematical reasoning or just coherence?). In other words, the model can get more confident (lower entropy) at exactly the steps that matter most for the answer (high mutual information) while still being wrong. Entropy reduction and genuine information gain can decouple. And when the training signal is poorly chosen (overly hard samples, contaminated rewards), the high-advantage tokens RLVR latches onto can be accidental shortcuts rather than real forking decisions (see: Do overly hard RLVR samples actually harm model capabilities?), which would put the high-entropy tokens and the truly informative tokens in different places.
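The point about accidental shortcuts can be seen in a small worked example of group-relative advantage normalization, which the hard-samples note names as the mechanism. The sketch assumes a GRPO-style formula (reward minus group mean, divided by group standard deviation); the use of population standard deviation, the `eps` term, and the reward values are assumptions for illustration only.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed convention
    return [(r - mean) / (std + eps) for r in rewards]

# Eight rollouts per prompt, binary verifiable reward (hypothetical values).
hard_group = [1.0] + [0.0] * 7          # one lucky success on a near-impossible problem
medium_group = [1.0] * 4 + [0.0] * 4    # a problem the model solves half the time

adv_hard = group_relative_advantages(hard_group)
adv_medium = group_relative_advantages(medium_group)

# The single lucky success receives a larger advantage than any success
# in the medium group, so whatever tokens produced it are reinforced hardest.
print(adv_hard[0], adv_medium[0])
```

Under these assumptions the rare success gets an advantage of roughly the square root of 7, against 1 for each success in the balanced group, regardless of whether the lucky trajectory reasoned soundly.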

There's a useful adjacent signal too: calibrated token-probability uncertainty turns out to be a more reliable guide than external heuristics for deciding when a model needs help, e.g. when to retrieve (see: Can simple uncertainty estimates beat complex adaptive retrieval?). That's indirect evidence that per-token uncertainty really does track the consequential moments in a generation, which is the bet underlying any high-entropy-equals-high-information argument.
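A minimal sketch of an uncertainty-gated retrieval decision follows, assuming the model exposes log-probabilities for the tokens it emitted. The scoring rule (mean negative log-probability), the threshold of 0.5 nats, and the function names are all hypothetical; the cited note does not specify its estimator here, and no calibration step is shown.

```python
import math

def mean_token_uncertainty(token_logprobs):
    """Average negative log-probability (nats) of the emitted tokens."""
    return -sum(token_logprobs) / len(token_logprobs)

def should_retrieve(token_logprobs, threshold=0.5):
    """Trigger retrieval when the draft answer looks uncertain (assumed rule)."""
    return mean_token_uncertainty(token_logprobs) > threshold

# Toy per-token probabilities for two draft answers.
confident = [math.log(p) for p in (0.95, 0.90, 0.97)]
unsure = [math.log(p) for p in (0.60, 0.30, 0.50)]

print(should_retrieve(confident))  # False
print(should_retrieve(unsure))     # True
```

A single threshold on one cheap statistic is the whole decision procedure, which is why this kind of approach needs far fewer model and retriever calls than a multi-step adaptive pipeline.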

The honest bottom line: the corpus strongly establishes that high-entropy tokens are the load-bearing tokens for RLVR, and it makes the entropy↔information correspondence plausible as a mechanism. But it does not contain a note that directly measures mutual information at inference and aligns it against the high-entropy set, and it actively warns that confidence and informativeness can come apart. If you want to chase this further, the forking-points note (Do high-entropy tokens drive reasoning model improvements?) is the place the correspondence is implicitly assumed, and the coherence-vs-validity note (Does RLVR actually improve mathematical reasoning or just coherence?) is the place it might break.


Sources (7 notes)

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does RLVR actually improve mathematical reasoning or just coherence?

RLVR post-training measurably reduces logical errors between adjacent reasoning steps, but locally coherent traces can still be globally invalid proofs. The improvement is structural rather than semantic.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Can simple uncertainty estimates beat complex adaptive retrieval?

Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.
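The pass@k comparison described in the second note above (Does RLVR actually expand what models can reason about?) can be illustrated with the standard unbiased estimator, pass@k = 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. Whether that note used exactly this estimator is an assumption, and the per-problem counts below are invented to show the crossover, not reported results.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct."""
    if k > n:
        raise ValueError("k must not exceed n")
    # comb() returns 0 when fewer than k incorrect samples exist, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

n = 100
# Hypothetical correct-sample counts on two problems.
base_counts = [10, 4]   # base model: rarely right, but sometimes right on both
rlvr_counts = [80, 0]   # sharpened model: reliable on one, never right on the other

for k in (1, 64):
    base = mean_pass_at_k(base_counts, n, k)
    rlvr = mean_pass_at_k(rlvr_counts, n, k)
    print(f"k={k}: base={base:.3f} rlvr={rlvr:.3f}")
```

With these toy counts the sharpened model wins at k=1 and the base model wins at k=64, which is the shape of result the note describes: better sampling efficiency without a wider set of solvable problems.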

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a mechanistic interpreter evaluating whether high-entropy tokens during RLVR training align with mutual-information peaks at inference. The question remains open: does entropy concentration map onto decision-critical information?

What a curated library found — and when (dated claims, not current truth):
Findings span 2025–2026. The library is emphatic on the first half of the claim:
- Only ~20% of tokens show high entropy during RLVR; those tokens are "forking points" where reasoning traces branch, and training on just that 20% matches or exceeds full-gradient baselines (2025-06, arXiv:2506.01939).
- RLVR does not expand reasoning capacity beyond the base model; it re-weights and sharpens latent behaviors, amplifying one pretraining format while collapsing alternatives (2025-04, arXiv:2504.13837; 2025-04, arXiv:2504.07912).
- Per-token uncertainty calibration outperforms heuristic signals for adaptive decisions like retrieval (2025-01, arXiv:2501.12835), suggesting entropy tracks consequential moments.
- BUT: RLVR can improve local trace coherence (fewer adjacent step errors, lower entropy) without guaranteeing global validity—confidence and genuine informativeness decouple (2025-10, arXiv:2510.18176).
- Poorly chosen training signals (hard samples, contaminated rewards) cause high-advantage tokens to be accidental shortcuts rather than true forking decisions (2026-05, arXiv:2605.28388).

Anchor papers (verify; mind their dates):
- arXiv:2506.01939 (2025-06): High-Entropy Minority Tokens Drive RLVR
- arXiv:2504.13837 (2025-04): Does RL Incentivize Reasoning Beyond Base?
- arXiv:2510.18176 (2025-10): Local Coherence vs. Global Validity
- arXiv:2501.12835 (2025-01): Uncertainty & Adaptive Retrieval

Your task:
(1) RE-TEST THE ENTROPY↔MI CORRESPONDENCE. The library establishes high-entropy tokens drive RLVR efficiency but does NOT directly measure mutual information at inference or align it to the high-entropy set. For each finding, separate durable from perishable: Is the 20% minority claim still tight, or have improved training regimes broadened the load-bearing set? Has mechanistic work since isolated whether entropy truly tracks decision criticality at inference, or does confidence-informativeness decoupling persist? Flag where the correspondence still appears to assume rather than measure.
(2) SURFACE THE STRONGEST TENSIONS. The library itself flags the break: RLVR can reduce entropy at steps that matter most for the answer while being wrong globally. Has any work since 2025-10 reconciled this (e.g., by proposing an information-theoretic measure that captures both local coherence and global validity)? Or does the field still treat them as orthogonal?
(3) Propose 2 research questions assuming the regime may have shifted: (a) Do newer training objectives (e.g., outcome-weighted RL, outcome supervision) tighten or dissolve the entropy↔MI link? (b) Can token-level MI be estimated from modern LLM internals (e.g., via causal intervention or attention flow) in a way that empirically grounds the forking-point hypothesis?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
