INQUIRING LINE

How do high-entropy tokens concentrate reinforcement learning's effect?

This explores why a small fraction of 'high-entropy' tokens — the moments where a model is genuinely uncertain which word comes next — seem to carry most of what reinforcement learning actually teaches, and what that concentration costs.


RL's learning signal piles up on a minority of uncertain tokens rather than spreading evenly across everything a model generates. The cleanest result in the corpus comes from work showing that only about 20% of tokens carry high entropy, and these are the 'forking points': the decision moments where the model could branch in different reasoning directions. Train on just those tokens and you match or even beat updating on all of them (Do high-entropy tokens drive reasoning model improvements?). The takeaway is counterintuitive: most of the tokens a model emits are low-stakes continuations the policy is already confident about, so the gradient they contribute is near-noise. The learning lives at the branch points.
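To make "train on just those tokens" concrete, here is a minimal sketch of entropy-based masking. It assumes you already have a next-token probability distribution and a per-token loss at each position; the function names, the toy distributions, and the loss values are all hypothetical illustrations, not the method from the cited work.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def forking_mask(entropies, keep_fraction=0.2):
    """Mark the top `keep_fraction` of positions by entropy with 1.0, the rest 0.0."""
    k = max(1, round(len(entropies) * keep_fraction))
    # Indices of the k highest-entropy positions (ties go to the earlier position).
    top = sorted(range(len(entropies)), key=lambda i: -entropies[i])[:k]
    keep = set(top)
    return [1.0 if i in keep else 0.0 for i in range(len(entropies))]

def masked_mean_loss(per_token_losses, mask):
    """Average the per-token loss over the kept positions only."""
    kept = sum(mask)
    return sum(l * m for l, m in zip(per_token_losses, mask)) / kept

# Toy sequence: five positions over a 3-word vocabulary (made-up numbers).
dists = [
    [0.98, 0.01, 0.01],    # confident continuation
    [0.40, 0.35, 0.25],    # forking point: the model is genuinely undecided
    [0.97, 0.02, 0.01],
    [0.90, 0.05, 0.05],
    [0.99, 0.005, 0.005],
]
entropies = [token_entropy(d) for d in dists]
mask = forking_mask(entropies)           # 20% of 5 positions keeps 1 position
losses = [0.1, 1.2, 0.1, 0.3, 0.05]      # hypothetical per-token policy losses
loss = masked_mean_loss(losses, mask)    # only the forking point contributes
```

In this toy case only the second position survives the mask, so the update is driven entirely by the one token where the model was undecided.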

That picture sharpens when you look at where those branch points actually sit in a reasoning trace. One study tracking RL across eight models found that training splits into two phases: first the model nails execution (the mechanical steps), then the bottleneck shifts to strategy and planning. Crucially, planning-token entropy rises while execution-token entropy settles, and concentrating optimization on those high-entropy planning tokens is where the late-stage gains come from (Does RL training follow a predictable two-phase learning sequence?). So 'high-entropy tokens' aren't a fixed set. They migrate to wherever the model is still genuinely deciding something, and RL chases them there.
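One way to watch that migration is to average token entropy separately for planning and execution tokens at different checkpoints. The sketch below assumes each token already carries a "plan" or "exec" label; how the study assigns those labels is not described here, and the labels and entropy values are invented for illustration.

```python
from collections import defaultdict

def mean_entropy_by_kind(entropies, kinds):
    """Average per-token entropy within each token category."""
    totals, counts = defaultdict(float), defaultdict(int)
    for h, kind in zip(entropies, kinds):
        totals[kind] += h
        counts[kind] += 1
    return {kind: totals[kind] / counts[kind] for kind in totals}

kinds = ["plan", "exec", "exec", "plan"]  # hypothetical labels for four tokens

# Made-up entropies for the same four tokens at an early and a late checkpoint.
early = mean_entropy_by_kind([0.9, 0.8, 0.7, 0.9], kinds)
late = mean_entropy_by_kind([1.2, 0.1, 0.2, 1.0], kinds)

for kind in ("plan", "exec"):
    print(f"{kind}: early={early[kind]:.2f} late={late[kind]:.2f}")
```

With these toy numbers the planning average rises while the execution average falls, which is the pattern the two-phase account describes.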

But concentration has a dark side, and this is the part you might not expect. The same entropy that marks the productive forking points is exactly what RL tends to destroy. As training proceeds, policy entropy collapses toward zero and performance hits a predictable ceiling described by an empirical exponential law, R = -a·exp(H) + b, where H is policy entropy: the model converges on a narrow set of reward-maximizing moves and stops exploring (Does policy entropy collapse limit reasoning performance in RL?). The same squeeze shows up in search agents, where RL compresses behavioral diversity through the identical mechanism, while plain supervised fine-tuning on varied demonstrations keeps exploration broad (Does reinforcement learning squeeze exploration diversity in search agents?). So RL concentrates its effect on high-entropy tokens and, in doing so, tends to spend down the very entropy it depends on.
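The ceiling falls out of the empirical law recorded in the entropy-collapse note, R = -a·exp(H) + b: as entropy H approaches zero, exp(H) approaches 1 and performance saturates at b - a. A small sketch follows, with coefficients chosen purely for illustration (they are assumptions, not fitted values from any experiment).

```python
import math

def predicted_reward(entropy, a, b):
    """Empirical law from the entropy-collapse note: R = -a * exp(H) + b."""
    return -a * math.exp(entropy) + b

a, b = 0.2, 0.9  # hypothetical coefficients, not fitted to any data

# As H -> 0, exp(H) -> 1, so performance saturates at b - a.
ceiling = predicted_reward(0.0, a, b)

for h in (1.0, 0.5, 0.1, 0.0):
    print(f"H={h:.1f}  R={predicted_reward(h, a, b):.3f}")
```

Each step down in entropy buys less additional reward than the one before, and once entropy is exhausted there is nothing left to trade for performance.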

That tension is why a cluster of methods tries to feed richer signal into those decision points instead of just rewarding the final answer. Natural-language critiques can break performance plateaus that numerical rewards can't, because a scalar tells the model it failed but not why or where (Can natural language feedback overcome numerical reward plateaus?). Other work converts tokenized environment feedback into dense per-token credit, so the gradient knows which specific decisions went wrong (Can environment feedback replace scalar rewards in policy learning?). And reinforcement pre-training reframes ordinary next-token prediction itself as a reasoning task with verifiable rewards, essentially building high-entropy decision structure into the model from the start, so later RL fine-tuning has better forking points to sharpen (Can next-token prediction become a reasoning task with RL?).

Put together, the corpus tells a single story with a built-in catch. RL works by concentrating on the handful of tokens where the model is still genuinely uncertain, and those tokens shift toward strategy as training matures. The open problem is doing this without collapsing the uncertainty that made those tokens worth learning from in the first place.


Sources 7 notes

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Does policy entropy collapse limit reasoning performance in RL?

Empirical law R = -a·exp(H) + b shows performance saturates when policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Can environment feedback replace scalar rewards in policy learning?

SDPO converts tokenized environment feedback into dense gradient signals by using the feedback-conditioned policy as a self-teacher. The policy, when given retrospective evidence of its mistakes in-context, implicitly acts as its own process reward model, making external reward signals unnecessary.

Can next-token prediction become a reasoning task with RL?

Reinforcement Pre-Training transforms next-token prediction into a reasoning task by providing verifiable rewards from the corpus itself, eliminating reward hacking and enabling inference-time scaling during pretraining. This suggests token-level reasoning patterns during pretraining strengthen downstream RL fine-tuning.
