What causes policy entropy collapse in reasoning-focused reinforcement learning?
This explores why reasoning-focused RL training tends to lose exploratory diversity — the policy's entropy collapsing toward zero — and what drives that, drawing on the corpus's empirical laws, mechanistic accounts, and counter-interventions.
This explores why reasoning-focused RL training tends to lose its exploratory diversity — the policy's entropy collapsing toward zero — and what's actually driving it. The clearest answer in the corpus is that entropy collapse isn't a side effect; it's the main bottleneck. There's a strikingly tidy empirical law, R = -a·exp(H) + b, showing that reasoning performance saturates exactly as policy entropy approaches zero — meaning a model trades away its exploratory capacity for reward, and once that capacity is gone, performance hits a ceiling it can't climb past Does policy entropy collapse limit reasoning performance in RL?. The mechanism is convergence: RL pushes the policy to repeatedly exploit whatever narrow strategy maximizes reward, and the distribution sharpens around those few moves until it can no longer try anything new.
The cause becomes more concrete when you look at *which* tokens carry the entropy. Only about 20% of tokens are high-entropy 'forking points' — the genuine decision moments where reasoning could branch one way or another — and RL almost exclusively adjusts these Do high-entropy tokens drive reasoning model improvements?. Entropy collapse, read through that lens, is the flattening of these forks: the model stops hesitating at the junctions where exploration used to happen. That's also why it isn't unique to math reasoning — search agents trained with RL show the same compression of behavioral diversity, converging on narrow reward-maximizing strategies through the identical mechanism, while SFT on diverse demonstrations keeps the exploration breadth alive Does reinforcement learning squeeze exploration diversity in search agents?.
A deeper structural cause is hinted at by the discovery that numerical rewards are simply *information-poor*. A scalar reward tells the model whether it succeeded but nothing about *why* it failed or how to improve — so the optimization has no signal pointing toward alternative strategies, only toward sharpening the one that currently scores. When models stuck on a plateau are instead given chain-of-thought critiques in natural language, they escape it, which suggests the collapse is partly starvation: the reward channel is too thin to sustain exploration Can natural language feedback overcome numerical reward plateaus?.
Usefully, the corpus reframes collapse as one half of a paired failure. Training-time entropy collapse and test-time variance inflation are 'dual' problems — both spring from a broken exploration–exploitation balance, but at different timescales, and a fix for one (entropy bonuses, critique diversity during training) won't touch the other Why do reasoning models fail differently at training versus inference?. That's the non-obvious part: managing entropy isn't a single dial. And the interventions that work — Clip-Cov, KL-Cov, GPPO — all operate by deliberately slowing the rate at which entropy is allowed to drain, rather than by changing the reward Does policy entropy collapse limit reasoning performance in RL?.
If you want to go further, the corpus also complicates the simple 'collapse is bad' story: RL training moves through two phases, and entropy doesn't uniformly fall — execution-token entropy stabilizes while *planning*-token entropy actually rises as strategic exploration becomes the new bottleneck Does RL training follow a predictable two-phase learning sequence?. So the real picture is less 'entropy always collapses' and more 'entropy collapses where the model has stopped needing to explore, and the open question is keeping it alive where exploration still matters.'
Sources 6 notes
Empirical law R = -a·exp(H) + b shows performance saturates when policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.
Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.
Both failures stem from failed exploration-exploitation balance but occur at different timescales requiring structurally distinct interventions. Training-time fixes (entropy bonuses, critique diversity) cannot prevent inference-time variance inflation, and vice versa; both loops must be managed independently.
Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.