Why does policy entropy collapse predict sigmoid saturation points?
This explores why a model's performance curve flattens out exactly when its exploration runs dry — the link between shrinking policy entropy in RL and the predictable ceiling where extra training stops paying off.
The cleanest answer in the corpus is an empirical law: performance follows R = -a·exp(H) + b, where H is policy entropy and a and b are fitted coefficients. As entropy falls toward zero, exp(H) falls toward 1 and R presses up against a ceiling of b - a. That is the saturation: not a coincidence but a mathematical consequence of entropy being the fuel that the curve burns. When the policy stops exploring, the reachable upside is already priced in (see: Does policy entropy collapse limit reasoning performance in RL?).
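To make the law concrete, here is a minimal sketch that evaluates R = -a·exp(H) + b at a few entropy levels. The coefficients are hypothetical values chosen for illustration, not fits from any cited note. Evaluating at H = 0 gives R = b - a, since exp(0) = 1, so the headroom still available at entropy H is a·(exp(H) - 1).

```python
import math

def predicted_reward(entropy: float, a: float, b: float) -> float:
    """Empirical law from the text: R = -a * exp(H) + b."""
    return -a * math.exp(entropy) + b

def reward_ceiling(a: float, b: float) -> float:
    """Value of the law at H = 0, where exp(H) = 1, i.e. b - a."""
    return predicted_reward(0.0, a, b)

def remaining_headroom(entropy: float, a: float, b: float) -> float:
    """Reward still available if entropy were spent all the way to zero."""
    return reward_ceiling(a, b) - predicted_reward(entropy, a, b)

# Hypothetical coefficients, for illustration only.
A, B = 0.2, 0.9

for h in (1.0, 0.5, 0.1, 0.0):
    r = predicted_reward(h, A, B)
    gap = remaining_headroom(h, A, B)
    print(f"H={h:.1f}  R={r:.3f}  headroom={gap:.3f}")
```

The headroom shrinks to zero as H does, which is the flattening the text describes: once entropy is gone, the formula has nothing left to convert into reward.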
The mechanism is worth making concrete: RL rewards push a policy to concentrate probability on whatever strategy is currently winning. Each update narrows the distribution, entropy drops, and the model samples fewer distinct attempts. Early on this is pure gain: you're cutting bad behaviors. But the same force that sharpens you also strands you, because once entropy is spent you can no longer stumble onto a better strategy than the one you've converged to. Interventions like Clip-Cov, KL-Cov, and GPPO are explicitly attempts to ration that fuel, slowing entropy reduction so the saturation point arrives later and higher (see: Does policy entropy collapse limit reasoning performance in RL?).
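The narrowing can be seen in a toy setting. The sketch below applies the exact expected policy gradient to a softmax policy on a three-armed bandit and records entropy before each update. The bandit, its rewards, the step count, and the learning rate are all assumptions for illustration; this is plain policy-gradient ascent, not an implementation of Clip-Cov, KL-Cov, or GPPO.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def train(rewards, steps=500, lr=1.0):
    """Gradient ascent on expected reward for a softmax policy.

    For J = sum_i pi_i * r_i, the gradient w.r.t. logit i is pi_i * (r_i - J).
    Returns the final action probabilities and the entropy before each update.
    """
    logits = [0.0] * len(rewards)  # start from the uniform policy
    entropies = []
    for _ in range(steps):
        probs = softmax(logits)
        entropies.append(entropy(probs))
        expected = sum(p * r for p, r in zip(probs, rewards))
        logits = [t + lr * p * (r - expected)
                  for t, p, r in zip(logits, probs, rewards)]
    return softmax(logits), entropies

# Hypothetical rewards: arm 0 is best, arm 1 is a near miss, arm 2 is poor.
final_probs, entropies = train([1.0, 0.8, 0.2])
print("entropy at start:", round(entropies[0], 3))
print("entropy at end:  ", round(entropies[-1], 3))
print("final policy:    ", [round(p, 3) for p in final_probs])
```

In this toy case the best arm is known from the start, so concentrating is harmless. The stranding described above corresponds to the case where a better arm exists but is sampled too rarely to ever be reinforced.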
What makes this more than a one-paper curiosity is that the same collapse shows up in domains that share none of the original vocabulary. Search agents trained with RL squeeze their behavioral diversity in exactly the way reasoning models do, with policies piling onto narrow reward-maximizing routines, and the fix is the same: supervised fine-tuning on diverse demonstrations re-injects the exploration breadth that RL drains (see: Does reinforcement learning squeeze exploration diversity in search agents?). So the sigmoid ceiling isn't about reasoning per se; it's a property of reward-maximizing training under finite exploration.
A few notes hint at why some plateaus aren't truly terminal, which is the same as saying the saturation ceiling can sometimes be lifted. When numerical rewards stall, chain-of-thought critiques can unstick a model, because the plateau was partly an information problem: scalar rewards never told the model *why* it failed (see: Can natural language feedback overcome numerical reward plateaus?). Relatedly, RL training moves through phases: execution correctness saturates first, then strategic planning becomes the binding constraint, and planning-token entropy actually *rises* while execution entropy stabilizes (see: Does RL training follow a predictable two-phase learning sequence?). A single global entropy number can therefore hide where the remaining exploration still lives.
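One way to see how a global number hides the split is to average token-level entropy per group instead of over all tokens. The sketch below assumes each token already carries a "planning" or "execution" label and a next-token distribution. Both the labeling scheme and the example distributions are hypothetical; the notes do not specify how tokens are categorized.

```python
import math
from collections import defaultdict

def token_entropy(probs):
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy_by_group(token_dists, groups):
    """Average per-token entropy within each label in `groups`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for probs, group in zip(token_dists, groups):
        totals[group] += token_entropy(probs)
        counts[group] += 1
    return {g: totals[g] / counts[g] for g in totals}

# Hypothetical rollout: two uncertain planning tokens, four confident execution tokens.
dists = [
    [0.5, 0.5],    # planning
    [0.99, 0.01],  # execution
    [0.99, 0.01],  # execution
    [0.5, 0.5],    # planning
    [0.99, 0.01],  # execution
    [0.99, 0.01],  # execution
]
labels = ["planning", "execution", "execution",
          "planning", "execution", "execution"]

per_group = mean_entropy_by_group(dists, labels)
global_mean = sum(token_entropy(d) for d in dists) / len(dists)
print("per group:", {g: round(v, 3) for g, v in per_group.items()})
print("global:   ", round(global_mean, 3))
```

Because execution tokens outnumber planning tokens here, the global mean sits much closer to the execution value, even though the planning tokens are still highly uncertain.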
The thing you might not have known you wanted to know: entropy collapse is also why RL feels structurally 'cheap.' It rewrites only 5–30% of parameters, and those sparse updates are nearly identical across random seeds. The policy isn't exploring a wide space and landing somewhere idiosyncratic; it's funneling toward the same narrow basin every time (see: Does reinforcement learning update only a small fraction of parameters?). The convergence that produces saturation and the convergence that produces near-deterministic, sparse parameter changes are two views of the same shrinking distribution.
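As a rough sketch of how such a claim could be checked, the code below measures what fraction of parameters changed between two checkpoints and how much the changed sets overlap across two seeds. The flat-list representation, the tolerance parameter, the Jaccard overlap metric, and the toy weights are all assumptions for illustration; the cited note does not say which measurement it used.

```python
def changed_indices(before, after, tol=0.0):
    """Indices where a parameter moved by more than `tol`."""
    return {i for i, (x, y) in enumerate(zip(before, after)) if abs(x - y) > tol}

def update_fraction(before, after, tol=0.0):
    """Share of parameters that changed between two checkpoints."""
    return len(changed_indices(before, after, tol)) / len(before)

def mask_overlap(mask_a, mask_b):
    """Jaccard overlap between two sets of changed indices."""
    union = mask_a | mask_b
    return len(mask_a & mask_b) / len(union) if union else 1.0

# Hypothetical 10-parameter model and two RL runs with different seeds.
base = [0.0] * 10
seed_1 = list(base)
seed_1[2], seed_1[5] = 0.3, -0.1
seed_2 = list(base)
seed_2[2], seed_2[5], seed_2[7] = 0.2, -0.2, 0.05

mask_1 = changed_indices(base, seed_1)
mask_2 = changed_indices(base, seed_2)
print("fraction updated, seed 1:", update_fraction(base, seed_1))
print("fraction updated, seed 2:", update_fraction(base, seed_2))
print("overlap of updated sets: ", round(mask_overlap(mask_1, mask_2), 3))
```

A high overlap together with a low update fraction is the pattern the paragraph describes: different seeds touching mostly the same small set of parameters.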
Sources (5 notes)
Empirical law R = -a·exp(H) + b shows performance saturates when policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.
RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.
Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.
Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.
Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.