Is distribution selection during RL the same compression mechanism as entropy collapse?
This explores whether two RL failure modes the corpus documents separately — RL picking one pretraining 'format' and burying the rest (distribution selection) vs. policy entropy shrinking toward a performance ceiling (entropy collapse) — are actually the same underlying narrowing, or two different things that look alike.
The corpus suggests distribution selection and entropy collapse are cousins, not twins: both are RL narrowing the space of things a model will do, but they cut at different levels. Entropy collapse is about *concentration*: probability mass piles onto a few high-reward trajectories until the policy stops exploring, and an empirical law, R = -a·exp(H) + b (R is performance, H is policy entropy, a and b are fitted coefficients), shows performance saturating as entropy approaches zero (Does policy entropy collapse limit reasoning performance in RL?). Distribution selection is about *which mode wins*: controlled experiments show that, within the first epoch, RL amplifies a single dominant format that already existed in pretraining while suppressing the alternatives. Strikingly, the winning format tracks model scale, not necessarily which format performs best (Does RL training collapse format diversity in pretrained models?). One is a knob turning down; the other is a winner-take-all election among pre-existing candidates.
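To make the entropy law concrete, here is a minimal Python sketch of R = -a·exp(H) + b. The coefficient values `A` and `B`, the two toy distributions, and the function names are all assumptions chosen for illustration, not fitted values or code from the cited note. It only shows the shape of the relationship: as entropy H falls toward zero, predicted performance approaches the ceiling b - a.

```python
import math

def shannon_entropy(probs):
    """Entropy in nats of a discrete distribution (zero-probability terms skipped)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def predicted_performance(entropy, a, b):
    """Evaluate the empirical form R = -a * exp(H) + b."""
    return -a * math.exp(entropy) + b

# Hypothetical coefficients; real values would have to be fitted per model and task.
A, B = 0.2, 1.0

# Toy next-action distributions: one still exploring, one nearly collapsed.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]

for name, dist in [("uniform", uniform), ("peaked", peaked)]:
    h = shannon_entropy(dist)
    r = predicted_performance(h, A, B)
    print(f"{name}: H = {h:.3f} nats, predicted R = {r:.3f}")

# At H = 0 the law gives R = b - a, the ceiling the policy saturates toward.
ceiling = predicted_performance(0.0, A, B)
print(f"ceiling at H = 0: {ceiling:.3f}")
```

Under this form, a policy whose entropy is already near zero has almost nothing left to trade for performance, which is the sense in which collapse imposes a ceiling.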
What ties them together is that both are sharpening, and the corpus keeps finding the *same* sharpening across very different settings. RL squeezes exploration diversity in search agents through what's described as the same entropy-collapse mechanism seen in reasoning (Does reinforcement learning squeeze exploration diversity in search agents?). Outcome-based RL, which rewards only the final answer, concentrates probability onto correct trajectories so aggressively that the diversity loss spills over from solved problems onto ones the model hasn't cracked yet (Does outcome-based RL diversity loss spread across unsolved problems?). That last finding is the bridge: global sharpening (entropy collapse) and the suppression of alternative formats (distribution selection) may be two views of one probability-mass redistribution.
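One way to see how the two descriptions could be views of a single redistribution is a toy sharpening operation. The sketch below is an assumption made for illustration, not a mechanism taken from the cited notes: it raises each probability to a power `beta` greater than 1 and renormalizes, standing in for RL pushing mass toward outputs that were already likely. The three "formats" and their starting probabilities are made up.

```python
import math

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def sharpen(probs, beta):
    """Raise each probability to the power beta and renormalize.

    beta > 1 moves mass toward outcomes that were already more likely.
    """
    powered = [p ** beta for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

# Hypothetical shares of three output formats inherited from pretraining.
formats = [0.5, 0.3, 0.2]
sharpened = sharpen(formats, beta=3.0)

print("before:", formats, "entropy:", round(entropy(formats), 3))
print("after: ", [round(p, 3) for p in sharpened], "entropy:", round(entropy(sharpened), 3))
```

In this toy case a single operation both lowers entropy and raises the leading format's share. It also preserves the original ranking, so on its own it can't explain a winner that depends on model scale rather than on which format started ahead.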
The most interesting wrinkle is that this narrowing is *structured*, not random. Across seven RL algorithms and ten model families, RL updates only 5–30% of parameters, yet those sparse updates are nearly full-rank and nearly identical across random seeds (Does reinforcement learning update only a small fraction of parameters?). That consistency is what you'd expect if RL is *selecting* a pre-existing structure rather than collapsing arbitrarily, so it fits the distribution-selection picture better than a blind entropy drain.
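The two measurements behind that claim, how many parameters moved and how much the moved sets agree across seeds, can be sketched in a few lines of Python. Everything here is an illustrative assumption: the function names, the tolerance `tol`, the use of Jaccard overlap as the agreement measure, and the ten-element toy "checkpoints". The cited note's actual procedure may differ, and the sketch doesn't attempt the rank analysis.

```python
def update_mask(before, after, tol=1e-8):
    """Mark parameters that moved by more than tol (tol is an assumed threshold)."""
    return [abs(a - b) > tol for b, a in zip(before, after)]

def update_fraction(mask):
    """Fraction of parameters flagged as updated."""
    return sum(mask) / len(mask)

def mask_overlap(mask_a, mask_b):
    """Jaccard overlap between two sets of updated parameters."""
    both = sum(1 for x, y in zip(mask_a, mask_b) if x and y)
    either = sum(1 for x, y in zip(mask_a, mask_b) if x or y)
    return both / either if either else 1.0

# Toy flattened weights: a base model and two RL runs with different seeds.
base = [0.0] * 10
seed_1 = list(base)
seed_1[1], seed_1[4] = 0.3, -0.2
seed_2 = list(base)
seed_2[1], seed_2[4], seed_2[7] = 0.25, -0.1, 0.05

mask_1 = update_mask(base, seed_1)
mask_2 = update_mask(base, seed_2)

print("updated fraction, seed 1:", update_fraction(mask_1))
print("updated fraction, seed 2:", update_fraction(mask_2))
print("overlap across seeds:", round(mask_overlap(mask_1, mask_2), 3))
```

A high overlap between independently seeded runs is the kind of signal that would point to structured selection; an overlap near what random masks of the same size would produce would point the other way.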
And the two diverge where it counts: directionality. Entropy collapse runs one way: entropy goes down. But preference tuning's diversity effect *reverses* by domain: RLHF reduces lexical-syntactic diversity in code (which rewards converging on correct solutions) but *increases* it in creative writing (which rewards standing out) (Does preference tuning always reduce diversity the same way?). A pure compression mechanism can't increase diversity; a selection mechanism that picks whatever the reward favors can. That's the cleanest evidence they aren't identical.
For the broader frame, a separate line of work argues LLMs are compression machines by nature: they maximize statistical compression where humans preserve nuance for situated meaning (Do LLMs compress concepts more aggressively than humans do?). Read alongside the RL findings, it hints that entropy collapse might be the training-time face of a compression bias the model already carries, while distribution selection is what that bias does when a reward signal hands it a target to aim at.
Sources (7 notes)
Empirical law R = -a·exp(H) + b shows performance saturates when policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.
RL that rewards only final answer correctness sharpens the policy globally, concentrating probability mass on correct trajectories for solved problems while simultaneously reducing diversity on unsolved ones. Historical exploration (training diversity via UCB-style bonuses) and batch exploration (test-time diversity via repetition penalties) require structurally different mechanisms.
Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.
RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.
Using Rate-Distortion Theory on cognitive datasets, LLMs capture broad category structure but lose fine-grained distinctions humans preserve. LLMs maximize compression efficiency; humans trade compression for contextual meaning that enables situated action.