Is distribution selection during RL the same compression mechanism as entropy collapse?

This explores whether two RL failure modes the corpus documents separately — RL picking one pretraining 'format' and burying the rest (distribution selection) vs. policy entropy shrinking toward a performance ceiling (entropy collapse) — are actually the same underlying narrowing, or two different things that look alike.

The corpus suggests they're cousins, not twins: both are RL narrowing the space of things a model will do, but they cut at different levels. Entropy collapse is about *concentration*: probability mass piles onto a few high-reward trajectories until the policy stops exploring, and an empirical law, R = -a·exp(H) + b, ties performance R to policy entropy H, so performance saturates as entropy approaches zero (see "Does policy entropy collapse limit reasoning performance in RL?"). Distribution selection is about *which mode wins*: controlled experiments show that, within the first epoch, RL amplifies a single dominant format that already existed in pretraining while suppressing the alternatives. Strikingly, the winning format tracks model scale, not necessarily which format performs best (see "Does RL training collapse format diversity in pretrained models?"). One is a knob turning down; the other is a winner-take-all election among pre-existing candidates.
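To make the saturation claim concrete, here is a minimal Python sketch of the law R = -a·exp(H) + b. The coefficients `A` and `B` are made-up illustrative values, not fitted numbers from the cited work, and the function names are hypothetical. The point it illustrates is that the law has a hard ceiling of b - a at H = 0, and that the remaining headroom at any entropy is a·(exp(H) - 1).

```python
import math

def predicted_performance(entropy: float, a: float, b: float) -> float:
    """Evaluate the empirical law R = -a * exp(H) + b."""
    return -a * math.exp(entropy) + b

def performance_ceiling(a: float, b: float) -> float:
    """Value of the law at H = 0, where exp(H) = 1."""
    return b - a

def remaining_headroom(entropy: float, a: float, b: float) -> float:
    """Gap to the ceiling; algebraically equal to a * (exp(H) - 1)."""
    return performance_ceiling(a, b) - predicted_performance(entropy, a, b)

# Hypothetical coefficients, chosen only for illustration.
A, B = 0.2, 0.9

for h in (1.0, 0.5, 0.1, 0.0):
    r = predicted_performance(h, A, B)
    gap = remaining_headroom(h, A, B)
    print(f"H={h:.1f}  R={r:.3f}  headroom={gap:.3f}")
```

Under this form, spending entropy is the only way to gain performance, which is why the headroom shrinks to zero as H does.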

What ties them together is that both are sharpening, and the corpus keeps finding the *same* sharpening across very different settings. RL squeezes exploration diversity in search agents through what's described as the same entropy-collapse mechanism seen in reasoning (see "Does reinforcement learning squeeze exploration diversity in search agents?"). Outcome-based RL, which rewards only the final answer, concentrates probability onto correct trajectories so aggressively that the diversity loss spills from solved problems onto ones the model has not yet solved (see "Does outcome-based RL diversity loss spread across unsolved problems?"). That last finding is the bridge: global sharpening (entropy collapse) and the suppression of alternative formats (distribution selection) may be two views of one probability-mass redistribution.
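One way to see how the two could be views of a single redistribution is the chain rule for entropy: total entropy equals the entropy over formats plus the expected entropy within each format. The sketch below is a toy illustration of that identity, not a method taken from the cited notes. The policies, format names, and the `decompose` helper are all hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def decompose(policy):
    """Split total entropy into a format-level and a within-format part.

    `policy` maps a format name to the joint probabilities of the
    trajectories written in that format (all values sum to 1).
    """
    mass = {fmt: sum(ps) for fmt, ps in policy.items()}
    h_format = entropy(mass.values())
    h_within = sum(
        m * entropy([p / m for p in policy[fmt]])
        for fmt, m in mass.items() if m > 0
    )
    h_total = entropy([p for ps in policy.values() for p in ps])
    return h_total, h_format, h_within

# Toy policies: two formats, four trajectories each.
base = {"A": [0.125] * 4, "B": [0.125] * 4}
selected = {"A": [0.25] * 4, "B": [0.0] * 4}            # one format wins
concentrated = {"A": [0.5, 0, 0, 0], "B": [0.5, 0, 0, 0]}  # mass piles up inside each format

for name, policy in [("base", base), ("selected", selected), ("concentrated", concentrated)]:
    total, fmt, within = decompose(policy)
    print(f"{name:13s} total={total:.3f}  format={fmt:.3f}  within={within:.3f}")
```

In this toy setup both changes lower total entropy, but selection drains the format-level term while concentration drains the within-format term, so a single scalar entropy curve cannot tell them apart.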

The most interesting wrinkle is that this narrowing is *structured*, not random. Across seven RL algorithms and ten model families, RL updates only 5–30% of parameters, yet those sparse updates are nearly full-rank and nearly identical across random seeds (see "Does reinforcement learning update only a small fraction of parameters?"). That consistency is what you'd expect if RL is *selecting* a pre-existing structure rather than collapsing arbitrarily, so it lines up better with the distribution-selection picture than with a blind entropy drain.
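As a rough illustration of what "a small fraction of parameters" and "nearly identical across seeds" could mean operationally, the sketch below compares flat weight lists before and after training. It is an assumption-laden toy: the tolerance, the Jaccard overlap as a consistency measure, and the weight values are all hypothetical, and it does not check the full-rank property the note also reports.

```python
def update_mask(before, after, tol=1e-8):
    """True where a parameter moved by more than `tol`."""
    return [abs(a - b) > tol for b, a in zip(before, after)]

def update_fraction(mask):
    """Share of parameters that changed."""
    return sum(mask) / len(mask)

def mask_overlap(mask_a, mask_b):
    """Jaccard overlap of two update masks (1.0 means the same parameters moved)."""
    both = sum(x and y for x, y in zip(mask_a, mask_b))
    either = sum(x or y for x, y in zip(mask_a, mask_b))
    return both / either if either else 1.0

# Toy weights: a shared starting point and two hypothetical RL runs.
base = [0.0] * 10
seed_1 = list(base)
seed_1[2], seed_1[7] = 0.5, -0.25
seed_2 = list(base)
seed_2[2], seed_2[7], seed_2[9] = 0.4, -0.3, 0.1

mask_1 = update_mask(base, seed_1)
mask_2 = update_mask(base, seed_2)
print("updated fraction, seed 1:", update_fraction(mask_1))
print("updated fraction, seed 2:", update_fraction(mask_2))
print("mask overlap across seeds:", mask_overlap(mask_1, mask_2))
```

A high overlap between independently seeded runs is the kind of signal that would favor structured selection over arbitrary collapse.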

And the two diverge where it counts: directionality. Entropy collapse is monotone: entropy goes down, full stop. But preference tuning's diversity effect *reverses* by domain: RLHF reduces diversity in code (which rewards converging on a correct solution) but *increases* it in creative writing (which rewards standing out) (see "Does preference tuning always reduce diversity the same way?"). A pure compression mechanism can't increase diversity; a selection mechanism that picks whatever the reward favors can. That's the cleanest evidence they aren't identical.
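The directional asymmetry can be shown with a toy distribution. In the sketch below, "pure compression" is modeled as raising probabilities to a power greater than one, and "selection" as reweighting by exp(reward / tau). Both are stand-ins chosen for illustration, not the mechanisms measured in the cited notes, and the probabilities, rewards, and parameter values are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def sharpen(probs, beta):
    """Pure compression: raise probabilities to a power beta > 1."""
    return normalize([p ** beta for p in probs])

def tilt(probs, rewards, tau):
    """Reward-driven selection: reweight each output by exp(reward / tau)."""
    return normalize([p * math.exp(r / tau) for p, r in zip(probs, rewards)])

# A peaked base policy, and a reward that favors the rarer outputs
# (a stand-in for a domain that rewards standing out).
base = [0.7, 0.1, 0.1, 0.1]
rewards = [0.0, 1.0, 1.0, 1.0]

print("base entropy:     ", round(entropy(base), 3))
print("sharpened entropy:", round(entropy(sharpen(base, beta=2.0)), 3))
print("tilted entropy:   ", round(entropy(tilt(base, rewards, tau=0.5)), 3))
```

Sharpening pushes entropy below the base value here, while tilting toward the rare outputs pushes it above, which is the behavior a compression-only account cannot produce.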

If you want the broader frame, there's a separate line arguing LLMs are compression machines by nature: they maximize statistical compression where humans preserve nuance for situated meaning (see "Do LLMs compress concepts more aggressively than humans do?"). Read alongside the RL findings, it hints that entropy collapse might be the training-time face of a compression bias the architecture already carries, while distribution selection is what that bias does when a reward signal hands it a target to aim at.

Sources (7 notes)

Does policy entropy collapse limit reasoning performance in RL?

Empirical law R = -a·exp(H) + b shows performance saturates when policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Does outcome-based RL diversity loss spread across unsolved problems?

RL that rewards only final answer correctness sharpens the policy globally, concentrating probability mass on correct trajectories for solved problems while simultaneously reducing diversity on unsolved ones. Historical exploration (training diversity via UCB-style bonuses) and batch exploration (test-time diversity via repetition penalties) require structurally different mechanisms.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsically sparse updates, changing only 5–30% of parameters without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Does preference tuning always reduce diversity the same way?

RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.

Do LLMs compress concepts more aggressively than humans do?

Using Rate-Distortion Theory on cognitive datasets, LLMs capture broad category structure but lose fine-grained distinctions humans preserve. LLMs maximize compression efficiency; humans trade compression for contextual meaning that enables situated action.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an RL mechanicist auditing a tension between two narrowing phenomena in LLMs. The core question remains open: do *entropy collapse* (probability mass concentrating onto high-reward modes, saturating performance) and *distribution selection* (RL amplifying a single pre-existing format while suppressing alternatives) name one causal process or two structurally distinct ones?

What a curated library found — and when (dated claims, not current truth):
- Entropy collapse shows empirical saturation: R = −a·exp(H) + b, where performance plateaus as entropy H→0 (2025-05, arXiv:2505.22617).
- Distribution selection picks a single dominant pretraining format within epoch 1, tracking model scale not reward quality; RL amplifies this winner while suppressing alternatives (2025-04, arXiv:2504.07912).
- RL updates only 5–30% of parameters, yet these updates are nearly full-rank and nearly identical across seeds — suggesting *structural selection* rather than arbitrary collapse (2025-05, arXiv:2505.11711).
- Preference tuning's diversity effect is domain-dependent: RLHF *reduces* diversity in code but *increases* it in creative writing (2025-04, arXiv:2504.12522).
- Outcome-based RL concentrates probability onto correct trajectories so aggressively that diversity loss transfers from solved to unsolved problems (2025-09, arXiv:2509.06941).

Anchor papers (verify; mind their dates):
- arXiv:2505.22617 (2025-05): The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
- arXiv:2504.07912 (2025-04): Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
- arXiv:2505.11711 (2025-05): Reinforcement Learning Finetunes Small Subnetworks in Large Language Models
- arXiv:2505.17117 (2025-05): From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning

Your task:
(1) RE-TEST THE TENSION. For each finding above, ask: has newer tooling (e.g., interventional causality, mechanistic probes, or finer-grained parameter tracking), ensemble or multi-reward training, or adaptive entropy regularization since *relaxed* the monotonicity of entropy collapse or *overturned* the claim that distribution selection is pre-existing? Separate the durable question (are these one mechanism?) from perishable constraints (entropy must decrease; format selection must precede RL). Cite what resolved each.
(2) Surface the strongest *contradicting* work from the last 6 months: papers showing entropy collapse and distribution selection *do* unify under a single principle, or showing they operate at truly orthogonal levels such that conflating them is a category error.
(3) Propose 2 research questions that assume the regime may have shifted: (a) Can you decouple entropy collapse from format selection using reward orthogonalization or multi-objective RL? (b) If RL always selects pre-existing structure, does that imply pretraining itself is doing distribution selection, and are the two processes identical in mechanism but different in time scale?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
