How does representational convergence differ from policy entropy collapse in iterative training?

This explores two things that both look like 'the model narrowing down' during repeated training rounds, but aren't the same: policy entropy collapse is the action distribution losing its spread of choices, while representational convergence is the model settling onto one internal format or representation among several it could have used.

Two failure-adjacent dynamics both look like 'the model narrowing' during iterative training, but they operate on different layers. Policy entropy collapse is about *behavior*: the distribution over what the model does. As RL training proceeds, the policy concentrates on a few reward-maximizing moves and stops exploring alternatives. The corpus pins this down with an unusually clean empirical law, R = -a·exp(H) + b, under which performance R saturates as policy entropy H approaches zero, and frames it as the primary ceiling on RL scaling for reasoning (Does policy entropy collapse limit reasoning performance in RL?). The same squeeze shows up beyond reasoning: search agents lose behavioral diversity through the same entropy-collapse mechanism, converging on narrow strategies (Does reinforcement learning squeeze exploration diversity in search agents?).
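To make the saturation concrete, here is a minimal sketch that evaluates the quoted form R = -a·exp(H) + b for three toy policies. The coefficients `A` and `B` and the three distributions are hypothetical values chosen for illustration, not numbers from the corpus.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def predicted_performance(entropy, a, b):
    """Evaluate R = -a * exp(H) + b, the form quoted in the text."""
    return -a * math.exp(entropy) + b

# Hypothetical coefficients, for illustration only.
A, B = 0.2, 1.0

# Toy policies over four actions, from broad to fully collapsed.
policies = {
    "broad": [0.25, 0.25, 0.25, 0.25],
    "narrow": [0.97, 0.01, 0.01, 0.01],
    "collapsed": [1.0, 0.0, 0.0, 0.0],
}

for name, probs in policies.items():
    h = shannon_entropy(probs)
    r = predicted_performance(h, A, B)
    print(f"{name:>9}: H = {h:.3f} nats, predicted R = {r:.3f}")
```

Under this form, exp(H) equals 1 once H reaches zero, so the predicted performance cannot exceed b - a. That is the sense in which spending all the entropy fixes the ceiling.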

Representational convergence is about *form*: which of several available internal styles or output formats the model commits to. Here the striking corpus result is that RL doesn't invent a new format. Within the first epoch it amplifies one format distribution already present from pretraining and suppresses the alternatives, and which format wins depends on model scale, not necessarily on which one performs best (Does RL training collapse format diversity in pretrained models?). So the convergence is a selection among pre-existing representations, not a loss of exploratory probability mass. That distinction matters: entropy collapse is a continuous narrowing you can measure and counteract (Clip-Cov, KL-Cov, and GPPO all manage the rate of entropy reduction); format convergence is closer to a winner-take-all tipping point baked in early.
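One way to see that the two quantities can move independently is the chain rule for entropy, H(format, action) = H(format) + H(action | format). The sketch below is a toy under stated assumptions: it treats "format" as a discrete label and uses made-up distributions, which is a simplification for illustration rather than how the cited experiments measure either effect.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def decompose(format_probs, action_probs_by_format):
    """Return (H(format), H(action | format)) for a toy two-level policy."""
    h_format = entropy(format_probs)
    h_action_given_format = sum(
        pf * entropy(actions)
        for pf, actions in zip(format_probs, action_probs_by_format)
    )
    return h_format, h_action_given_format

uniform_actions = [0.25, 0.25, 0.25, 0.25]
peaked_actions = [0.97, 0.01, 0.01, 0.01]

# Toy case 1: one format wins, but choices inside each format stay broad.
format_convergence = decompose([0.98, 0.01, 0.01], [uniform_actions] * 3)

# Toy case 2: formats stay mixed, but choices inside each format narrow.
entropy_collapse = decompose([1 / 3, 1 / 3, 1 / 3], [peaked_actions] * 3)

for name, (h_f, h_a) in [("format convergence", format_convergence),
                         ("entropy collapse", entropy_collapse)]:
    print(f"{name}: H(format) = {h_f:.3f}, H(action | format) = {h_a:.3f}")
```

In the first toy case the format term is nearly zero while the within-format term is untouched; in the second the pattern is reversed. A single aggregate entropy number would blur the two together.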

The two also differ in how reversible they are. Entropy collapse is partly a training-dynamics problem: SFT on diverse demonstrations restores exploration breadth that RL squeezed out (Does reinforcement learning squeeze exploration diversity in search agents?), and keeping the policy close to its base distribution (low KL drift) preserves the model's plasticity, so it keeps learning new tasks instead of stalling when the domain shifts (Does staying close to the base model preserve learning ability?). Representational structure, by contrast, is laid down in how the network organizes itself: networks learn dense activations for familiar data and stay sparse for the unfamiliar (Is representational sparsity learned or intrinsic to neural networks?), and they decompose tasks into modular subnetworks that pretraining makes more consistent (Do neural networks naturally learn modular compositional structure?). That representational scaffolding is what gets *selected from* when a format wins. It is the substrate, not the behavioral knob.
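"KL drift" can be made concrete as the KL divergence between the trained policy and its base distribution. The sketch below is a minimal illustration on made-up distributions; the direction KL(policy || base) and the variable names are assumptions, since the text does not say how the cited work computes drift.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-action distributions over four options.
base = [0.40, 0.30, 0.20, 0.10]      # base (pretrained) model
anchored = [0.50, 0.30, 0.15, 0.05]  # policy kept close to the base
drifted = [0.97, 0.01, 0.01, 0.01]   # policy that collapsed onto one option

print(f"anchored drift: {kl_divergence(anchored, base):.3f} nats")
print(f"drifted  drift: {kl_divergence(drifted, base):.3f} nats")
```

A KL-anchored objective would penalize the second policy far more than the first, which is the behavioral lever the paragraph describes. It acts on the action distribution, not on which inherited format the network has committed to.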

What ties them together is that iterative training has a phase structure, and the two phenomena dominate at different moments. RL training moves through a first phase where execution correctness drives learning and a second where strategic planning becomes the bottleneck. Tellingly, planning-token entropy *rises* while execution entropy stabilizes (Does RL training follow a predictable two-phase learning sequence?). So 'entropy collapse' isn't a uniform fate across the whole model: some channels settle (execution) while others need to stay open (planning). The thing worth walking away with: collapse and convergence aren't synonyms for the same decay. One is the policy spending its exploration budget; the other is the model committing to one of several inherited ways of representing the problem. The interventions that address the first (entropy regularizers, SFT refresh, KL anchoring) don't touch the second.
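Tracking entropy per token role, instead of as one average, is what makes the planning/execution split visible. The sketch below assumes each generation step has already been labeled "planning" or "execution"; that labeling, the function name, and the toy distributions are all hypothetical, since the text does not describe how tokens are classified.

```python
import math
from collections import defaultdict

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_entropy_by_role(steps):
    """steps: iterable of (role, next_token_probs). Returns {role: mean entropy}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for role, probs in steps:
        totals[role] += entropy(probs)
        counts[role] += 1
    return {role: totals[role] / counts[role] for role in totals}

# Toy rollout: role labels and distributions are made up for illustration.
rollout = [
    ("planning", [0.40, 0.30, 0.30]),
    ("execution", [0.98, 0.01, 0.01]),
    ("planning", [0.50, 0.25, 0.25]),
    ("execution", [0.99, 0.005, 0.005]),
]

per_role = mean_entropy_by_role(rollout)
for role, h in per_role.items():
    print(f"{role}: mean entropy = {h:.3f} nats")
```

Logged across training checkpoints, the two series could be compared directly, so a rise in one would not be masked by stabilization in the other.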

Sources: 7 notes

Does policy entropy collapse limit reasoning performance in RL?

Empirical law R = -a·exp(H) + b shows performance saturates when policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does staying close to the base model preserve learning ability?

FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.

Is representational sparsity learned or intrinsic to neural networks?

During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about iterative training dynamics in LLMs. The question remains open: how do representational convergence and policy entropy collapse differ as failure modes, and can they be decoupled?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat as perishable snapshots:

• Policy entropy collapse follows an empirical law (R ≈ −a·exp(H) + b) and is the primary ceiling on RL scaling for reasoning (~2025).
• RL amplifies a single format distribution from pretraining within the first epoch; the winner depends on model scale, not necessarily performance (~2025).
• SFT on diverse demonstrations *restores* exploration breadth that RL squeezed out; low KL drift from the base model preserves plasticity for continual learning (~2025).
• RL training exhibits a two-phase dynamic: execution entropy stabilizes while planning-token entropy *rises*, suggesting entropy collapse is not uniform across the model (~2025).
• Representational density is learned through training-data familiarity; unfamiliar data stays sparse. Networks decompose compositional tasks into modular subnetworks (~2023, ~2025).

Anchor papers (verify; mind their dates):
- arXiv:2504.07912 (Echo Chamber, ~2025): RL post-training as amplification, not invention.
- arXiv:2505.22617 (~2025): Entropy as mechanism in RL for reasoning.
- arXiv:2605.22817 (Vector Policy Optimization, ~2026): Training for diversity improves test-time search.
- arXiv:2605.12484 (Learning, Fast and Slow, ~2026): Continual adaptation in LLMs.

Your task:
(1) RE-TEST THE DECOUPLING. For each finding, judge whether newer models (o1, o3-scale reasoning agents), training methods (process reward models, outcome-supervised RL), or orchestration (multi-agent debate, chain-of-thought caching) have since *relaxed* entropy collapse, accelerated format convergence, or blurred the distinction. Does the empirical law still hold? Do diverse-demonstration SFT interventions still restore breadth? Flag where the constraint appears to hold and where it may have dissolved.
(2) Surface work from the last 6 months that contradicts the two-phase dynamic or shows entropy rising *during* execution (not planning), or evidence that format convergence is *not* scale-dependent.
(3) Propose 2 research questions that assume the regime may have shifted: (a) Can we engineer representational multiplicity *after* format selection to recover lost diversity without retraining? (b) Do multi-task continual-learning setups with low KL drift reset the convergence clock, or does the pretraining format reassert itself?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
