Does RL training collapse format diversity in pretrained models?

Exploring whether RL fine-tuning systematically selects one output format from pretraining while suppressing others, and how this selection mechanism drives performance gains.

Synthesis note · 2026-02-22 · sourced from Reasoning Critiques

A study with full pretraining transparency (models pretrained from scratch on known open datasets) reveals a striking structural pattern: RL fine-tuning does not simply improve reasoning — it systematically selects for and amplifies a single format from the pretraining mixture while collapsing all others.

The mechanism: early in RL training (within the first epoch), the model shifts toward generating outputs in the format of one specific distribution — code-like formats for smaller models, natural language formats for larger models. This transition coincides with the largest accuracy gain, suggesting the selection of a dominant format is what drives improvement, not a gradual enhancement across all formats.
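One way to observe this kind of shift is to label sampled outputs by format at each checkpoint and track each format's share over time. The following is a minimal sketch under stated assumptions, not the study's method: `classify_format` is a hypothetical keyword heuristic, and the checkpoint steps and sample strings are made-up toy data.

```python
from collections import Counter

def classify_format(text: str) -> str:
    """Hypothetical heuristic: label a sample as 'code' or 'natural_language'."""
    code_markers = ("def ", "return ", "print(", "import ")
    return "code" if any(m in text for m in code_markers) else "natural_language"

def format_shares(samples: list[str]) -> dict[str, float]:
    """Fraction of samples falling into each format."""
    counts = Counter(classify_format(s) for s in samples)
    total = sum(counts.values())
    return {fmt: n / total for fmt, n in counts.items()}

def largest_jump(shares_by_step: dict[int, dict[str, float]], fmt: str) -> tuple[int, float]:
    """Checkpoint at which `fmt` gained the most share over the previous checkpoint."""
    steps = sorted(shares_by_step)
    deltas = [
        (b, shares_by_step[b].get(fmt, 0.0) - shares_by_step[a].get(fmt, 0.0))
        for a, b in zip(steps, steps[1:])
    ]
    return max(deltas, key=lambda d: d[1])

# Toy samples per checkpoint (illustrative only, not real model outputs).
samples_by_step = {
    0: [
        "print(2 + 3)",
        "First add 2 and 3 to get 5.",
        "The answer is 5 because 2 + 3 = 5.",
        "Two plus three equals five.",
    ],
    100: [
        "def solve():\n    return 2 + 3",
        "print(2 + 3)",
        "import math\nprint(math.fsum([2, 3]))",
        "Add 2 and 3 to get 5.",
    ],
    200: [
        "def solve():\n    return 5",
        "print(5)",
        "x = 2 + 3\nprint(x)",
        "def f():\n    return 2 + 3",
    ],
}

shares_by_step = {step: format_shares(s) for step, s in samples_by_step.items()}
jump_step, jump_size = largest_jump(shares_by_step, "code")
```

Plotting the per-format shares next to task accuracy would show whether the largest format shift and the largest accuracy gain land on the same checkpoint, which is the coincidence the note describes.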

Key findings:

The dominant distribution is typically the most performant: RL selects for the format in which the base model is already strongest
Scale-dependent bias: smaller models favor simpler, code-like formats; larger models shift toward natural language
The degree of amplification depends on the KL penalty: looser KL constraints produce more extreme format collapse
RL does not always favor the most common distribution: a format's share of the pretraining mixture only sometimes predicts which distribution "wins"
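
The KL dependence in the findings above can be read against the common KL-regularized RL objective. The note does not state the exact objective the study used, so the form below is an assumption: the reward r, the prompt distribution D, the reference (pretrained) policy π_ref, and the coefficient β are illustrative symbols.

```latex
\[
J(\theta) \;=\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\big[\, r(x, y) \,\big]
\;-\;
\beta \, \mathbb{E}_{x \sim \mathcal{D}}
\Big[ D_{\mathrm{KL}}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \Big]
\]
```

Under this form, a smaller β is a looser constraint: the policy can move further from the pretrained reference, which is consistent with a more extreme collapse onto one format.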

This is distinct, in an important way, from the phenomenon covered in "Does policy entropy collapse limit reasoning performance in RL?". Entropy collapse describes diversity reduction within a single output distribution. The finding here, an echo chamber effect, describes distribution selection: RL picks one pretraining distribution and amplifies it at the expense of all others. It is a format-level convergence, not just a diversity-level collapse.
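
The two levels can be separated numerically with the chain rule for entropy, H(format, output) = H(format) + H(output | format). The sketch below is an illustration of that distinction only: the format names, outputs, and probabilities are invented, and nothing here is a measurement from the study.

```python
import math

def entropy(probs) -> float:
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def decompose(joint: dict[str, dict[str, float]]) -> tuple[float, float]:
    """Split entropy into (between-format, within-format) parts.

    joint maps format -> {output: probability}; all probabilities sum to 1.
    """
    format_probs = {f: sum(outs.values()) for f, outs in joint.items()}
    between = entropy(format_probs.values())
    within = sum(
        pf * entropy([p / pf for p in joint[f].values()])
        for f, pf in format_probs.items()
        if pf > 0
    )
    return between, within

# Toy base model: two formats, two outputs each, all equally likely.
base = {"code": {"a": 0.25, "b": 0.25}, "nl": {"c": 0.25, "d": 0.25}}

# Entropy collapse: both formats survive, but each narrows to one output.
entropy_collapse = {"code": {"a": 0.5, "b": 0.0}, "nl": {"c": 0.5, "d": 0.0}}

# Distribution selection: one format vanishes, the survivor keeps its diversity.
format_selection = {"code": {"a": 0.5, "b": 0.5}, "nl": {"c": 0.0, "d": 0.0}}

base_between, base_within = decompose(base)
collapse_between, collapse_within = decompose(entropy_collapse)
selection_between, selection_within = decompose(format_selection)
```

In the toy numbers, entropy collapse removes the within-format term while leaving the between-format term intact, and distribution selection does the reverse. A single total-entropy number would not distinguish the two cases.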

The implication for practitioners: RL fine-tuning results depend on the composition of the pretraining data mixture, but this dependence is largely hidden when starting from existing pretrained models whose training data is proprietary. Performance gains attributed to RL algorithms may therefore partly reflect which pretraining distribution was selected rather than algorithmic superiority.

Related concepts in this collection 6

Does policy entropy collapse limit reasoning performance in RL?
As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
entropy collapse is the within-distribution consequence; this note is the between-distribution mechanism

Why do reasoning models fail differently at training versus inference?
Reasoning models exhibit two distinct failure modes (entropy collapse during training and variance inflation during inference) that appear unrelated but may share underlying causes. Understanding these dual problems could reveal whether separate or unified solutions are needed.
adds a third layer: not just entropy collapse and variance inflation, but distribution selection

Can simple rewards alone teach complex domain reasoning?
Does reinforcement learning on difficult problems with basic accuracy rewards produce sophisticated reasoning strategies without explicit chain-of-thought training? This challenges assumptions about what domain AI models need to learn effectively.
emergence through RL looks different when the pretraining mixture is known: it's partly selection, not purely emergence

Does RL improve domain reasoning by adding knowledge or removing it?
When reinforcement learning improves reasoning in specialized domains like medicine, is it teaching models new facts or preventing them from using wrong ones? Understanding this distinction matters for how we design RL training.
pruning operates within the selected distribution; this note shows which distribution gets to keep its knowledge

Does reinforcement learning squeeze exploration diversity in search agents?
Investigates whether RL training narrows the behavioral diversity of search agents the same way it does in reasoning tasks. Understanding this mechanism could reveal whether entropy collapse is fundamental to RL or domain-specific.
confirms the echo chamber dynamic is domain-general: RL squeezes search strategy diversity just as it selects a single pretraining format; format selection and within-format entropy collapse are two levels of the same RL compression

Why does RLVR training narrow a model's problem solving ability?
RLVR's on-policy constraint may force models to exploit known reasoning paths rather than explore new ones, potentially shrinking their effective problem-solving scope. Understanding this mechanism could reveal how to design better exploration incentives in language model reasoning.
capability boundary collapse is the downstream consequence of format selection: when RL selects one dominant distribution, problems solvable only through suppressed formats become unreachable

Original note title

rl post-training converges on a single dominant pretraining distribution format, suppressing all others
