Does RL training collapse format diversity in pretrained models?
Exploring whether RL fine-tuning systematically selects one output format from pretraining while suppressing others, and how this selection mechanism drives performance gains.
A study with full pretraining transparency (models pretrained from scratch on known open datasets) reveals a striking structural pattern: RL fine-tuning does not simply improve reasoning — it systematically selects for and amplifies a single format from the pretraining mixture while collapsing all others.
The mechanism: early in RL training (within the first epoch), the model shifts toward generating outputs in the format of one specific distribution — code-like formats for smaller models, natural language formats for larger models. This transition coincides with the largest accuracy gain, suggesting the selection of a dominant format is what drives improvement, not a gradual enhancement across all formats.
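One way to observe this shift is to label sampled outputs by format at each checkpoint and track the share of each format over training. The sketch below is only an illustration under stated assumptions: the regex heuristic, the two format labels, the helper names, and the toy samples are all hypothetical and are not the classifier or data used in the study.

```python
import re
from collections import Counter

# Hypothetical heuristic: a line starting with a common Python construct
# marks the output as code-like. A real analysis would need a more careful
# classifier matched to the actual pretraining sources.
CODE_PATTERN = re.compile(r"^\s*(def |import |print\(|return )", re.MULTILINE)


def classify_format(output: str) -> str:
    """Label one sampled output as 'code' or 'natural_language'."""
    return "code" if CODE_PATTERN.search(output) else "natural_language"


def format_shares(outputs: list[str]) -> dict[str, float]:
    """Fraction of outputs in each format for a single checkpoint."""
    counts = Counter(classify_format(o) for o in outputs)
    total = sum(counts.values())
    return {fmt: n / total for fmt, n in counts.items()}


def dominant_format(shares: dict[str, float]) -> str:
    """Format with the largest share of sampled outputs."""
    return max(shares, key=shares.get)


# Toy samples keyed by RL training step (illustrative only).
samples_by_step = {
    0: [
        "The answer is 4 because 2 + 2 = 4.",
        "We add the numbers to get 4.",
        "Adding two and two gives four.",
        "def solve():\n    return 2 + 2",
    ],
    100: [
        "def solve():\n    return 2 + 2",
        "print(2 + 2)",
        "import math\nprint(int(math.sqrt(16)))",
        "So the result is 4.",
    ],
}

for step, outputs in sorted(samples_by_step.items()):
    shares = format_shares(outputs)
    print(step, dominant_format(shares), shares)
```

Plotting these shares next to accuracy per checkpoint would show whether the largest accuracy jump lines up with the step at which one format takes over.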
Key findings:
- The dominant distribution is typically the most performant: RL selects for the format in which the base model is already strongest
- Scale-dependent bias: smaller models favor simpler, code-like formats; larger models shift toward natural language
- The degree of amplification depends on the KL penalty: looser KL constraints produce more extreme format collapse
- RL does not always favor the most common distribution: pretraining proportions only sometimes predict which distribution "wins"
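A toy calculation can make the interaction between performance, pretraining proportion, and KL strength concrete. It treats each format as a single choice with base probability p_f and expected reward r_f, and uses the standard closed-form maximizer of expected reward minus beta times KL(q || p), which is q_f proportional to p_f * exp(r_f / beta). The format names, proportions, rewards, and the function name are hypothetical, and this format-level simplification is an assumption for illustration, not the study's training setup.

```python
import math


def kl_regularized_optimum(base_probs, rewards, beta):
    """Maximizer of sum_f q_f * r_f - beta * KL(q || p) over distributions q.

    Closed form: q_f is proportional to p_f * exp(r_f / beta).
    Computed in log space for numerical stability at small beta.
    """
    logits = {f: math.log(base_probs[f]) + rewards[f] / beta for f in base_probs}
    shift = max(logits.values())
    weights = {f: math.exp(v - shift) for f, v in logits.items()}
    z = sum(weights.values())
    return {f: w / z for f, w in weights.items()}


# Hypothetical pretraining proportions and per-format accuracies.
base_probs = {"natural_language": 0.6, "code": 0.3, "latex": 0.1}
rewards = {"natural_language": 0.30, "code": 0.45, "latex": 0.20}

# Smaller beta means a looser KL constraint.
for beta in (1.0, 0.1, 0.01):
    q = kl_regularized_optimum(base_probs, rewards, beta)
    print(beta, {f: round(p, 3) for f, p in q.items()})
```

In this toy setting a tight constraint (large beta) leaves the most common format on top, while a loose one (small beta) moves nearly all mass to the most performant format, which is consistent with the findings listed above.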
This is distinct in an important way from the note "Does policy entropy collapse limit reasoning performance in RL?". Entropy collapse describes diversity reduction within an output distribution. The echo chamber finding describes distribution selection: RL picks one distribution and amplifies it at the expense of all others. It is a format-level convergence, not just a diversity-level collapse.
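The distinction can be stated with the chain rule for entropy: when every output belongs to exactly one format, total output entropy equals the entropy over formats (between-format) plus the expected entropy of outputs inside each format (within-format). The sketch below computes both terms from labeled samples; the function names and toy data are hypothetical, and the format labels are assumed to be given.

```python
import math
from collections import Counter, defaultdict


def entropy_bits(counts):
    """Shannon entropy in bits of the empirical distribution given by counts."""
    total = sum(counts.values())
    return sum(-(n / total) * math.log2(n / total) for n in counts.values() if n > 0)


def decompose(samples):
    """Split output entropy into (between_format, within_format) bits.

    samples is a list of (format_label, output) pairs.
    """
    by_format = defaultdict(Counter)
    for fmt, out in samples:
        by_format[fmt][out] += 1
    format_counts = Counter({f: sum(c.values()) for f, c in by_format.items()})
    total = sum(format_counts.values())
    between = entropy_bits(format_counts)
    within = sum(
        (format_counts[f] / total) * entropy_bits(by_format[f]) for f in by_format
    )
    return between, within


# Entropy collapse: both formats survive, but each repeats a single output.
entropy_collapse = [("code", "a"), ("code", "a"), ("text", "b"), ("text", "b")]

# Format selection: one format survives, with varied outputs inside it.
format_selection = [("code", "a"), ("code", "b"), ("code", "c"), ("code", "d")]

print(decompose(entropy_collapse))
print(decompose(format_selection))
```

In the first toy case the within-format term goes to zero while the between-format term stays positive; in the second the pattern is reversed, which is the signature of distribution selection rather than entropy collapse.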
The implication for practitioners: RL fine-tuning results depend on the composition of the pretraining data mixture, but this dependence is largely hidden when starting from existing pretrained models whose training data is proprietary. Performance gains attributed to RL algorithms may partly reflect which pretraining distribution was selected rather than algorithmic superiority.
Inquiring lines that use this note as a source (252)
This note is a source for these synthesized inquiries.
- What happens when models train on AI-generated content recursively?
- Why do different AI models generate similar outputs independently?
- Why does RLHF alignment reduce the diversity of viewpoints in AI output?
- What role does rigid output format play in function calling failure modes?
- How do unstated constraints become invisible to training data distributions?
- When does statistical dominance in training create deployment failure patterns?
- Can dataset-level debiasing methods fix popularity bias inherited from pretraining?
- How do different LLM integration paradigms affect inheritance of pretraining biases?
- How do pretraining biases interact differently with prompts across model tiers?
- What training signals would models need to learn reciprocal common-ground construction?
- Can few-shot examples narrow generative diversity in creative tasks?
- Can world models form from aggregated partial information across training distributions?
- Why do proprietary models improve with training while open-source models decline?
- How much RLVR improvement comes from benchmark data memorization?
- Why does online RL succeed where supervised training fails for self-correction?
- Can fine-tuning or RLHF alone solve the persona distortion problem?
- Can distillation methods extract directional guidance that scalar RL cannot access?
- How does non-reasoning SFT prevent overfitting before RL training begins?
- Does alignment training create bidirectional instruction and response mappings?
- Why does self-generated training data outperform externally sourced data?
- What happens when a single loss function conflates representation learning with decision-making?
- Why does asymmetric self-play create naturally calibrated difficulty better than fixed curricula?
- What failure modes emerge when model-generated content trains on itself iteratively?
- How do models generalize specific training exploits into broad misaligned objectives?
- How do training objectives shape what a world model actually learns?
- Why does RLVR increase token entropy while decreasing answer diversity?
- Can feature disentanglement in gesture synthesis generalize to completely unseen voice distributions?
- Does narrow reallocation to remaining tasks constitute genuine adaptation?
- How do early layers preserve unbiased information while late layers conform?
- Does the model learn depth-wise drift as an explicit strategy?
- When does natural context diversity reduce the need for explicit exploration?
- How do different training objectives shift whether models over-predict or under-predict?
- How much does pretraining contribute to ToM performance versus task-specific training?
- How does distributional distance from pre-training relate to model difficulty?
- Why does training data format matter more than domain content?
- Why do zero-advantage rollouts destabilize training beyond just wasting compute?
- Can accelerated sampling techniques from image generation speed up evolutionary search?
- Why does AI output show diversity without multiplying actual points of view?
- Can in-context learning replicate the timing effects that RL teaches models?
- How does covariate diversity compare to the exploration assumptions of LinUCB?
- Does RLHF training suppress exploratory and qualifying language?
- How do surface statistical regularities enable correct outputs while degrading robustness?
- Why does NLI fine-tuning amplify frequency bias instead of teaching inference?
- Can the serving loop itself become the primary training data source?
- How does preference-based training compare to supervised fine-tuning for function calling?
- Do different function-calling subtasks have different entropy profiles during training?
- How do ensemble methods apply within a single model?
- Why does low temperature sampling extract consensus from diverse training data?
- How does training-time voting differ from inference-time majority voting over samples?
- What conditions make training diversity better than individual expert quality?
- How does self-distillation differ from standard fine-tuning approaches?
- Why does training data format matter more than its domain content?
- Why does mixed instruction data sometimes hurt specific model capabilities?
- What makes output convergence across models inevitable given input-side homogenization?
- What capabilities actually require massive scale versus specialized training regimes?
- How does mutual shaping through diverse training compare to population-level diversity effects?
- Why does context information fail to override prior training associations?
- Can structured output formats reduce instruction following degradation?
- Does training data format shape model reasoning more than domain content?
- How does reinforcement learning compare to differentiable joint training for RAG?
- Why does curriculum learning with tight budgets beat fixed-budget approaches?
- Why does fine-tuning change how models process retrieved context?
- Why do small training data contaminations persist through alignment for most attack types?
- How much can mitigation techniques like augmentation reduce priming without harming learning?
- How does behavioral fine-tuning differ from factual knowledge encoding in models?
- Does fine-tuning on NLI tasks reduce or amplify frequency bias?
- Can unsupervised confidence-based training scale to domains beyond human evaluation reach?
- Why does fine-tuning improve some capabilities while degrading others?
- How should researchers evaluate whether correct model outputs reflect real structural learning?
- Which AI imaginaries dominate training data and shape system behavior most strongly?
- How does training data distribution determine what models can learn?
- Why does fine-tuning fail to remove temporal contamination from pretraining?
- How does training frequency distribution shape what models reliably retrieve?
- Do instruction-tuned models learn tasks or just output format distributions?
- When should full-parameter post-training be used instead of LoRA adaptation?
- Does reasoning trace style explain why RL post-training improves model reasoning?
- What causes irreversible model collapse when training on model-generated content?
- Can backward transfer measurements reliably predict optimal multi-task training order?
- Why did prior multi-token prediction methods fail during fine-tuning?
- How could persona vector tracking complement multi-turn RL for earlier drift detection?
- Can suppressing incorrect behavior alone solve the diversity bottleneck in reasoning RL?
- Can RLHF training push models away from human-like lexical patterns?
- Can messy multi-agent transcripts become better training data than clean outputs?
- What distinguishes training-time entropy collapse from test-time variance inflation?
- How does training data format shape whether models reason in parallel or sequentially?
- Does pre-training encode personality patterns that fine-tuning later activates?
- Why do smaller and larger models converge on different output formats?
- How does inference variance differ from training entropy collapse?
- Can diversity-aware RL objectives prevent format convergence?
- Why do production systems optimize for three model classes instead of foundation models?
- What role does KL penalty strength play in format selection?
- Why do production teams choose expensive frontier models over fine-tuning?
- Does foundational model training or user priors more strongly shape final outputs?
- Why does consistency training make models resistant to prompt perturbations?
- Does removing cognitive bias from training signals accidentally break what makes alignment work?
- Does fine-tuning actually change model capabilities or only output distribution?
- Can synthetic data generation balance all three QDC axes simultaneously?
- What creates the irreducible trade-off between quality and diversity in training data?
- How does diversity loss in synthetic data mirror tail distribution disappearance?
- Why does combining reasoning distillation with RLVR outperform either training stage alone?
- Can smaller models achieve domain expertise through focused RL training?
- How does RL compress reasoning path diversity during training?
- Which recipe choices determine the asymptotic ceiling in RL training?
- Does self-generated training data reduce a model's capability diversity?
- How do RL subnetworks identified from different random seeds compare?
- Does sparsity in RL arise from training on policy-distribution data?
- How does KL penalty strength affect the degree of format collapse during RL?
- Can RL format selection explain performance gains attributed to algorithmic improvements?
- Is distribution selection during RL the same compression mechanism as entropy collapse?
- Why does RL improve sampling efficiency but not expand capability boundaries?
- How does behavior cloning reduce complexity before RL training in rerankers?
- Does negative reinforcement alone achieve what full RL training accomplishes?
- Why does long CoT training optimize for structural coherence over content correctness?
- Can proper scoring rules fix RLVR's degradation on disagreement prediction?
- What role do high-entropy minority tokens play in RLVR?
- What limits RLVR effectiveness beyond mathematical and coding domains?
- How do quality, diversity, and complexity create different effects on downstream model performance?
- How does pretrained knowledge constrain what adaptation strategies can achieve?
- Why do pretrained model priors reduce the usefulness of retrieved experience?
- Does training data format matter more than who generates it?
- How does diversity collapse during iterative self-improvement cycles?
- How does representational convergence differ from policy entropy collapse in iterative training?
- How does task-oriented fine-tuning compare to preference tuning methods?
- Why does training order matter across different domain types?
- Can models converge on similar experience descriptions across different architectures?
- Can self-training drift be prevented by applying student compatibility filtering?
- Why does post-training suppress alignment faking in some models but amplify it in others?
- Why do self-consistency methods fail where pretraining bias is strongest?
- What signals detect when consensus training is silently degrading performance?
- What neural or architectural mechanism allows selective override of frequency effects?
- Does the Assistant Axis exist in pre-trained models before instruction tuning?
- Can shifting the accuracy metric itself eliminate the need for diversity post-processing?
- How does training data format shape which reasoning patterns emerge in models?
- Does RLVR expand model capability or reorganize existing capability?
- What makes pretraining composition more important than reward engineering?
- How do RL training and base models differ in creating MI peaks?
- Do high-entropy RLVR tokens correspond to MI-peak tokens during inference?
- Do negative constraints require fundamentally different training signals than positive instructions?
- Can preference learning fix the rigid output format problem better than supervised training?
- Does format-based pretraining determine how models respond to reinforcement learning?
- Does critique training improve exploration diversity during model training or only test time?
- Does training data format determine whether models collapse entropy or inflate variance?
- Does environment stochasticity force models to generalize better across trajectory variations?
- How do trained weights differ from a stored library or text?
- How does repeated content shift model outputs across multiple turns?
- Why does better RLHF training fail to decouple polish from persona distortion?
- How do gradients flowing through both branches simultaneously reshape each component's role?
- Can the joint-training principle extend beyond memorization and generalization pairs?
- How do self-evolving curricula help RL break beyond base model capability boundaries?
- Can alignment training create systematic blind spots in threat detection systems?
- Can training format itself shape what reasoning strategy a model learns?
- What happens when you project the same model onto different harnesses?
- Does weight decay directly cause contractive behavior near training examples?
- What makes data augmentation an implicit form of contraction learning?
- How does adversarial collapse threaten unsupervised self-play skill construction?
- Why does eliminating proxy-model filtering improve reasoning emergence in pretraining?
- Does pretraining poisoning at scale persist through instruction alignment?
- How do reward signals in RLVR interact with pretraining biases?
- What happens to base model capabilities when you apply finetuning?
- How do retrieval and fine-tuning trade off flexibility against training cost?
- How should skill libraries coordinate with gradient-based weight optimization?
- Why do queries with low cross-rollout variance produce degenerate gradients?
- Can skill libraries prevent redundant narrow artifacts from proliferating?
- How does post-training shift models from passive prediction to on-policy action?
- What alignment procedures cause different models to share the same output distribution?
- Can explicitly optimizing for semantic diversity during RL training improve both quality and variation?
- Why does preference tuning reduce diversity in code but increase it in creative tasks?
- What happens to model grounding when preference optimization increases effective diversity?
- Why does the order of training examples matter for what models learn?
- How do encode-decode contractive biases create stable attractors in latent space?
- How does upstream value embedding differ from downstream alignment patches?
- Why do sparse parameter subsets enable full-rank learning in RL?
- How do pre-training and distillation enable minimal routing signals to work?
- Where does skill extraction fail compared to genuine model adaptation?
- Does RL training activate latent meta-learning capacity or create it from scratch?
- What scaling properties emerge from RL training dynamics beyond verification?
- Why should deep learning theory prioritize average-case over worst-case analysis?
- What solvable idealized settings reveal fundamental phenomena in realistic deep learning?
- How does KL regularization prevent both forgetting and adaptation loss?
- How does curriculum learning prevent instability in social-emotional RL training?
- How much does pretraining quality affect the modularity of fine-tuned models?
- How does entropy loss enable exploration beyond a single training example?
- How does on-policy entropy recognition differ from training-time entropy collapse?
- Can group-relative normalization be modified to resist shortcut trajectories?
- How should multi-objective post-training balance competing behavioral goals?
- What is the behavioral signature of a model tracking input surprise?
- What is the difference between changing model outputs versus changing internal representations?
- Can mechanistic interpretability tools decode the biases alignment training conceals?
- Why do overtrained domains show different RL training outcomes than novel tasks?
- What makes a learned consolidation rule lossy and where does contamination enter?
- What's the difference between RLHF, RLVR, and RLCF as training paradigms?
- What makes supervised fine-tuning worsen RL exploration later?
- Why do six different RLVR algorithms converge on similar performance levels?
- How does prolonged RL training differ from standard RLVR approaches?
- Can entropy regularization or critique models prevent search strategy collapse during RL training?
- Why does supervised fine-tuning on diverse demonstrations expand exploration diversity compared to RL?
- Does the pretrained model prior limit RL search capability more than the optimization algorithm itself?
- When does RLHF reduce diversity and when does it preserve semantic variation?
- How does stage-wise training scheduling resolve conflicts between constraint-following and creative tasks?
- Can specialized components replace single fully-trained models in deployment?
- What features does a sample reinforce when it moves bands?
- What mechanisms cause overly hard samples to degrade prior model performance?
- Why do certain tokens at certain difficulties drive most of RLVR's learning signal?
- How do verifier-free RL patterns differ from traditional RLHF approaches?
- Can trained models encode programs more complex than their data-generating process?
- Why do preference-tuned models produce different diversity patterns in code versus creative writing?
- How does probability mass concentration affect sampling diversity across model scales?
- At what point does output quality outweigh diversity value in synthetic data tasks?
- What output distribution properties make smaller models better for wide sampling?
- How does in-weights adaptation create spurious forgetting in models?
- Does RLVR teach new reasoning or activate existing pretraining capabilities?
- What pretraining formats encode latent reasoning strategies that RLVR can surface?
- Does careful reward engineering matter if pretraining determines RLVR effectiveness?
- Do text-space skills transfer learning across different frontier models?
- What training regimes confound surface mechanisms with their actual causes?
- How do sparse parameter updates enable when-not-how training to work?
- Can single-problem fine-tuning match full RL pipeline reasoning gains?
- Why does the pretrained prior determine the exploration ceiling?
- Why does gradient discarding limit standard policy clipping?
- Why does reinforcement learning training degrade model calibration?
- Why does outcome-based RL specifically lose diversity during training?
- Can RL directly optimize attention distributions instead of text generation?
- Does semantic diversity in output space compete with reward-component diversity?
- Why should scaling laws be understood as properties of data distribution rather than training in general?
- How much does diversity training cost in single-shot pass@1 performance?
- Can experimental outcomes be reliably distilled into reusable insights?
- Why does diversity in LLM outputs mask sampling from community priors?
- How does pretraining determine what RL can later teach a model?
- Why do unified models still inherit data-distribution biases from training?
- Does alignment compound cultural bias that started during pretraining?
- What makes trajectory quality matter more than one-shot task success?
- How do open-world evaluations correct distortions that automated benchmarks introduce?
- Can models generate their own training curriculum during offline dreaming?
- Which finetuning method works best across different task and data regimes?
- How do finetuning and pretraining improvements differ in their effects on model capabilities?
- Can this whole-artifact principle apply to other generative tasks?
- How do task frequency and complexity interact with model capacity during training?
- How do normalization and input injection control emergence of fixed points?
- Does verbalized sampling preserve factual accuracy and safety during diversity gains?
- Can decoding-time prompting strategies fully replace diversity-focused training methods?
- Do sample-level similarities between pretraining and downstream tasks explain the frequency effect?
- How do weight visualizations reveal temporal structure in cyclic training?
- Can training order and structure shape what networks retain and learn?
- How does model scale affect anticipatory behavior in structured training?
- Can data pruning and equal contribution be reconciled in optimal learning?
- Does latent density emerge during pretraining from training data familiarity?
- How does active selection of training content differ from random reinforcement sampling?
- Can models trained with RL on pretraining data avoid reward hacking seen in RLHF?
- Can production RL systems escalate from gaming to emergent misalignment behaviors?
- How much performance is lost when converting pretrained checkpoints versus training from scratch?
- How do complexity and diversity affect model performance differently?
- Does finetuning facts into weights overwrite existing model capabilities?
Related concepts in this collection (6)
- Does policy entropy collapse limit reasoning performance in RL?
  As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
  entropy collapse is the within-distribution consequence; this note is the between-distribution mechanism
- Why do reasoning models fail differently at training versus inference?
  Reasoning models exhibit two distinct failure modes (entropy collapse during training and variance inflation during inference) that appear unrelated but may share underlying causes. Understanding these dual problems could reveal whether separate or unified solutions are needed.
  adds a third layer: not just entropy collapse and variance inflation, but distribution selection
- Can simple rewards alone teach complex domain reasoning?
  Does reinforcement learning on difficult problems with basic accuracy rewards produce sophisticated reasoning strategies without explicit chain-of-thought training? This challenges assumptions about what domain AI models need to learn effectively.
  emergence through RL looks different when the pretraining mixture is known: it's partly selection, not purely emergence
- Does RL improve domain reasoning by adding knowledge or removing it?
  When reinforcement learning improves reasoning in specialized domains like medicine, is it teaching models new facts or preventing them from using wrong ones? Understanding this distinction matters for how we design RL training.
  pruning operates within the selected distribution; this note shows which distribution gets to keep its knowledge
- Does reinforcement learning squeeze exploration diversity in search agents?
  Investigates whether RL training narrows the behavioral diversity of search agents the same way it does in reasoning tasks. Understanding this mechanism could reveal whether entropy collapse is fundamental to RL or domain-specific.
  confirms the echo chamber dynamic is domain-general: RL squeezes search strategy diversity just as it selects a single pretraining format; format selection and within-format entropy collapse are two levels of the same RL compression
- Why does RLVR training narrow a model's problem solving ability?
  RLVR's on-policy constraint may force models to exploit known reasoning paths rather than explore new ones, potentially shrinking their effective problem-solving scope. Understanding this mechanism could reveal how to design better exploration incentives in language model reasoning.
  capability boundary collapse is the downstream consequence of format selection: when RL selects one dominant distribution, problems solvable only through suppressed formats become unreachable
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
- The Art of Scaling Reinforcement Learning Compute for LLMs
- Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs
- Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs
- 1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities
- Reinforcement Learning for Reasoning in Large Language Models with One Training Example
- Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!
- SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
Original note title
rl post-training converges on a single dominant pretraining distribution format, suppressing all others