Why does supervised fine-tuning on diverse demonstrations expand exploration diversity compared to RL?

This explores why training on a varied set of example demonstrations tends to keep a model's range of behaviors wide, while reinforcement learning narrows that range — and what mechanism drives the difference.

The corpus points to a single underlying mechanism: RL is a convergence process and SFT is a coverage process. RL optimizes toward whatever maximizes reward, so policies collapse onto a narrow band of winning strategies, a pattern documented as entropy collapse. This isn't unique to one task: it was first shown in reasoning, and then shown to recur through the same mechanism in search agents (see "Does reinforcement learning squeeze exploration diversity in search agents?"). SFT on diverse demonstrations does the opposite: it spreads probability mass across all the behaviors the dataset displays, so breadth is preserved rather than competed away.
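The convergence-versus-coverage contrast can be illustrated with a toy categorical policy. The sketch below is not taken from any of the cited work: the five-action setup, the single-peak reward vector, the uniform demonstration distribution, and the use of exact expected gradients (rather than sampled rollouts) are all assumptions chosen to keep the example deterministic. It starts both runs from the same "pretrained" policy that already favors one behavior, then tracks Shannon entropy.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def rl_step(logits, rewards, lr):
    # Exact expected policy gradient for a softmax policy:
    # d E[r] / d logit_i = p_i * (r_i - E[r]).
    probs = softmax(logits)
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    return [z + lr * p * (r - expected_reward)
            for z, p, r in zip(logits, probs, rewards)]

def sft_step(logits, demo_dist, lr):
    # Gradient of the expected log-likelihood of demonstrations:
    # d E_q[log p] / d logit_i = q_i - p_i.
    probs = softmax(logits)
    return [z + lr * (q - p) for z, p, q in zip(logits, probs, demo_dist)]

def run(step_fn, target, init_logits, steps=200, lr=1.0):
    logits = list(init_logits)
    for _ in range(steps):
        logits = step_fn(logits, target, lr)
    return softmax(logits)

# Hypothetical setup: five behaviors, the starting policy prefers behavior 0.
INIT_LOGITS = [2.0, 0.0, 0.0, 0.0, 0.0]
REWARDS = [1.0, 0.0, 0.0, 0.0, 0.0]   # single-peak reward (assumed)
DEMO_DIST = [0.2] * 5                 # demonstrations cover all behaviors (assumed)

initial_entropy = entropy(softmax(INIT_LOGITS))
rl_entropy = entropy(run(rl_step, REWARDS, INIT_LOGITS))
sft_entropy = entropy(run(sft_step, DEMO_DIST, INIT_LOGITS))

print(f"initial: {initial_entropy:.3f}")
print(f"after RL:  {rl_entropy:.3f}")
print(f"after SFT: {sft_entropy:.3f}")
```

In this toy setting the RL run drives entropy toward zero while the SFT run raises it to the entropy of the demonstration distribution. Real training adds sampling noise, KL penalties, and function approximation, so the sketch shows only the direction of the effect.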

There's a sharper version of this story that's worth knowing: RL doesn't just narrow behavior, it tends to pick one winner from the model's existing repertoire and amplify it. Controlled experiments show RL converges on a single dominant format inherited from pretraining within the first epoch, actively suppressing the alternatives, and the format that 'wins' often depends on model scale rather than on which format performs best (see "Does RL training collapse format diversity in pretrained models?"). So RL's diversity loss is partly a consequence of how it concentrates an already-present distribution, not how it explores new ground. Relatedly, RL updates turn out to be structurally narrow at the parameter level too: only 5 to 30 percent of parameters move, in nearly identical subnetworks across seeds (see "Does reinforcement learning update only a small fraction of parameters?"). That is a fitting signature for a process that sharpens rather than broadens.
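To make the parameter-level claim concrete, here is a minimal sketch of how one might measure the fraction of parameters that moved between two checkpoints, and how much two seeds agree on which parameters moved. The flat lists of floats, the `tol` threshold, and the Jaccard overlap are illustrative assumptions; the cited work may define and measure update sparsity differently, and real checkpoints would be tensors rather than lists.

```python
def updated_indices(before, after, tol=0.0):
    # Indices whose value changed by more than `tol` (hypothetical threshold).
    if len(before) != len(after):
        raise ValueError("checkpoints must have the same number of parameters")
    return {i for i, (b, a) in enumerate(zip(before, after)) if abs(a - b) > tol}

def updated_fraction(before, after, tol=0.0):
    # Share of parameters that moved at all during training.
    return len(updated_indices(before, after, tol)) / len(before)

def jaccard_overlap(set_a, set_b):
    # Agreement between two sets of updated indices (1.0 means identical).
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0

# Toy data, made up for illustration only.
base = [0.10, -0.20, 0.30, 0.40, -0.50, 0.60, 0.70, -0.80, 0.90, 1.00]
seed_a = list(base)
seed_a[1], seed_a[4] = -0.25, -0.45                  # seed A moves parameters 1 and 4
seed_b = list(base)
seed_b[1], seed_b[4], seed_b[8] = -0.22, -0.48, 0.95  # seed B moves 1, 4 and 8

frac_a = updated_fraction(base, seed_a)
frac_b = updated_fraction(base, seed_b)
overlap = jaccard_overlap(updated_indices(base, seed_a),
                          updated_indices(base, seed_b))
print(frac_a, frac_b, overlap)
```

A small updated fraction combined with a high overlap across seeds is the pattern the paragraph above describes as structural rather than arbitrary.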

But the comparison is not 'SFT good, RL bad,' and this is where the corpus rewards going wider than the question's framing. The direction of the diversity effect depends heavily on what the domain rewards. Preference tuning reduces lexical diversity in code (where correctness pulls everything toward one answer) but *increases* it in creative writing (where distinctiveness is rewarded) (see "Does preference tuning always reduce diversity the same way?"). The same split shows up in entropy dynamics: structured domains systematically lose output entropy under training while open-ended domains gain it, and the order you train them in mechanically reshapes the outcome (see "Does training order reshape how models handle different task types?"). So 'RL squeezes diversity' is really 'RL squeezes diversity wherever the reward has a single peak.'
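Lexical diversity can be quantified in several ways, and the notes do not say which metric the cited work uses. One common choice is distinct-n, the ratio of unique n-grams to total n-grams across a set of samples. The sketch below assumes whitespace tokenization and uses made-up sample outputs purely to show how a converged domain and an open-ended domain would score differently.

```python
def distinct_n(samples, n=2):
    # Ratio of unique n-grams to total n-grams over all samples.
    ngrams = []
    for text in samples:
        tokens = text.split()  # whitespace tokenization (an assumption)
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)

# Hypothetical outputs: code converges on one answer, stories do not.
code_samples = ["return a + b", "return a + b", "return a + b"]
story_samples = ["the fox ran home", "a storm broke early", "she sang to stars"]

code_score = distinct_n(code_samples)
story_score = distinct_n(story_samples)
print(f"code distinct-2:  {code_score:.2f}")
print(f"story distinct-2: {story_score:.2f}")
```

Three identical code samples share every bigram, so their score is low; three unrelated sentences share none, so theirs is 1.0. Comparing such a score before and after tuning, per domain, is one way to observe the split described above.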

The more interesting finding is that diverse-demonstration SFT and RL aren't rivals so much as complementary phases, and diversity is exactly the thing the SFT phase contributes to the partnership. Running imitation first to build varied, reasonable rollouts, then letting RL sharpen against verifiable rewards, beats either method alone, because the imitation phase creates enough behavioral spread for the reward signal to actually be informative (see "Does sequencing imitation then exploration training improve reasoning?"). Without that breadth, RL has pathologies: train it on problems that are too hard and it amplifies degenerate shortcuts that contaminate skills the model already had (see "Do overly hard RLVR samples actually harm model capabilities?"). There are even mid-training fixes: step-level critique models that counteract this 'tail narrowing' and keep solution diversity alive across self-training iterations (see "Do critique models improve diversity during training itself?").
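The note on overly hard samples in the Sources list attributes the shortcut problem to group-relative normalization, which turns rare accidental successes into high-advantage trajectories. A small worked example shows the arithmetic. The sketch assumes binary rewards, a group of 16 rollouts per prompt, and normalization by the group's population standard deviation with a small `eps`; these details are illustrative and may differ from the setup in the cited work.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    # Normalize each reward by the mean and standard deviation of its group.
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]

GROUP_SIZE = 16  # hypothetical number of rollouts per prompt

# Nearly impossible prompt: one lucky success among 16 rollouts.
hard = group_relative_advantages([1.0] + [0.0] * (GROUP_SIZE - 1))

# Moderately hard prompt: half of the rollouts succeed.
balanced = group_relative_advantages([1.0] * 8 + [0.0] * 8)

# Impossible prompt: no rollout succeeds, so there is no learning signal.
none_solved = group_relative_advantages([0.0] * GROUP_SIZE)

print(f"lucky success advantage:    {hard[0]:.3f}")
print(f"ordinary success advantage: {balanced[0]:.3f}")
print(f"all-failure advantages:     {set(none_solved)}")
```

With one success in 16, the successful rollout's advantage is about sqrt(15), nearly four times the advantage of a success on the balanced prompt. Whatever that lucky rollout did, including repeating an answer or skipping computation, gets reinforced that much more strongly.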

One caveat the reader didn't ask for but should have: diversity from SFT is not free competence. Demonstration-trained agents are capped by the imagination of whoever curated the data. They can't learn from their own failures or generalize past demonstrated scenarios because they never interact with an environment (see "Can agents learn beyond what their training data shows?"). And there's evidence SFT's apparent learning is shallower than it looks: instruction tuning largely transfers knowledge of the *output format* rather than genuine task understanding (see "Does instruction tuning teach task understanding or output format?"). So the real takeaway is a tension, not a verdict: SFT buys you wide exploration but bounded ceilings; RL buys you a higher ceiling but collapses the exploration that found it. That is precisely why the strongest results come from sequencing them rather than choosing one.

Sources (10 notes)

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL updates only 5–30% of parameters, a sparsity that emerges without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Does preference tuning always reduce diversity the same way?

RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. Scheduling guided by backward transfer (BWT), which trains structured tasks first, yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Does sequencing imitation then exploration training improve reasoning?

Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Do critique models improve diversity during training itself?

Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.

Can agents learn beyond what their training data shows?

Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions (43% versus a random baseline of 42.6%). The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains open: Why does supervised fine-tuning on diverse demonstrations expand exploration diversity compared to RL?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat each as a snapshot of understanding at its publication date.
• RL converges on single-peak reward optima, causing entropy collapse and parameter sparsity (only 5–30% of weights update in consistent subnetworks) (~2025).
• SFT on diverse demonstrations preserves behavioral breadth by spreading probability mass across dataset behaviors, unlike RL's sharpening (~2025).
• RL's diversity loss reflects concentration of pretraining distributions rather than exploration of new ground; format dominance depends partly on model scale (~2025).
• Domain-dependence: preference tuning *reduces* diversity in structured tasks (code) but *increases* it in open-ended ones (creative writing); entropy dynamics are task-sensitive (~2024–2025).
• SFT + RL sequencing outperforms either alone; imitation builds behavioral spread that makes reward signals informative; mid-training critique models can counteract tail-narrowing (~2024–2026).

Anchor papers (verify; mind their dates):
• arXiv:2504.07912 (2025-04) — Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
• arXiv:2505.11711 (2025-05) — Reinforcement Learning Finetunes Small Subnetworks in Large Language Models
• arXiv:2605.22817 (2026-05) — Vector Policy Optimization: Training for Diversity Improves Test-Time Search
• arXiv:2411.16579 (2024-11) — Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above, assess whether newer models, training methods (e.g., multi-task scheduling, hybrid reward schemes), evaluation harnesses, or post-hoc search techniques (2026–now) have RELAXED the diversity–ceiling tradeoff or OVERTURNED the claim that SFT preserves diversity while RL collapses it. Separate the durable tension (likely still real) from perishable limitations (possibly resolved by curriculum design, auxiliary losses, or vectorized policies). Cite what moved each constraint.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months (if any) that questions whether domain-dependent effects or sequencing strategies truly reconcile the SFT–RL tension.
(3) Propose 2 research questions that ASSUME the diversity–ceiling tradeoff may be *mechanically separable* or that the frontier has shifted toward unified training regimes.

Cite arXiv IDs; flag anything you cannot ground in a real paper.
