Can diversity-aware RL objectives prevent format convergence?

This explores whether RL objectives that explicitly reward diversity can stop the well-documented tendency of RL training to collapse a model's outputs into one dominant format or strategy.

It helps to start with how strong the flattening pressure is. RL post-training doesn't gradually erode variety; it picks a winner fast. Experiments show RL latches onto one dominant format inherited from pretraining and suppresses the alternatives within the first epoch, and which format wins depends on model scale, not necessarily on which format performs best (Does RL training collapse format diversity in pretrained models?). The same squeeze shows up beyond formatting: search agents lose exploration breadth through the same entropy-collapse mechanism seen in reasoning models, with policies converging on narrow reward-maximizing strategies (Does reinforcement learning squeeze exploration diversity in search agents?). Outcome-only rewards make it worse: sharpening the policy on solved problems also reduces diversity on unsolved ones (Does outcome-based RL diversity loss spread across unsolved problems?).
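One rough way to make "format collapse" measurable is to label each rollout with a format and track the entropy of that label distribution over training. The sketch below is an illustration only: the format labels and the `format_entropy` helper are hypothetical and are not taken from any of the cited experiments.

```python
import math
from collections import Counter

def format_entropy(format_labels):
    """Shannon entropy (in bits) of the empirical format distribution."""
    counts = Counter(format_labels)
    total = len(format_labels)
    # p * log2(1/p) summed over the observed formats
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Hypothetical format labels for 8 rollouts, before and after RL.
before = ["prose", "code", "latex", "prose", "code", "latex", "prose", "code"]
after = ["code"] * 8

print("entropy before RL:", format_entropy(before))
print("entropy after RL: ", format_entropy(after))
```

An entropy that drops toward zero within the first epoch would be the signature of the fast winner-take-all behavior described above; a slow decline would point to something more gradual.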

So can a diversity-aware objective actually prevent this? The most direct evidence says yes, and with a bonus. DARLING jointly optimizes for quality and *semantic* diversity using a learned classifier, and finds that the diversity reward doesn't just preserve variety: it catalyzes exploration and yields higher-quality outputs than quality-only baselines on both creative and mathematical tasks (Can diversity optimization improve quality during language model training?). That's the key insight you might not expect: diversity and quality aren't a trade-off here, because the exploration that diversity rewards is what finds better answers. A related method, DRO, gets two uses out of a single statistic: cross-rollout variance both weights tokens in the dense reward and filters out degenerate queries, which stabilizes training and makes it 2–3× faster on unverifiable tasks (Can one statistical measure serve dual purposes in RL training?).
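To make the idea of a joint quality-and-diversity reward concrete, here is a minimal sketch of one possible shaping: each rollout's quality score is scaled by the fraction of other rollouts in its group that are semantically different from it. This is an assumption for illustration, not DARLING's published formulation. The names `diversity_scores`, `diversity_aware_rewards`, and `toy_equivalent` are hypothetical, and `toy_equivalent` is a crude string comparison standing in for the learned classifier.

```python
from typing import Callable, List

def diversity_scores(responses: List[str],
                     is_equivalent: Callable[[str, str], bool]) -> List[float]:
    """Fraction of the *other* responses in the group that differ semantically."""
    n = len(responses)
    if n < 2:
        return [1.0] * n
    scores = []
    for i, r in enumerate(responses):
        distinct = sum(
            1 for j, other in enumerate(responses)
            if j != i and not is_equivalent(r, other)
        )
        scores.append(distinct / (n - 1))
    return scores

def diversity_aware_rewards(responses: List[str],
                            quality: List[float],
                            is_equivalent: Callable[[str, str], bool]) -> List[float]:
    """Scale each quality score by that response's diversity score."""
    div = diversity_scores(responses, is_equivalent)
    return [q * d for q, d in zip(quality, div)]

def toy_equivalent(a: str, b: str) -> bool:
    # Stand-in for a learned semantic-equivalence classifier (assumption).
    return a.strip().lower() == b.strip().lower()

responses = ["Use induction", "use induction ", "Count pairs", "Use induction"]
quality = [1.0, 1.0, 1.0, 1.0]
rewards = diversity_aware_rewards(responses, quality, toy_equivalent)
print(rewards)
```

With equal quality scores, the one rollout that takes a different approach keeps its full reward while the three near-duplicates are discounted. A multiplicative form like this gives no credit to a diverse but low-quality answer, and it also zeroes out a group whose rollouts are all equivalent, which may or may not be desirable.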

But "diversity-aware" turns out to be domain-dependent, and that complicates any one-size answer. Preference tuning *reduces* lexical diversity in code (where convergence toward correct solutions is the point) while *increasing* it in creative writing (Does preference tuning always reduce diversity the same way?). Multi-task work on Omni-Thinker makes the mechanism concrete: structured domains drive output entropy down and creative domains drive it up, so the order you train them in matters. Training structured tasks first keeps entropy collapse from damaging open-ended skills, a 6.2% gain over naive joint training (Does training order reshape how models handle different task types?). The lesson is that a diversity objective isn't a global knob; it has to know which domain it's protecting.

There are also failure modes a diversity objective alone won't fix. Overly hard RLVR samples push models into degenerate shortcuts, such as answer repetition and skipped computation, that contaminate existing capabilities, because group-relative normalization treats rare lucky successes as high-advantage trajectories (Do overly hard RLVR samples actually harm model capabilities?). Binary rewards separately wreck calibration by rewarding confident guessing, which is fixable by adding a Brier-score term (Does binary reward training hurt model calibration?). Both suggest the same structural fix as diversity rewards: a second objective that the dominant reward can't bulldoze.
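Both failure modes can be illustrated with a few lines of arithmetic. The sketch below assumes a GRPO-style advantage (reward minus group mean, divided by group standard deviation) and writes the calibration-aware reward as correctness minus the squared error of the stated confidence. The function names and the exact reward form are assumptions for illustration, not the cited papers' implementations.

```python
import statistics
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward by the mean and population std of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def brier_augmented_reward(correct: bool, confidence: float) -> float:
    """Binary correctness reward minus a Brier penalty on stated confidence."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2

# One accidental success out of 16 rollouts on a nearly impossible problem,
# versus 8 successes out of 16 on a problem of moderate difficulty.
lucky = [1.0] + [0.0] * 15
balanced = [1.0] * 8 + [0.0] * 8
print("advantage of the lone success: ", group_relative_advantages(lucky)[0])
print("advantage of a typical success:", group_relative_advantages(balanced)[0])

# A confident wrong answer is penalized more than a hedged wrong answer.
print("wrong, 90% confident:", brier_augmented_reward(False, 0.9))
print("wrong, 20% confident:", brier_augmented_reward(False, 0.2))
```

The lone success receives an advantage several times larger than a success in the balanced group, which is how a lucky shortcut can get reinforced so strongly. Under the plain binary reward both wrong answers would score zero; with the Brier term the overconfident one scores lower.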

Worth knowing: the convergence problem is bigger than any single training run. Across 70+ models and 26K open-ended queries, different LLMs independently produce strikingly similar outputs, an "Artificial Hivemind" born of overlapping training data and shared alignment procedures (Do different AI models actually produce diverse outputs?). That means format convergence isn't just something RL does to one model; it's a field-wide attractor, and diversity-aware objectives are one of the few levers pointed directly against it. For an architectural rather than reward-based angle, structuring a model's reasoning as an internal dialogue rather than a monologue also recovers diversity on tasks needing multiple approaches (Can dialogue format help models reason more diversely?).

Sources (11 notes)

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Does outcome-based RL diversity loss spread across unsolved problems?

RL that rewards only final answer correctness sharpens the policy globally, concentrating probability mass on correct trajectories for solved problems while simultaneously reducing diversity on unsolved ones. Historical exploration (training diversity via UCB-style bonuses) and batch exploration (test-time diversity via repetition penalties) require structurally different mechanisms.

Can diversity optimization improve quality during language model training?

DARLING jointly optimizes for quality and semantic diversity using a learned classifier, finding that diversity rewards catalyze exploration and produce higher-quality outputs than quality-only baselines across both creative and mathematical tasks.

Can one statistical measure serve dual purposes in RL training?

DRO reuses a single self-supervised statistic at two aggregation levels: token-level weighting in dense rewards and query-level filtering to discard degenerate comparisons. This dual use achieves 2–3× faster training with better stability on unverifiable tasks.

Does preference tuning always reduce diversity the same way?

RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Do different AI models actually produce diverse outputs?

INFINITY-CHAT analyzed 70+ models across 26K open-ended queries and found an "Artificial Hivemind" effect: models independently generate strikingly similar or identical responses due to overlapping training data and alignment procedures, undermining the diversity benefits of model ensembles.

Can dialogue format help models reason more diversely?

DialogueReason, which structures a single model's internal reasoning as dialogue between distinct agents in separate scenes, overcomes monologue reasoning's fixed-strategy and fragmented-attention weaknesses, especially on tasks requiring multiple problem-solving approaches.
