How should we evaluate diversity differently across programming and creative tasks?

This explores why 'diversity' means opposite things when you're generating code versus writing prose — and why a single diversity metric breaks when you apply it across both.

The corpus has a sharp, almost counterintuitive answer: the same training move pushes diversity in opposite directions depending on the domain. Preference tuning (RLHF) *reduces* lexical and syntactic variety in code generation but *increases* it in creative writing, because each domain rewards something different. Code rewards convergence toward correct solutions; creative writing rewards stylistic distinctiveness. So a metric that treats 'more variation = better' is actively wrong for code, where variation beyond a correct answer is often noise (see "Does preference tuning always reduce diversity the same way?").

The deeper lesson is that 'diversity' can't be scored in isolation. One note separates three properties of generated data (quality, diversity, and complexity) that produce distinct downstream effects: quality drives in-distribution generalization, diversity enables out-of-distribution generalization, and complexity strengthens both. Collapsing them into a single score is exactly what makes self-improvement loops quietly degrade through irreversible diversity loss (see "How do quality, diversity, and complexity affect synthetic data differently?"). For programming, what you usually want to measure is *functional* diversity, meaning distinct correct strategies, not surface-level token variety. For creative work, surface variety (lexical, perspectival) is the point, not a proxy.
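To make the contrast between surface variety and functional diversity concrete, here is a minimal sketch. It is not a method from any of the notes: `distinct_n` is a common lexical-variety ratio, and `functional_diversity` is a hypothetical proxy that counts structurally different candidates among those that pass a set of test cases, treating renamed copies as the same strategy. The function names, the AST-based grouping, and the toy candidates are all assumptions made for illustration.

```python
import ast


def distinct_n(texts, n=2):
    """Surface variety: unique n-grams divided by total n-grams."""
    ngrams = []
    for text in texts:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


class _Normalize(ast.NodeTransformer):
    """Replace identifiers with a placeholder so renamed copies collapse."""

    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="_", ctx=node.ctx), node)

    def visit_arg(self, node):
        node.arg = "_"
        return node


def structure_key(source):
    """Hypothetical strategy fingerprint: the AST with names erased."""
    return ast.dump(_Normalize().visit(ast.parse(source)))


def passes(source, func_name, cases):
    """Run a candidate against (args, expected) pairs.

    Assumption: candidates are trusted. exec() on real model output
    needs a sandbox.
    """
    namespace = {}
    try:
        exec(source, namespace)
        return all(namespace[func_name](*args) == expected for args, expected in cases)
    except Exception:
        return False


def functional_diversity(sources, func_name, cases):
    """Count structurally distinct candidates among the correct ones."""
    correct = [s for s in sources if passes(s, func_name, cases)]
    return len({structure_key(s) for s in correct})


# Toy candidates (illustrative only).
builtin_sum = "def total(xs):\n    return sum(xs)\n"
renamed_copy = "def total(values):\n    return sum(values)\n"
explicit_loop = (
    "def total(xs):\n    acc = 0\n    for x in xs:\n        acc += x\n    return acc\n"
)
wrong_answer = "def total(xs):\n    return len(xs)\n"

candidates = [builtin_sum, renamed_copy, explicit_loop, wrong_answer]
cases = [(([1, 2, 3],), 6), (([],), 0), (([5],), 5)]

print(distinct_n(candidates))
print(functional_diversity(candidates, "total", cases))
```

In this toy set the renamed copy looks fully novel to the n-gram ratio but adds nothing functionally, and the wrong answer adds surface variety while contributing no correct strategy. A real evaluation would likely need a stronger equivalence notion than AST shape, such as behavioral clustering on held-out inputs.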

That distinction matters because creative diversity has its own failure mode that code doesn't share: models can scale *claims* without scaling *perspectives*. A thousand AI-written articles often encode roughly one viewpoint, because the model follows probabilistic patterns rather than exploring competing positions (see "Does AI generate diverse claims or diverse perspectives?"). Worse, different models converge on near-identical open-ended outputs (an 'Artificial Hivemind' across 70+ models), so for creative tasks you can't even trust an ensemble to be diverse, and lexical-overlap metrics will badly overstate how much real variety exists (see "Do different AI models actually produce diverse outputs?"). None of these traps show up if you only ever benchmark code.

There's also a budget angle that flips intuition: for generating varied outputs, smaller models (~500M params) produce more unique samples per draw than larger ones, because big models concentrate probability mass on a few preferred outputs (see "Why aren't bigger models better for generating diverse outputs?"). So even the right *evaluation* depends on what you're optimizing. One method, DARLING, shows you can reward semantic diversity directly and improve quality *and* diversity together across both creative and mathematical tasks, which suggests the two domains need domain-aware diversity definitions more than entirely separate machinery (see "Can diversity optimization improve quality during language model training?").
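The probability-mass argument can be checked with a small calculation. The sketch below computes the expected number of distinct outputs in a fixed number of independent draws from a categorical distribution, using the standard identity E[unique] = sum over i of 1 - (1 - p_i)^k. The two distributions are invented stand-ins for a "peaked" and a "flat" sampler, not measurements of any real model.

```python
def expected_unique(probs, draws):
    """Expected count of distinct outcomes after `draws` i.i.d. samples."""
    return sum(1.0 - (1.0 - p) ** draws for p in probs)


# Hypothetical output distributions over ten possible completions.
peaked = [0.9] + [0.1 / 9] * 9  # most mass on one preferred output
flat = [0.1] * 10               # mass spread evenly

budget = 10  # same sampling budget for both
print(expected_unique(peaked, budget))
print(expected_unique(flat, budget))
```

Under these assumed distributions the flat sampler yields several times as many distinct outputs for the same budget, which is the shape of the effect the note describes. Whether a given pair of real models behaves this way is an empirical question the sketch does not answer.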

If you want to go one layer down, the thing both domains share is fragility under reinforcement learning: RL compresses behavioral diversity through entropy collapse, with policies converging on narrow reward-maximizing strategies, whether the task is reasoning, search, or generation (see "Does reinforcement learning squeeze exploration diversity in search agents?"). The takeaway for evaluation: for code, measure whether you've preserved enough distinct *correct* solutions; for creative work, measure whether you've preserved distinct *viewpoints*, not just distinct words. And don't trust a single diversity number to do either job.
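One simple way to watch for this kind of collapse is to track the Shannon entropy of whatever unit you care about preserving: strategy labels for code, viewpoint labels for creative work. The sketch below assumes each sample has already been assigned such a label by some upstream step; that labeling step, and the example labels themselves, are hypothetical.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy in bits of the empirical label distribution."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical strategy labels for samples drawn before and after RL.
before_rl = ["recursion", "iteration", "builtin", "memoized"]
after_rl = ["builtin", "builtin", "builtin", "builtin"]

print(label_entropy(before_rl))
print(label_entropy(after_rl))
```

The same function applies to both domains; what changes is the labeling. Entropy over tokens would reward noise in code and miss single-viewpoint prose, so the label definition carries the domain awareness.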

Sources 7 notes

Does preference tuning always reduce diversity the same way?

RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.

How do quality, diversity, and complexity affect synthetic data differently?

Quality drives in-distribution generalization, diversity enables out-of-distribution generalization, and complexity strengthens both. Current evaluation methods collapse these into a single quality metric, causing self-improvement loops to degrade through irreversible diversity loss.

Does AI generate diverse claims or diverse perspectives?

Large language models generate numerous well-formed claims by following probabilistic patterns in training data, not by exploring competing argumentative positions. This produces volume without perspectival diversity—a thousand AI articles often represent approximately one viewpoint.

Do different AI models actually produce diverse outputs?

INFINITY-CHAT analyzed 70+ models across 26K open-ended queries and found an "Artificial Hivemind" effect: models independently generate strikingly similar or identical responses due to overlapping training data and alignment procedures, undermining the diversity benefits of model ensembles.

Why aren't bigger models better for generating diverse outputs?

Research shows that for synthetic data generation, models around 500M parameters outperform larger ones in output diversity per sample. Larger models concentrate probability mass on preferred outputs, reducing the variety of distinct samples generated within a fixed budget.

Can diversity optimization improve quality during language model training?

DARLING jointly optimizes for quality and semantic diversity using a learned classifier, finding that diversity rewards catalyze exploration and produce higher-quality outputs than quality-only baselines across both creative and mathematical tasks.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.
