INQUIRING LINE

Why does positive reinforcement degrade diversity at higher k values?

This explores why training a model on its own correct answers (positive reinforcement) hurts the diversity of solutions you can sample — specifically Pass@k, where you draw k attempts and check if any succeed.


Rewarding a model for its correct answers quietly erodes performance when you sample many attempts, not just one. The cleanest answer in the corpus is mechanical: positive-only reinforcement works by concentrating probability mass onto the trajectories that already succeed. At k=1 that looks like an improvement, because the single most-likely output is now more reliable. But Pass@k at higher k depends on the model still being *able* to produce many different correct paths. Once probability has been vacuumed onto a few winning trajectories, the long tail of alternative-but-valid solutions thins out, so drawing more samples stops buying you new ways to succeed. "Does negative reinforcement alone outperform full reinforcement learning?" makes the contrast sharp: training on *only* the wrong answers (pushing probability away from failures rather than toward winners) matches or beats full RL on Pass@k precisely because it suppresses bad trajectories without collapsing the spread of good ones.
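The trade-off between k=1 and higher k can be made concrete with toy numbers. The sketch below is only an illustration: the per-problem success rates are hypothetical, and it assumes each sample on a problem succeeds independently with a fixed probability p, so the chance that at least one of k samples succeeds is 1 - (1 - p)^k.

```python
def pass_at_k(success_probs, k):
    """Expected Pass@k, assuming independent samples with per-problem success rate p."""
    return sum(1 - (1 - p) ** k for p in success_probs) / len(success_probs)

# Hypothetical per-problem success rates over 10 problems (illustrative only).
diverse = [0.5] * 10                # spread-out policy: a fair chance on every problem
sharpened = [1.0] * 6 + [0.0] * 4   # concentrated policy: certain on 6, never right on 4

for k in (1, 4, 16):
    print(f"k={k:2d}  diverse={pass_at_k(diverse, k):.3f}  sharpened={pass_at_k(sharpened, k):.3f}")
```

Under these assumptions the sharpened policy wins at k=1 but stays flat as k grows, while the diverse policy keeps improving with every extra sample. The numbers do not come from any of the cited papers.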

The interesting twist is that this isn't a local effect on the problems you trained on. "Does outcome-based RL diversity loss spread across unsolved problems?" shows the sharpening is global: rewarding final-answer correctness concentrates the policy everywhere, so diversity also drains away on problems the model never solved and never got reward signal for. The model becomes more confident in general, including in places where confidence is exactly what you don't want.

This is the same phenomenon researchers call entropy collapse, and it shows up far outside math reasoning. "Does reinforcement learning squeeze exploration diversity in search agents?" documents the identical squeeze in search agents, where policies converge on narrow reward-maximizing strategies, and notes that supervised fine-tuning on diverse demonstrations preserves the breadth that RL destroys. So the degradation isn't a quirk of one task; it's what scalar-reward maximization tends to do whenever the reward favors a single answer.
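Entropy collapse can be pictured as a distribution over strategies getting progressively peakier. The following is a minimal sketch, not the mechanism of any cited paper: it uses power sharpening (raising each probability to an exponent `beta` and renormalizing) as a hypothetical stand-in for what reward maximization does to a policy, and measures Shannon entropy before and after.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def sharpen(probs, beta):
    """Raise each probability to the power beta and renormalize (beta > 1 concentrates mass)."""
    powered = [p ** beta for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

# Hypothetical policy over four distinct strategies (illustrative only).
base = [0.4, 0.3, 0.2, 0.1]

for beta in (1.0, 2.0, 4.0):
    dist = sharpen(base, beta)
    print(f"beta={beta}: entropy={entropy(dist):.3f}, top-strategy mass={max(dist):.3f}")
```

As `beta` grows, entropy falls and the top strategy absorbs most of the mass, which is the "narrow reward-maximizing strategies" picture in miniature.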

Why higher k specifically gets hit hardest connects to a deeper point: when you plan to sample many times or run search at inference, the *right* training objective changes. "Should training maximize diversity when models feed into search?" argues that a model feeding into evolutionary or repeated-sampling procedures should be trained to emit many competent-but-different solutions, because an entropy-collapsed policy literally cannot reach problems that require combining modes. Positive reinforcement optimizes the wrong thing for that regime: it maximizes the best single guess while quietly destroying the variety that high-k sampling exists to exploit.

The corpus also points to fixes that recover diversity without giving up quality, which is the part you might not know you wanted. "Can reward vectors be the hidden source of solution diversity?" keeps rewards unscalarized (decomposed per test-case or criterion) so solutions specialize along real trade-offs instead of collapsing to one. "Can diversity optimization improve quality during language model training?" adds a semantic-diversity reward and finds it *catalyzes* exploration, producing higher quality than quality-only training. And "Do critique models improve diversity during training itself?" shows step-level critique inside the training loop counteracts the tail-narrowing directly. One caveat worth holding: "Does preference tuning always reduce diversity the same way?" finds the direction isn't universal. Reinforcement compresses diversity where the domain rewards convergence (code) but can expand it where the domain rewards distinctiveness (creative writing). The degradation at high k is what happens when your reward says 'one right answer,' which is most of reasoning, but not all of everything.
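To see what keeping rewards unscalarized buys, compare picking the single best sum of per-test rewards with keeping every candidate that no other candidate dominates. This is a hedged sketch of the Pareto idea only, not the Vector Policy Optimization algorithm; the candidate names and pass/fail vectors are hypothetical.

```python
def dominates(a, b):
    """True if reward vector a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(rewards):
    """Names of candidates whose reward vectors are not dominated by any other candidate."""
    return {
        name
        for name, vec in rewards.items()
        if not any(
            dominates(other_vec, vec)
            for other_name, other_vec in rewards.items()
            if other_name != name
        )
    }

# Hypothetical candidates scored on three test cases (1 = pass, 0 = fail).
rewards = {
    "A": (1, 1, 0),
    "B": (0, 0, 1),
    "C": (1, 0, 0),
}

best_scalar = max(rewards, key=lambda name: sum(rewards[name]))
front = pareto_front(rewards)
print("scalarized winner:", best_scalar)
print("Pareto front:", sorted(front))
```

Summing the rewards keeps only candidate A, while the Pareto view also keeps B, the only candidate that passes the third test. That retained specialist is the kind of competent-but-different solution a downstream search procedure can combine with others.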


Sources (8 notes)

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Does outcome-based RL diversity loss spread across unsolved problems?

RL that rewards only final answer correctness sharpens the policy globally, concentrating probability mass on correct trajectories for solved problems while simultaneously reducing diversity on unsolved ones. Historical exploration (training diversity via UCB-style bonuses) and batch exploration (test-time diversity via repetition penalties) require structurally different mechanisms.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Should training maximize diversity when models feed into search?

Vector Policy Optimization trains models to emit varied competent solutions rather than converging to one answer. This unlocks search procedures like evolutionary algorithms to explore and combine modes, solving problems that entropy-collapsed policies cannot reach at all.

Can reward vectors be the hidden source of solution diversity?

Vector Policy Optimization shows that rewards decomposed per test-case, criterion, or persona provide an inherent diversity structure. Training solutions to span the Pareto frontier across these dimensions produces competent diversity grounded in real task trade-offs rather than external regularizers.

Can diversity optimization improve quality during language model training?

DARLING jointly optimizes for quality and semantic diversity using a learned classifier, finding that diversity rewards catalyze exploration and produce higher-quality outputs than quality-only baselines across both creative and mathematical tasks.

Do critique models improve diversity during training itself?

Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.

Does preference tuning always reduce diversity the same way?

RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reinforcement learning researcher re-evaluating claims about positive RL's effect on sample diversity. The question remains open: *under what training and inference regimes does scalar-reward RL erode the diversity needed for high-k sampling, and what interventions genuinely recover it?*

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–11 through 2026–05.
• Positive reinforcement concentrates probability onto successful trajectories, collapsing the long tail of alternative-but-valid solutions; negative-only RL (suppressing failures) matches or exceeds full RL on Pass@k without this squeeze (2025–06).
• Diversity loss from outcome-based RL is global and transfers to unsolved problems, not local to trained tasks (2025–09).
• The same entropy collapse appears in search agents: policies converge on narrow reward-maximizing strategies while SFT on diverse demos preserves breadth (path spans 2024–2026).
• Fixes exist: vector-valued rewards (unscalarized per test case), explicit semantic-diversity bonuses, and step-level critique in training all recover diversity without sacrificing quality (2025–09, 2026–05).
• Domain dependence: compression occurs when rewards favor single solutions (reasoning, code) but may not hold in domains rewarding distinctiveness (creative writing) (path notes, 2025+).

Anchor papers (verify; mind their dates):
• 2506.01347 (negative RL superiority)
• 2509.02534 (joint diversity–quality reinforcement)
• 2605.22817 (vector policy optimization)
• 2411.16579 (critique-driven diversity)

Your task:
(1) RE-TEST EACH CONSTRAINT. For each claim above, probe whether post-2026–06 scaling (larger models, longer horizon reasoning, multi-agent orchestration) or newer reward abstractions (e.g., process rewards, hierarchical RL) have shifted the boundary where diversity collapses. Does the tail-narrowing still hold for reasoning on current frontier models (o1, o3 scale)? Has explicit diversity-as-reward become standard in production RL fine-tuning, and if so, does it materially change the tradeoff? Flag which constraints appear robust and which may have moved.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months: does any recent paper claim scalar RL *preserves* or *enhances* diversity under specific conditions, and if so, what are those conditions? Does any work suggest the diversity loss is an artifact of small-scale experiments, not scaling?
(3) Propose two research questions assuming the regime has moved: (a) If diversity loss is now routinely mitigated in production RL pipelines, what is the *new* bottleneck for high-k performance? (b) Can vector or multi-objective rewards fully decouple quality from diversity, or do they introduce new failure modes at scale?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
