Can the same variance signal work as both reward and query filter?
This explores whether one statistic — the spread (variance) across multiple sampled answers to the same prompt — can do double duty: shaping the reward that trains the model *and* deciding which prompts are worth training on at all.
The corpus has a direct answer, plus a good deal of surrounding context. The clearest 'yes' comes from DRO, which reuses one self-supervised statistic (cross-rollout variance, the disagreement among multiple answers sampled for the same prompt) at two levels of aggregation. Aggregated at the token level, it becomes a dense reward that weights which parts of a response to reinforce; aggregated at the query level, it becomes a filter that discards prompts where all rollouts look the same and the comparison is therefore degenerate. The payoff is not just elegance: this dual use reportedly trains 2–3× faster, with better stability, on tasks that have no verifiable ground truth (see: Can one statistical measure serve dual purposes in RL training?).
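To make the two levels of aggregation concrete, here is a minimal sketch of one variance statistic used both ways. It is not DRO's actual formula: the `token_scores` layout (one row per rollout, one aligned score per token position), the function names, and the `min_variance` threshold are all assumptions made for illustration.

```python
from statistics import pvariance


def token_variances(token_scores):
    """Variance across rollouts at each aligned token position.

    token_scores[r][t] is a hypothetical per-token score for rollout r at
    position t (an assumed layout, not taken from the DRO paper).
    """
    return [pvariance(column) for column in zip(*token_scores)]


def token_weights(token_scores):
    """Token-level use: normalize variances into weights that sum to 1."""
    variances = token_variances(token_scores)
    total = sum(variances)
    if total == 0:
        # All rollouts agree everywhere, so there is nothing to weight.
        return [0.0] * len(variances)
    return [v / total for v in variances]


def keep_query(token_scores, min_variance=1e-3):
    """Query-level use: keep the prompt only if rollouts disagree enough.

    min_variance is an illustrative threshold, not a published value.
    """
    variances = token_variances(token_scores)
    return sum(variances) / len(variances) > min_variance


# Three rollouts that disagree only at the middle position.
scores = [[0.9, 0.2, 0.5], [0.9, 0.8, 0.5], [0.9, 0.5, 0.5]]
print(token_weights(scores))  # [0.0, 1.0, 0.0]
print(keep_query(scores))     # True
```

The same list of variances feeds both functions: normalized, it says where in a response the signal lives; averaged and thresholded, it says whether the prompt carries any signal at all.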
What makes this work is that variance is a measure of *signal density*, and once you see it that way, the same idea shows up elsewhere as a weighting principle. DVAO weights multiple reward objectives by their within-group variance, automatically amplifying objectives that carry real signal and muting noisy ones, with no hyperparameter tuning (see: How should multiple reward objectives be weighted during training?). That is the same intuition as the query filter, just pointed at a different axis: low variance means low information, so down-weight or discard it. The filtering side of DRO is the extreme version of this: a hard cutoff rather than a soft weight.
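The soft-weight version can be sketched in a few lines. This assumes weights proportional to each objective's within-group variance and per-objective standardized advantages; the exact DVAO formulation may differ, and the function names and the `objective_rewards` layout are hypothetical.

```python
from statistics import mean, pstdev, pvariance


def variance_weights(objective_rewards):
    """Weight each objective by its share of total within-group variance.

    objective_rewards maps an objective name to its rewards over one
    rollout group (an assumed layout for this sketch).
    """
    variances = {name: pvariance(r) for name, r in objective_rewards.items()}
    total = sum(variances.values())
    if total == 0:
        return {name: 0.0 for name in variances}
    return {name: v / total for name, v in variances.items()}


def combined_advantages(objective_rewards):
    """Sum standardized per-objective advantages using the variance weights."""
    weights = variance_weights(objective_rewards)
    group_size = len(next(iter(objective_rewards.values())))
    combined = [0.0] * group_size
    for name, rewards in objective_rewards.items():
        mu, sigma = mean(rewards), pstdev(rewards)
        if sigma == 0:
            continue  # a constant objective cannot rank the rollouts
        for i, r in enumerate(rewards):
            combined[i] += weights[name] * (r - mu) / sigma
    return combined


group = {
    "correctness": [1.0, 0.0, 1.0, 0.0],  # rollouts disagree: carries signal
    "format": [1.0, 1.0, 1.0, 1.0],       # every rollout passes: no signal
}
print(variance_weights(group))     # {'correctness': 1.0, 'format': 0.0}
print(combined_advantages(group))  # [1.0, -1.0, 1.0, -1.0]
```

Because the weights sum to 1 and each advantage is standardized, the combined advantage stays on a bounded scale without a hand-tuned coefficient per objective.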
The filter-versus-reward distinction turns out to matter for reasons beyond efficiency. A companion DRO finding argues that some signals should be used as *gates* (accept or reject a whole rollout group) rather than converted into dense rewards, because forcing a categorical judgment into a continuous reward invites reward hacking. Rubrics used to admit or reject rollouts prevent the gaming that the same rubrics cause when turned into per-token scores (see: Can rubrics and dense rewards work together without hacking?). So the dual-use story is not 'one signal, used identically twice'; it is 'one signal whose categorical strength belongs at the filter stage and whose graded strength belongs at the reward stage.'
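The gate-versus-score separation can be illustrated with a small sketch. Everything here is hypothetical: the `Rollout` type, the toy `has_final_answer` rubric, and the choice to gate individual rollouts (a group-level gate would instead drop the whole group when the rubric fails). The point is only that the rubric decides admission and never enters the reward values.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rollout:
    text: str
    token_rewards: List[float]  # dense rewards, computed elsewhere


def gate_rollouts(
    rollouts: List[Rollout], rubric_passes: Callable[[str], bool]
) -> List[Rollout]:
    """Admit only rollouts that pass the rubric.

    The rubric is a yes/no gate: it is never added to or multiplied into
    token_rewards, so there is no rubric score for the policy to inflate.
    """
    return [r for r in rollouts if rubric_passes(r.text)]


def has_final_answer(text: str) -> bool:
    """Toy stand-in rubric: the response must state a final answer."""
    return "Final answer:" in text


group = [
    Rollout("Reasoning... Final answer: 4", [0.1, 0.9]),
    Rollout("I think it might be something", [0.5, 0.5]),
]
admitted = gate_rollouts(group, has_final_answer)
print(len(admitted))              # 1
print(admitted[0].token_rewards)  # [0.1, 0.9]
```

The admitted rollout keeps its dense rewards untouched, which is the separation the finding describes: the categorical check happens before optimization, and the graded signal operates only within what passed.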
There's a deeper reason variance is a natural filter here: it tracks diversity, and diversity is exactly what reinforcement learning tends to destroy. Negative-only reinforcement matches or beats full RL specifically because suppressing wrong trajectories preserves answer diversity, whereas positive-only reinforcement collapses probability mass onto a few modes and hurts Pass@k at higher k (see: Does negative reinforcement alone outperform full reinforcement learning?). A variance-based query filter is, in effect, a diversity meter: it keeps the prompts where the model still disagrees with itself and is therefore still learning.
The broader corpus also warns that not every internal signal is safe to reuse this freely. Spurious or even random rewards can boost reasoning, but only for models whose pretraining left the right latent behavior to activate: the same reward signal helps Qwen and does nothing for Llama (see: Why do random rewards improve reasoning for some models but not others?). And self-derived signals like model confidence can serve as rewards that improve reasoning and restore calibration, showing that the appetite for label-free signals is real (see: Can model confidence work as a reward signal for reasoning?). The thread tying these together: a statistic the model generates about itself can be repurposed across the training loop, but where you apply it (filter vs. reward, gate vs. score) is as consequential as the statistic you choose.
Sources (6 notes)
DRO reuses a single self-supervised statistic at two aggregation levels: token-level weighting in dense rewards and query-level filtering to discard degenerate comparisons. This dual use achieves 2–3× faster training with better stability on unverifiable tasks.
DVAO weights objectives by their within-group variance, automatically up-weighting high-signal objectives and suppressing noise without hyperparameter tuning. This keeps advantage magnitudes bounded and replaces fixed scalarization constants with data-driven weighting.
DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.
Training with only negative samples consistently improves Pass@k across the full range of k, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.
Qwen2.5-Math improves by 16-25% on MATH-500 when trained with random or incorrect rewards, which activate latent code-reasoning behavior from pretraining, while Llama and OLMo show no gains. Pretraining format determines what optimization pressure can surface.
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.