SYNTHESIS NOTE
Training, RL, and Test-Time Scaling

How should multiple reward objectives be weighted during training?

When training on multiple objectives at once, how can we automatically balance their contributions without manual tuning? This explores whether reward variance within rollouts reveals which objectives carry real learning signal.

Synthesis note · 2026-05-28 · sourced from Reinforcement Learning

When a single GRPO run optimizes several rewards at once (accuracy plus length plus format, say), the two standard approaches both fall short. Reward Combination sums the rewards before computing the advantage, which lets the squared advantage magnitudes explode and destabilizes training. Advantage Combination computes an advantage per objective and then mixes them, but it relies on fixed, hand-set weights and treats the objectives as independent, ignoring how they correlate within a rollout.
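To make the difference in order of operations concrete, here is a minimal sketch of the two baselines. It assumes the common GRPO-style group normalization (subtract the group mean, divide by the group standard deviation); the function names, the toy rewards, and the fixed weights are all hypothetical and not taken from the source. It shows only where the two schemes differ, and does not reproduce the instability described above.

```python
from statistics import mean, pstdev

def group_normalize(values, eps=1e-8):
    """Assumed GRPO-style normalization: center on the group mean, scale by group std."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]

def reward_combination(rewards_by_objective):
    """Sum the rewards of each response first, then compute a single advantage."""
    summed = [sum(per_response) for per_response in zip(*rewards_by_objective.values())]
    return group_normalize(summed)

def advantage_combination(rewards_by_objective, weights):
    """Compute one advantage per objective, then mix them with fixed weights."""
    per_obj = {k: group_normalize(v) for k, v in rewards_by_objective.items()}
    n = len(next(iter(per_obj.values())))
    return [sum(weights[k] * per_obj[k][i] for k in per_obj) for i in range(n)]

# One rollout group of four responses, scored on three objectives (toy numbers).
group = {
    "accuracy": [1.0, 0.0, 1.0, 0.0],
    "length":   [0.2, 0.9, 0.4, 0.7],
    "format":   [1.0, 1.0, 1.0, 1.0],  # saturated: every response passes
}
fixed_weights = {"accuracy": 1.0, "length": 0.5, "format": 0.5}  # hand-tuned constants

adv_rc = reward_combination(group)
adv_ac = advantage_combination(group, fixed_weights)
```

In the second function the `fixed_weights` dictionary is exactly the manual tuning knob the note is concerned with: it stays the same no matter how the group's rewards are distributed.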

DVAO's claim is that the right weighting signal is already sitting in the data: the empirical reward variance of each objective within a rollout group. High variance means the group's responses spread out on that objective, so there is a gradient to learn from. Low variance means the objective does not separate the group's responses, typically because it is saturated, so its contribution should shrink. Weighting by within-group variance therefore up-weights objectives carrying a real learning signal and down-weights the rest, automatically and without tuned constants.
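The weighting idea can be sketched as follows. This is an illustration under stated assumptions, not DVAO's actual formula, which the note does not give: weights are taken proportional to each objective's within-group variance and normalized to sum to 1, rewards are assumed to live on comparable scales, and the cross-objective regularizer is omitted. All names are hypothetical.

```python
from statistics import mean, pvariance

def variance_weights(rewards_by_objective, eps=1e-8):
    """Weight each objective by its share of the total within-group reward variance.

    Assumption: proportional weights normalized to sum to 1.
    """
    variances = {k: pvariance(v) for k, v in rewards_by_objective.items()}
    total = sum(variances.values())
    if total < eps:
        # No objective separates the responses; fall back to uniform weights.
        return {k: 1.0 / len(variances) for k in variances}
    return {k: var / total for k, var in variances.items()}

def variance_weighted_advantages(rewards_by_objective, eps=1e-8):
    """Mix per-objective normalized advantages using the variance-derived weights."""
    weights = variance_weights(rewards_by_objective, eps)
    n = len(next(iter(rewards_by_objective.values())))
    advantages = [0.0] * n
    for name, rewards in rewards_by_objective.items():
        mu = mean(rewards)
        sigma = pvariance(rewards) ** 0.5
        for i, r in enumerate(rewards):
            advantages[i] += weights[name] * (r - mu) / (sigma + eps)
    return weights, advantages

# One rollout group of four responses (toy numbers).
group = {
    "accuracy": [1.0, 0.0, 1.0, 0.0],  # responses disagree: high variance
    "length":   [0.2, 0.9, 0.4, 0.7],  # moderate spread
    "format":   [1.0, 1.0, 1.0, 1.0],  # saturated: zero variance
}
weights, advantages = variance_weighted_advantages(group)
```

On this toy group the saturated `format` objective receives zero weight and `accuracy`, with the widest spread, receives the largest, with no constants set by hand.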

Why it matters: this reframes multi-objective scalarization as an estimation problem rather than a preference-setting problem. You are no longer asking "how much do I value format versus accuracy?" but "which objective currently has signal worth following?" The paper proves the scheme keeps advantage magnitudes bounded (the stability win) and folds in a cross-objective regularizer so each objective's gradient is modulated by the rollout's overall multi-objective performance. The counterpoint is that variance can be a misleading proxy — a noisy reward model also produces high variance — so the method presumes reward signals are clean enough that spread tracks learnability.
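The counterpoint can be illustrated with a small, deterministic toy case. The sketch below assumes the same proportional variance weighting as a stand-in for the method (not its actual formula) and a hypothetical `style` objective on which every response truly earns the same score. Adding reward-model noise to that objective raises its within-group variance, and with it the weight, even though nothing learnable has changed.

```python
from statistics import pvariance

def variance_weights(rewards_by_objective, eps=1e-8):
    """Assumed weighting: each objective's share of total within-group variance."""
    variances = {k: pvariance(v) for k, v in rewards_by_objective.items()}
    total = sum(variances.values())
    if total < eps:
        return {k: 1.0 / len(variances) for k in variances}
    return {k: var / total for k, var in variances.items()}

accuracy = [1.0, 0.0, 1.0, 0.0]          # genuine spread between responses
true_style = [0.8, 0.8, 0.8, 0.8]        # every response is equally good
scorer_noise = [0.3, -0.3, 0.3, -0.3]    # hypothetical reward-model error
observed_style = [s + e for s, e in zip(true_style, scorer_noise)]

clean_w = variance_weights({"accuracy": accuracy, "style": true_style})
noisy_w = variance_weights({"accuracy": accuracy, "style": observed_style})
```

With the clean scores, `style` gets zero weight; with the noisy scores it takes a substantial share of the total, which is the failure mode the method has to assume away.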

Original note title

multi-reward grpo should weight each objective by its empirical reward variance