SYNTHESIS NOTE

How should multiple reward objectives be weighted during training?

When training on multiple objectives at once, how can we automatically balance their contributions without manual tuning? This note explores whether reward variance within a rollout group reveals which objectives carry real learning signal.

Synthesis note · 2026-05-28 · sourced from Reinforcement Learning

When a single GRPO run optimizes several rewards at once (accuracy plus length plus format, say), both standard approaches fall short. Reward Combination sums the rewards before computing the advantage, which lets the squared advantage magnitudes explode and destabilizes training. Advantage Combination computes advantages per objective and then mixes them, but it relies on fixed hyperparameter weights and treats objectives as independent, ignoring how they correlate within a rollout.
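The two baselines can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes the usual GRPO group normalization (subtract the group mean, divide by the group standard deviation), and the function names, the `(G, K)` reward layout, and the `weights` argument are hypothetical choices made for the example.

```python
import numpy as np

def group_normalize(x, eps=1e-8):
    """Assumed GRPO-style normalization: zero mean, unit std within one rollout group."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.std() + eps)

def reward_combination(rewards, eps=1e-8):
    """Sum the objectives first, then compute a single advantage per response.

    rewards: array of shape (G, K), G responses in the group, K objectives.
    """
    rewards = np.asarray(rewards, dtype=float)
    return group_normalize(rewards.sum(axis=1), eps)

def advantage_combination(rewards, weights, eps=1e-8):
    """Normalize each objective separately, then mix with fixed, hand-set weights."""
    rewards = np.asarray(rewards, dtype=float)
    per_objective = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + eps)
    return per_objective @ np.asarray(weights, dtype=float)
```

In the second function the `weights` vector is exactly the hand-tuned hyperparameter the note objects to: it stays fixed regardless of which objectives the current group can still distinguish.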

DVAO's claim is that the right weighting signal is already sitting in the data: the empirical reward variance of each objective within a rollout group. High variance means the group's responses spread out on that objective, so there is a gradient to learn from. Low variance means the objective is either saturated or carries no usable signal for this group, so its contribution should shrink. Weighting by within-group variance therefore up-weights objectives carrying a real learning signal and down-weights the rest, automatically and without tuned constants.
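A rough sketch of the idea, under stated assumptions: the note does not give DVAO's exact formula, so the weighting below (each objective's within-group variance divided by the sum of variances) is an illustrative stand-in, and the function name and `(G, K)` layout are hypothetical. It also assumes the objectives are on comparable scales, since raw variance depends on reward units.

```python
import numpy as np

def variance_weighted_advantage(rewards, eps=1e-8):
    """Mix per-objective advantages with weights proportional to within-group variance.

    rewards: array of shape (G, K), G responses in one rollout group, K objectives.
    Returns (advantages of shape (G,), weights of shape (K,)).
    NOTE: the normalization of the weights is an assumption, not the paper's formula.
    """
    rewards = np.asarray(rewards, dtype=float)
    mean = rewards.mean(axis=0)
    var = rewards.var(axis=0)                      # empirical variance per objective
    per_objective = (rewards - mean) / (np.sqrt(var) + eps)
    weights = var / (var.sum() + eps)              # zero-variance objectives get weight 0
    return per_objective @ weights, weights

# Example group: accuracy varies, length varies a little, format is saturated.
group = np.array([[1.0, 0.2, 1.0],
                  [0.0, 0.8, 1.0],
                  [1.0, 0.5, 1.0],
                  [0.0, 0.1, 1.0]])
advantages, weights = variance_weighted_advantage(group)
```

Because the weights in this sketch are non-negative and sum to at most one, the mixed advantage cannot exceed the largest per-objective normalized advantage in magnitude. That is a property of this particular stand-in, not a restatement of the paper's result.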

Why it matters: this reframes multi-objective scalarization as an estimation problem rather than a preference-setting problem. You are no longer asking "how much do I value format versus accuracy?" but "which objective currently has signal worth following?" The paper proves the scheme keeps advantage magnitudes bounded (the stability win) and folds in a cross-objective regularizer so each objective's gradient is modulated by the rollout's overall multi-objective performance. The counterpoint is that variance can be a misleading proxy: a noisy reward model also produces high variance, so the method presumes reward signals are clean enough that spread tracks learnability.

Related concepts in this collection (4)

Can two simple techniques match complex RL algorithms?
Does vanilla PPO with minimal modifications rival more sophisticated reasoning algorithms like GRPO and DAPO? This explores whether algorithmic complexity is necessary for effective LLM reasoning training.
both operate on the advantage-estimation machinery; DVAO adds a variance-adaptive weighting layer on top of the normalization tricks

Can full episode rewards per step enable better credit assignment?
Can attributing cumulative episode reward to every step in a trajectory, rather than discounting by step distance, actually solve credit assignment in sequential LLM decision-making? This challenges intuitive RL assumptions about how credit should flow backward through time.
complementary axis: that note handles credit across time, DVAO handles credit across competing objectives

Why does agent efficiency differ from model size reduction?
Explores why making models smaller doesn't solve agent cost problems. Agents loop recursively, compounding costs multiplicatively, so efficiency requires system-level design, not just parameter reduction.
both frame training as Pareto-frontier navigation rather than single-objective maximization

Can reward vectors be the hidden source of solution diversity?
Standard RL collapses multi-dimensional rewards into scalars before training, losing the natural structure that could drive diverse specialization. What if that vector structure itself is the diversity axis?
contrasts: both keep rewards multi-dimensional rather than scalarizing, but DVAO collapses objectives into a variance-weighted advantage while vector rewards preserve the per-dimension structure to drive diversity

Original note title

multi-reward grpo should weight each objective by its empirical reward variance
