How should multiple reward objectives be weighted during training?
When training on multiple objectives at once, how can we automatically balance their contributions without manual tuning? This explores whether reward variance within rollouts reveals which objectives carry real learning signal.
When a single GRPO run optimizes several rewards at once (accuracy plus length plus format, say), the two standard approaches both fail. Reward Combination sums the rewards before computing the advantage, which lets squared advantage magnitudes explode and destabilizes training. Advantage Combination computes an advantage per objective and then mixes them, but it relies on fixed hyperparameter weights and treats the objectives as independent, ignoring how they correlate within a rollout.
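To make the difference between the two baselines concrete, here is a minimal sketch of where each one does its combining. It assumes the usual GRPO-style group-relative advantage (z-score within one rollout group). The reward values, the `fixed_weights` numbers, and all function names are hypothetical, chosen only for illustration. The sketch shows the two pipelines and does not attempt to reproduce the instability described above.

```python
from statistics import mean, pstdev

EPS = 1e-8  # assumed small constant to avoid division by zero


def group_advantage(rewards):
    """GRPO-style group-relative advantage: z-score within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + EPS) for r in rewards]


def reward_combination(rewards_by_objective):
    """Sum rewards across objectives first, then compute a single advantage."""
    summed = [sum(per_rollout) for per_rollout in zip(*rewards_by_objective.values())]
    return group_advantage(summed)


def advantage_combination(rewards_by_objective, weights):
    """Compute one advantage per objective, then mix them with fixed weights."""
    per_objective = {k: group_advantage(v) for k, v in rewards_by_objective.items()}
    group_size = len(next(iter(rewards_by_objective.values())))
    return [
        sum(weights[k] * per_objective[k][i] for k in per_objective)
        for i in range(group_size)
    ]


# Hypothetical rewards for one group of four rollouts.
group = {
    "accuracy": [1.0, 0.0, 1.0, 0.0],
    "length": [0.2, 0.4, 0.6, 0.8],
    "format": [1.0, 1.0, 1.0, 1.0],  # saturated: every rollout passes
}
# Hypothetical hand-tuned weights, the thing the note argues should not be needed.
fixed_weights = {"accuracy": 0.6, "length": 0.2, "format": 0.2}

adv_sum = reward_combination(group)
adv_mix = advantage_combination(group, fixed_weights)
```

In the second pipeline the saturated `format` objective contributes a zero advantage to every rollout, yet it still holds a fixed share of the weight budget regardless of what the group looks like.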
DVAO's claim is that the right weighting signal is already sitting in the data: the empirical reward variance of each objective within a rollout group. High variance means the group's responses spread out on that objective, so there is a gradient to learn from. Low variance means the objective is either saturated or otherwise uninformative for this group, so its contribution should shrink. Weighting by within-group variance therefore up-weights objectives carrying a real learning signal and down-weights the rest, automatically and without tuned constants.
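The idea in this paragraph can be sketched as follows. This is an illustration of variance-proportional weighting under stated assumptions, not the paper's exact formula: the normalisation (weights proportional to variance and summing to one), the zero-weight fallback, and every name and number here are assumptions. It also assumes the rewards share a comparable scale, since raw variance is scale-dependent.

```python
from statistics import mean, pstdev, pvariance

EPS = 1e-8  # assumed small constant to avoid division by zero


def group_advantage(rewards):
    """Group-relative advantage: z-score within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + EPS) for r in rewards]


def variance_weights(rewards_by_objective):
    """Weights proportional to within-group reward variance (assumed normalisation)."""
    variances = {k: pvariance(v) for k, v in rewards_by_objective.items()}
    total = sum(variances.values())
    if total <= EPS:
        # No objective separates the rollouts, so there is nothing to weight.
        return {k: 0.0 for k in variances}
    return {k: var / total for k, var in variances.items()}


def variance_weighted_advantage(rewards_by_objective):
    """Mix per-objective advantages using the variance-derived weights."""
    weights = variance_weights(rewards_by_objective)
    per_objective = {k: group_advantage(v) for k, v in rewards_by_objective.items()}
    group_size = len(next(iter(rewards_by_objective.values())))
    return [
        sum(weights[k] * per_objective[k][i] for k in per_objective)
        for i in range(group_size)
    ]


# Hypothetical rewards for one group of four rollouts, all scaled to [0, 1].
group = {
    "accuracy": [1.0, 0.0, 1.0, 0.0],  # responses disagree: large spread
    "length": [0.5, 0.5, 0.6, 0.4],    # responses nearly agree: small spread
    "format": [1.0, 1.0, 1.0, 1.0],    # saturated: no spread at all
}

weights = variance_weights(group)
advantages = variance_weighted_advantage(group)
```

With these numbers almost all of the weight lands on `accuracy`, a small remainder goes to `length`, and `format` receives exactly zero, with no hand-set constants involved.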
Why it matters: this reframes multi-objective scalarization as an estimation problem rather than a preference-setting problem. You are no longer asking "how much do I value format versus accuracy?" but "which objective currently has signal worth following?" The paper proves the scheme keeps advantage magnitudes bounded (the stability win) and folds in a cross-objective regularizer so each objective's gradient is modulated by the rollout's overall multi-objective performance. The counterpoint is that variance can be a misleading proxy — a noisy reward model also produces high variance — so the method presumes reward signals are clean enough that spread tracks learnability.
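The counterpoint about noisy rewards can be illustrated with a small sketch. It assumes the same variance-proportional weighting as a stand-in for the method, and the noise values are hypothetical, fixed by hand so the example is deterministic. It says nothing about how the paper's regularizer would respond.

```python
from statistics import pvariance


def variance_share(rewards_by_objective):
    """Share of total within-group variance held by each objective (assumed weighting)."""
    variances = {k: pvariance(v) for k, v in rewards_by_objective.items()}
    total = sum(variances.values())
    return {k: (var / total if total > 0 else 0.0) for k, var in variances.items()}


# Hypothetical group: accuracy separates the rollouts, format is saturated.
clean = {
    "accuracy": [1.0, 0.0, 1.0, 0.0],
    "format": [1.0, 1.0, 1.0, 1.0],
}

# Hypothetical zero-mean scoring noise from an unreliable format reward model.
noise = [0.5, -0.5, -0.5, 0.5]
noisy = {
    "accuracy": clean["accuracy"],
    "format": [r + n for r, n in zip(clean["format"], noise)],
}

clean_share = variance_share(clean)
noisy_share = variance_share(noisy)
```

Nothing about the responses changed between the two cases, yet the `format` objective goes from zero weight to half of the total purely because its scores became noisier. That is the sense in which spread only tracks learnability when the reward signal is clean.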
Inquiring lines that use this note as a source 15
This note is a source for these synthesized inquiries.
- How do production recommenders already combine multiple objectives in practice?
- Can reward model training be automated without changing feedback mechanisms?
- Why does multi-objective ranking make the political dimensions of weight choices more visible?
- What makes Effective Rank Acceleration a stable training signal for dual-channel incentives?
- How do composite rewards attribute curation outcomes to specific skill library changes?
- How do you prevent stale reward signals when skills evolve during deployment?
- Can dynamic variance weighting replace fixed objective combination weights?
- Why does scalarization of rewards fail for multi-objective GRPO training?
- How does credit assignment across objectives differ from credit assignment across time?
- Why does group-relative normalization make uniform episode rewards work across rollouts?
- How should multi-objective post-training balance competing behavioral goals?
- Can the same variance signal work as both reward and query filter?
- How does DVAO balance reward components differently than VPO spreads them?
- How can developers balance multiple conflicting fairness goals simultaneously?
- What makes advantage shaping more stable than reward shaping for tool training?
Related concepts in this collection 4
- Can two simple techniques match complex RL algorithms?
  Does vanilla PPO with minimal modifications rival more sophisticated reasoning algorithms like GRPO and DAPO? This explores whether algorithmic complexity is necessary for effective LLM reasoning training.
  both operate on the advantage-estimation machinery; DVAO adds a variance-adaptive weighting layer on top of the normalization tricks
- Can full episode rewards per step enable better credit assignment?
  Can attributing cumulative episode reward to every step in a trajectory, rather than discounting by step distance, actually solve credit assignment in sequential LLM decision-making? This challenges intuitive RL assumptions about how credit should flow backward through time.
  complementary axis: that note handles credit across time, DVAO handles credit across competing objectives
- Why does agent efficiency differ from model size reduction?
  Explores why making models smaller doesn't solve agent cost problems. Agents loop recursively, compounding costs multiplicatively, so efficiency requires system-level design, not just parameter reduction.
  both frame training as Pareto-frontier navigation rather than single-objective maximization
- Can reward vectors be the hidden source of solution diversity?
  Standard RL collapses multi-dimensional rewards into scalars before training, losing the natural structure that could drive diverse specialization. What if that vector structure itself is the diversity axis?
  contrasts: both keep rewards multi-dimensional rather than scalarizing, but DVAO collapses objectives into a variance-weighted advantage while vector rewards preserve the per-dimension structure to drive diversity
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning
- Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning
- Surveying the Effects of Quality, Diversity, and Complexity in Synthetic Data From Large Language Models
- SimPO: Simple Preference Optimization with a Reference-Free Reward
- On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting
- A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?
- Misaligned by Design: Incentive Failures in Machine Learning
- Improving Reinforcement Learning from Human Feedback Using Contrastive Rewards
Original note title
multi-reward grpo should weight each objective by its empirical reward variance