How do loss functions simultaneously shape both learning and decision quality?
This explores a tension the corpus keeps surfacing: the loss or reward function does double duty. It sculpts what a model internally learns (its representations) at the same time as it tunes how the model commits to answers (its decisions), and these two jobs can quietly pull in opposite directions. The sharpest statement of this is the finding that weighting the training loss by the *value* of each decision backfires (see "Can utility-weighted training loss actually harm model performance?"). An asymmetric, utility-weighted loss does push the model toward the right choices, but it shrinks the gradient signal the model needs to learn good features in the first place. The counterintuitive fix is to separate the two jobs: train with a plain symmetric loss so learning stays rich, then bend the predictions toward the utility goal afterward, post hoc. On the same utility objective, that two-stage recipe outperforms direct utility-weighted training. The loss that is best for learning is not the loss that is best for deciding.
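The post-hoc step can be made concrete with a small sketch. It assumes a binary classifier already trained with a symmetric loss and emitting probabilities, and it assumes the utility goal is expressed as a 2×2 payoff table; the function names and the threshold-shifting approach are illustrative choices, not the specific adjustment used in the cited note.

```python
def utility_threshold(u_tp: float, u_fp: float, u_fn: float, u_tn: float) -> float:
    """Probability above which predicting 'positive' has higher expected utility.

    Predict positive when  p*u_tp + (1-p)*u_fp  >=  p*u_fn + (1-p)*u_tn,
    which rearranges to    p >= (u_tn - u_fp) / ((u_tn - u_fp) + (u_tp - u_fn)).
    """
    gain_if_positive = u_tp - u_fn  # what acting buys when the label is positive
    cost_if_negative = u_tn - u_fp  # what acting costs when the label is negative
    denom = gain_if_positive + cost_if_negative
    if denom <= 0:
        # Assumption: correct decisions are worth more than incorrect ones.
        raise ValueError("payoffs must reward correct decisions overall")
    return cost_if_negative / denom


def decide(probs, threshold):
    """Turn symmetric-loss probabilities into utility-aware decisions."""
    return [int(p >= threshold) for p in probs]


# Hypothetical payoffs: a missed positive is nine times worse than a false alarm.
threshold = utility_threshold(u_tp=1.0, u_fp=-1.0, u_fn=-9.0, u_tn=1.0)
decisions = decide([0.1, 0.2, 0.6], threshold)
```

The training loss never sees the payoff table here. The asymmetry enters only at decision time, which is the separation the paragraph above describes.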
Calibration tells the same story from another angle. Reward a model only for being right or wrong, a binary signal, and it learns to guess with high confidence, because nothing in the reward penalizes a confident wrong answer (see "Does binary reward training hurt model calibration?"). The decision (the final answer) improves while the model's sense of *how sure it should be* rots. Adding a proper scoring rule such as the Brier score as a second reward term rejoins the two: accuracy and calibration are optimized together rather than traded against each other. The lesson generalizes: when a single objective can't carry both burdens, you either add a term or split the pipeline.
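A minimal sketch of the two reward shapes follows. It assumes the model reports a confidence in [0, 1] alongside its answer and that the Brier term is subtracted with weight 1; both the weighting and the function names are assumptions for illustration, not details taken from the cited note.

```python
def binary_reward(correct: bool) -> float:
    """Correctness-only reward: stated confidence never enters."""
    return 1.0 if correct else 0.0


def brier_augmented_reward(correct: bool, confidence: float) -> float:
    """Correctness reward minus the Brier penalty (confidence - outcome)^2."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


def expected_reward(p_correct: float, confidence: float) -> float:
    """Expected augmented reward when the answer is right with probability p_correct."""
    return (p_correct * brier_augmented_reward(True, confidence)
            + (1.0 - p_correct) * brier_augmented_reward(False, confidence))


def best_confidence(p_correct: float, steps: int = 100) -> float:
    """Grid-search the stated confidence that maximizes expected reward."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: expected_reward(p_correct, q))
```

Under the binary reward, a wrong answer earns the same 0 whether the model claimed 50% or 95% confidence. With the Brier term, the expected reward for a fixed answer is maximized by reporting the true probability of being right, so `best_confidence(0.7)` lands on 0.7 rather than on 1.0.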
What a loss reinforces also depends on what the domain rewards, so the *same* tuning recipe shapes decisions in opposite directions across tasks. Preference tuning collapses diversity in code generation, where the right answer is convergent, but *increases* it in creative writing, where distinctiveness is the thing being rewarded (see "Does preference tuning always reduce diversity the same way?"). The loss didn't change; the decision landscape it was shaping did. Push the difficulty too far and the shaping turns destructive: training on near-impossible problems teaches models degenerate shortcuts such as answer repetition and skipped computation, and these shortcuts then contaminate capabilities the models already had, because group-relative reward normalization treats rare accidental successes as high-advantage trajectories (see "Do overly hard RLVR samples actually harm model capabilities?").
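The normalization effect in the last sentence is easy to see numerically. The sketch below assumes a common form of group-relative normalization, in which each rollout's reward is standardized against the mean and population standard deviation of its group; the exact formula used in the cited work is an assumption here.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each rollout's reward against its own group (assumed formula)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Near-impossible prompt: 1 lucky success in 16 rollouts.
hard_group = [1.0] + [0.0] * 15
# Moderately hard prompt: 8 successes in 16 rollouts.
medium_group = [1.0] * 8 + [0.0] * 8

lucky_advantage = group_relative_advantages(hard_group)[0]
typical_advantage = group_relative_advantages(medium_group)[0]
```

With these groups, the single lucky success receives an advantage of about √15 ≈ 3.87, while each success in the balanced group receives about 1. The rarer the success, the harder the update pushes toward whatever that one trajectory happened to do, shortcut or not.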
The most elegant resolution in the corpus is to make one signal honestly serve both roles at once. Cross-rollout variance gets reused at two levels simultaneously: token-level weighting to shape the dense reward (learning) and query-level filtering to throw out degenerate comparisons (decision quality of *what to train on*). That dual use buys 2–3× faster, more stable training (see "Can one statistical measure serve dual purposes in RL training?"). A related move is to stop processing all outcomes uniformly: treat successes as concrete demonstrations and failures as abstracted lessons, so the learning signal extracted from each episode is shaped by its kind (see "Should successful and failed episodes be processed differently?").
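As a toy illustration of reusing one statistic at two levels, the sketch below computes per-token variance across rollouts once, then uses it both to weight tokens and to filter queries. It is not the method from the cited note: the `scores[rollout][token]` layout, the normalization, and the `min_mean_variance` cutoff are all hypothetical.

```python
import statistics


def token_variances(scores):
    """Variance of each token's score across rollouts; scores[r][t] is rollout r, token t."""
    return [statistics.pvariance(column) for column in zip(*scores)]


def token_weights(variances):
    """Token level: weight tokens by their share of the total cross-rollout variance."""
    total = sum(variances)
    if total == 0:
        return [0.0] * len(variances)  # no token distinguishes the rollouts
    return [v / total for v in variances]


def keep_query(variances, min_mean_variance=1e-3):
    """Query level: drop queries whose rollouts are nearly indistinguishable."""
    return statistics.fmean(variances) >= min_mean_variance


# Two rollouts, two tokens: only the second token separates the rollouts.
informative = token_variances([[0.9, 0.2], [0.9, 0.8]])
# Two rollouts that score identically everywhere.
degenerate = token_variances([[0.5, 0.5], [0.5, 0.5]])
```

In this toy case the informative query puts all of its weight on the second token and passes the filter, while the degenerate query is discarded before it can contribute a comparison with no signal in it.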
The thread tying these together is worth carrying away: a loss function is never just "the thing you minimize." It is simultaneously a teacher (shaping internal representations) and a referee (shaping which answers get committed to), and the corpus's recurring discovery is that optimizing hard for the referee's job can sabotage the teacher's — which is why so many of these papers end up either adding a second term, splitting learning from decision into two stages, or finding a signal honest enough to do both at once.
Sources (6 notes)
Asymmetric loss functions correctly incentivize choosing but degrade representation learning by reducing gradient signals for substantive feature acquisition. Training with symmetric loss then adjusting predictions post-hoc outperforms direct utility-weighted training on the same utility objective.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
DRO reuses a single self-supervised statistic at two aggregation levels: token-level weighting in dense rewards and query-level filtering to discard degenerate comparisons. This dual use achieves 2–3× faster training with better stability on unverifiable tasks.
SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.