Why do sparse outcome rewards fail to credit correct tool calls in failed trajectories?
This explores a credit-assignment problem in agent training: when an agent's whole trajectory ends in failure and the only reward is a single pass/fail signal at the end, the good individual moves along the way — like correctly invoked tool calls — get punished along with everything else.
A single end-of-task reward cannot tell a correct tool call from a wrong one when the overall attempt fails; the good steps get buried under one bad verdict. The corpus frames this as a structural limitation of sparse outcome rewards: one scalar at the end of a long trajectory has no way to point back at which of the dozens of intermediate actions deserved credit. When the trajectory fails, that scalar is negative, so every step, including the tool calls that actually worked, inherits the blame. The reward signal simply lacks the resolution to be more specific.
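The failure mode can be made concrete with a minimal Python sketch. Everything named here is an assumption for illustration: the `Step` record, its `tool_call_correct` flag (a ground-truth label the trainer would not normally see), and the +1 / -1 terminal values are hypothetical and not taken from any cited method.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    action: str
    tool_call_correct: bool  # hypothetical label; the trainer never sees it


def outcome_only_credit(steps: List[Step], success: bool) -> List[float]:
    """Broadcast one terminal scalar to every step (assumed: +1 pass, -1 fail)."""
    terminal = 1.0 if success else -1.0
    return [terminal] * len(steps)


# A failed run whose first two tool calls were actually correct.
failed_run = [
    Step("search_docs(query)", tool_call_correct=True),
    Step("read_file(path)", tool_call_correct=True),
    Step("submit(wrong_answer)", tool_call_correct=False),
]
credit = outcome_only_credit(failed_run, success=False)

for step, c in zip(failed_run, credit):
    print(f"{step.action:<22} correct={step.tool_call_correct!s:<5} credit={c:+.1f}")
```

Every step receives the same credit regardless of `tool_call_correct`, which is the resolution problem described above: the signal has one value to distribute and no way to vary it by step.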
The most direct line of attack is to stop relying on the terminal signal alone and instead mine the trajectory's own structure for denser feedback. Several methods do exactly this: they convert sparse outcome rewards into per-step signals by exploiting structural features the trajectory already contains (tree topology, expert-aligned actions, and crucially the positions of tool calls themselves), so a correct call can be credited even inside a losing run (see "Can trajectory structure replace hand-annotated process rewards?"). A complementary route assigns the full episode reward to each step and then normalizes across many rollouts; the group-relative comparison surfaces which action sequences actually drive success, recovering credit that a single endpoint reward would have flattened (see "Can full episode rewards per step enable better credit assignment?").
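The second route can be sketched roughly as follows. This is an illustrative reading, not the implementation of any cited method: the function names, the toy action strings, the 1.0 / 0.0 rewards, and the choice of mean and population standard deviation for normalization are all assumptions.

```python
import statistics
from collections import defaultdict
from typing import Dict, List, Tuple

Rollout = Tuple[List[str], float]  # (actions taken, episode reward)


def step_advantages(group: List[Rollout], eps: float = 1e-8) -> List[List[float]]:
    """Give every step its rollout's episode reward, normalized across the group."""
    rewards = [reward for _, reward in group]
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    out = []
    for actions, reward in group:
        adv = (reward - mean) / (std + eps)
        out.append([adv] * len(actions))  # same advantage for each step in a rollout
    return out


def mean_advantage_per_action(group: List[Rollout]) -> Dict[str, float]:
    """Average the per-step advantages over every occurrence of each action."""
    seen = defaultdict(list)
    for (actions, _), advs in zip(group, step_advantages(group)):
        for action, adv in zip(actions, advs):
            seen[action].append(adv)
    return {action: statistics.fmean(vals) for action, vals in seen.items()}


# Toy group: "search" is used in winning and losing rollouts alike;
# only the parse step differs between them.
group = [
    (["search", "parse_ok", "answer"], 1.0),
    (["search", "parse_bad", "answer"], 0.0),
    (["search", "parse_ok", "answer"], 1.0),
    (["search", "parse_bad", "answer"], 0.0),
]
per_action = mean_advantage_per_action(group)
print(per_action)
```

Within any single losing rollout the `search` call still carries a negative advantage. The recovery happens in aggregate: because the same call also appears in winning rollouts, its contributions roughly cancel, while `parse_bad` stays negative and `parse_ok` stays positive.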
The deeper insight running through the collection is that failed trajectories are not noise to be discarded; they carry signal the reward scheme throws away. One thread argues that success and failure should be processed asymmetrically: keep clean positive trajectories as demonstrations, but preserve diverse failures specifically as negative signal rather than deleting them (see "Why do correct code trajectories teach models to tolerate errors?"), a stance echoed by work that treats successes as concrete demonstrations and failures as abstracted lessons (see "Should successful and failed episodes be processed differently?"). Process reward models that are aware of trajectory shape go further, treating failed and backtracked steps as informative exploration rather than uniform errors (see "Why do standard process reward models fail on thinking traces?"). The common thread: a binary win/lose label erases the within-trajectory texture that tells you a tool call was right even though the plan around it wasn't.
There's also a more fundamental claim about what scalar rewards can and cannot encode. Agent feedback decomposes into two orthogonal kinds of information, evaluative ('how well did this do') and directive ('how should it change'), and a single number captures only the first while discarding the second (see "Can scalar rewards capture all the information in agent feedback?"). That's why a failed trajectory's reward can't whisper 'the tool call was fine, the reasoning around it wasn't.' The same gap is what lets natural-language critiques break plateaus that numerical rewards can't: the words carry information about *why* a run failed that the scalar mathematically cannot (see "Can natural language feedback overcome numerical reward plateaus?").
Worth knowing if you came in only thinking about tool calls: this credit-assignment failure has a cousin in calibration. Binary correctness rewards don't just misattribute credit across steps; they actively incentivize confident wrong answers, because nothing penalizes high-confidence failure (see "Does binary reward training hurt model calibration?"). And rubric-based methods suggest a cleaner division of labor: use coarse signals as gates that accept or reject whole rollout groups, then let finer rewards optimize within the valid ones, rather than forcing one signal to do both jobs (see "Can rubrics and dense rewards work together without hacking?"). The pattern across all of these: the fix is rarely a better single number; it's giving the trajectory more places to attach signal.
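Both ideas in this paragraph can be sketched in a few lines of Python. The specifics are assumptions made for illustration: combining correctness and the Brier penalty as `y - (confidence - y)**2` is one plausible form of "Brier score as a second reward term", and `gate_then_normalize` is only one reading of "accept or reject whole rollout groups"; neither is claimed to match the cited methods exactly.

```python
import statistics
from typing import List, Optional, Sequence


def calibrated_reward(correct: bool, confidence: float) -> float:
    """Correctness reward minus a Brier penalty on the stated confidence (assumed form)."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    y = 1.0 if correct else 0.0
    brier = (confidence - y) ** 2
    return y - brier


def gate_then_normalize(
    dense_rewards: Sequence[float], rubric_ok: bool, eps: float = 1e-8
) -> Optional[List[float]]:
    """Hypothetical gate: drop a rejected group, else normalize rewards within it."""
    if not rubric_ok:
        return None  # the coarse rubric only decides whether the group is used
    mean = statistics.fmean(dense_rewards)
    std = statistics.pstdev(dense_rewards)
    return [(r - mean) / (std + eps) for r in dense_rewards]


# A plain binary reward scores both wrong answers 0; here the confident one is worse.
print(calibrated_reward(correct=False, confidence=1.0))
print(calibrated_reward(correct=False, confidence=0.0))

# The rubric verdict never enters the numeric reward; it only filters groups.
print(gate_then_normalize([0.2, 0.8], rubric_ok=False))
print(gate_then_normalize([0.2, 0.8], rubric_ok=True))
```

The design point in both functions is the same as in the text: the coarse correctness or rubric signal is kept separate from the finer term rather than being folded into one number that has to do both jobs.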
Sources (9 notes)
Tree-GRPO, Supervised RL, and ToolPO each convert sparse outcome rewards into dense step signals by exploiting different structural features—tree topology, expert-aligned actions, and tool-call positions—eliminating the need for annotated process reward models.
MS-GRPO assigns cumulative episode reward to each step, and group-relative normalization across rollouts surfaces which action sequences succeed. A 3B model post-trained this way outperforms 72B baselines by 50%, showing the training method matters more than scale for multi-step tasks.
GRPO-RoC filters positive trajectories for quality while preserving diverse failures as negative signal, allowing a 14B model to reach frontier math performance in 510 RL steps, surpassing much larger models with cleaner reasoning.
SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.
Standard PRMs degrade on trajectory-format inputs because thinking traces branch, backtrack, and are less coherent than polished responses. ReasonFlux-PRM addresses this by supervising both trajectories and responses, treating failed steps as informative exploration rather than errors.
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.