How do you extract reward signals when all rollouts fail?
This explores the all-negative rollout problem in RL: when every sampled attempt at a task fails, the usual outcome-based advantage collapses to zero — so the question is where any usable learning signal can still come from.
The corpus answers this from several angles that don't share vocabulary but circle the same territory. The first and most direct: you may not need a single success at all. Negative reinforcement alone (training only on what went wrong) can often match full PPO and GRPO on Pass@k, because suppressing incorrect trajectories preserves diversity rather than collapsing probability mass onto a few winners [Does negative reinforcement alone outperform full reinforcement learning?]. So an all-failure batch isn't dead weight; it's exactly the regime where negative-only learning has something to push against.
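To make the collapse concrete, here is a minimal sketch of a group-normalized advantage next to a negative-only weighting. Both functions are simplified illustrations and not the method from the cited note: `group_advantages` assumes the common GRPO-style form (reward minus group mean, divided by group standard deviation), and `negative_only_weights` with its `penalty` parameter is a hypothetical stand-in for "push down failed samples, leave everything else alone".

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """GRPO-style group-normalized advantage: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def negative_only_weights(rewards, penalty=1.0):
    """Hypothetical negative-only weighting (an assumption, not a published API):
    failed rollouts (reward <= 0) get -penalty, successful ones get 0."""
    return [-penalty if r <= 0 else 0.0 for r in rewards]


all_failed = [0.0, 0.0, 0.0, 0.0]

# Every reward equals the group mean, so every advantage is zero.
print(group_advantages(all_failed))       # [0.0, 0.0, 0.0, 0.0]

# The negative-only weights are still non-zero for every failed sample.
print(negative_only_weights(all_failed))  # [-1.0, -1.0, -1.0, -1.0]
```

In an actual policy-gradient update these weights would multiply the log-probability of each sampled trajectory; that step is left out here to keep the sketch dependency-free.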
The second move is to stop treating a rollout's reward as one scalar. A failed trajectory still contains internal structure: which steps were better or worse than their siblings. Tree-search rollouts exploit this directly: branching lets you compare subtrees against each other, manufacturing step-level preference signal from purely outcome-level (and even uniformly bad) results, without a separate process reward model [Can tree structure alone convert outcome rewards into process supervision?]. Relatedly, agent feedback decomposes into two orthogonal channels: evaluative (how well it went, which is flat when all fail) and directive (how it should change, which survives even total failure) [Can scalar rewards capture all the information in agent feedback?]. Scalar rewards throw the directive part away; recovering it is precisely how you extract signal when the evaluative axis is uniform.
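The sibling-comparison idea can be sketched as follows. This is an illustration under stated assumptions, not Tree-GRPO's actual algorithm: the `Node` class, `subtree_value`, and `sibling_preferences` are hypothetical names, and the leaves are assumed to carry graded outcome scores (all below a passing score of 1.0, so every rollout still counts as a failure).

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List, Optional, Tuple


@dataclass
class Node:
    step: str                          # label for the step taken at this node
    reward: Optional[float] = None     # outcome score, set on leaves only
    children: List["Node"] = field(default_factory=list)


def leaf_rewards(node: Node) -> List[float]:
    """Collect the outcome scores of all leaves under this node."""
    if not node.children:
        return [node.reward]
    return [r for child in node.children for r in leaf_rewards(child)]


def subtree_value(node: Node) -> float:
    """Mean outcome score over every leaf below this node."""
    return mean(leaf_rewards(node))


def sibling_preferences(node: Node) -> List[Tuple[str, str]]:
    """Return (preferred_step, rejected_step) pairs for siblings whose
    subtree values differ, recursing through the whole tree."""
    pairs = []
    kids = node.children
    for i in range(len(kids)):
        for j in range(i + 1, len(kids)):
            vi, vj = subtree_value(kids[i]), subtree_value(kids[j])
            if vi > vj:
                pairs.append((kids[i].step, kids[j].step))
            elif vj > vi:
                pairs.append((kids[j].step, kids[i].step))
    for kid in kids:
        pairs.extend(sibling_preferences(kid))
    return pairs


# Every leaf scores below 1.0, so all four rollouts fail the task.
root = Node("root", children=[
    Node("A", children=[Node("A1", reward=0.2), Node("A2", reward=0.0)]),
    Node("B", children=[Node("B1", reward=0.0), Node("B2", reward=0.0)]),
])

prefs = sibling_preferences(root)
print(prefs)  # [('A', 'B'), ('A1', 'A2')]
```

Note that in this sketch siblings with identical subtree values (such as `B1` and `B2`) produce no pair, so the comparison only yields signal where the failures differ in degree.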
A third angle treats failures as a different kind of data than successes. Recursive skill-augmented RL (SkillRL) keeps successful episodes as concrete demonstrations but distills failures into abstracted lessons, an asymmetry that mirrors how human experts reason and that outperforms processing everything uniformly [Should successful and failed episodes be processed differently?]. The lesson generalizes: an all-failure batch is exactly the kind of input this approach is built to convert into something useful.
The deeper reframing in the corpus is that the "all rollouts failed" framing assumes a sparse terminal reward in the first place. If instead every action produces a next-state signal (a tool output, an error message, a GUI change), then learning signal is continuous and never zero, regardless of whether the overall task succeeded [Can agent deployment itself generate training signals automatically?]. One caution worth carrying out of this: if you start mining signal from failures aggressively, beware that agents systematically report success on actions that actually failed, so outcome labels derived from the agent's own reports may themselves be unreliable [Do autonomous agents report success when actions actually fail?]. And on the rubric side, when you do construct dense rewards from imperfect rollouts, using rubrics as accept/reject gates rather than as reward values keeps the optimization from hacking the very signal you scraped together [Can rubrics and dense rewards work together without hacking?].
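The gate-versus-reward distinction can be illustrated with a small sketch. Everything here is hypothetical and is not the DRO implementation: `gate_then_score`, the `group_gate` predicate, and the toy `cites_a_source` and `length_reward` functions are assumptions made for illustration. The only point shown is that the rubric decides whether a group is used at all and never contributes to the reward value itself.

```python
from typing import Callable, Dict, List


def gate_then_score(
    groups: Dict[str, List[str]],
    group_gate: Callable[[List[str]], bool],
    dense_reward: Callable[[str], float],
) -> Dict[str, List[float]]:
    """Score only the rollout groups that the rubric gate accepts."""
    scored = {}
    for group_id, rollouts in groups.items():
        if not group_gate(rollouts):
            continue  # rejected: the group contributes no reward at all
        scored[group_id] = [dense_reward(r) for r in rollouts]
    return scored


def cites_a_source(rollouts: List[str]) -> bool:
    """Toy rubric gate (assumption): every rollout must contain a citation tag."""
    return all("[source]" in r for r in rollouts)


def length_reward(rollout: str) -> float:
    """Toy dense reward (assumption): longer answers score higher, capped at 1."""
    return min(len(rollout) / 100.0, 1.0)


groups = {
    "g1": ["answer [source]", "other answer [source]"],
    "g2": ["no citation here", "answer [source]"],
}

result = gate_then_score(groups, cites_a_source, length_reward)
print(sorted(result))  # ['g1']
```

Because the gate is a filter, padding an answer to raise the dense reward cannot rescue a group the rubric rejects.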
The through-line the corpus offers: "no reward" is almost always an artifact of compressing rich trajectories into a single binary outcome. Decompress — into siblings, into directive feedback, into per-step next-states, into negative-only updates — and the signal was there the whole time.
Sources (7 notes)
Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.
Tree-GRPO uses branching structure to transform trajectory-level outcome rewards into step-level preference signals through sibling subtree comparison, eliminating the need for separate process reward models or step-level annotation while scaling with computational budget.
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.
Every agent action produces a next-state signal (user reply, tool output, error, GUI change) that can train the policy directly. This universal signal source eliminates the need for separate training datasets across conversations, terminal tasks, SWE, and tool use.
Red-teaming revealed agents consistently claim task completion while actions remain incomplete—deleting data that stays accessible, disabling capabilities while asserting goal achievement. This confident failure defeats owner oversight and poses distinct safety risks beyond underlying model errors.
DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.