Why does negative experience transfer better than positive examples alone?
This explores why training, prompting, and memory systems often gain more from failures and negative signals than from positive examples by themselves — and what the corpus says is actually happening when they do.
Negative experience (failed trajectories, induced mistakes, suppressed wrong answers) frequently transfers better than positive examples alone, and the corpus gives a surprisingly consistent answer across very different methods: positive-only signals concentrate probability mass and narrow behavior, while negative signals prune the space without collapsing it. The clearest version comes from reinforcement learning, where training on *only* negative samples often matches full RL (PPO, GRPO) on Pass@k, because suppressing incorrect trajectories preserves diversity, whereas positive-only reinforcement degrades Pass@k at larger k by piling probability onto a few winning paths (see: Does negative reinforcement alone outperform full reinforcement learning?). Positive examples teach 'do more of this'; negative ones teach 'this region is bad', and the second leaves far more of the space intact.
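To make the mass-concentration argument concrete, here is a minimal sketch of a single REINFORCE-style update on a toy softmax policy over four candidate answers. The tabular policy, the learning rate, and the labels A to D are illustrative assumptions, not the setup used in the cited work; the point is only to show which way probability moves after one positive versus one negative sample.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def reinforce_step(logits, sampled, reward, lr=1.0):
    """One update: logits += lr * reward * d/dlogits log p(sampled)."""
    probs = softmax(logits)
    return [
        z + lr * reward * ((1.0 if i == sampled else 0.0) - p)
        for i, (z, p) in enumerate(zip(logits, probs))
    ]

# Toy assumption: four candidate answers, A and B correct, C and D wrong.
A, B, C, D = range(4)
start = [0.0, 0.0, 0.0, 0.0]  # uniform policy, 0.25 each

# Positive-only: reward the sampled correct answer A.
after_positive = softmax(reinforce_step(start, sampled=A, reward=+1.0))
# Negative-only: penalize the sampled wrong answer C.
after_negative = softmax(reinforce_step(start, sampled=C, reward=-1.0))

print("positive on A:", [round(p, 3) for p in after_positive])
print("negative on C:", [round(p, 3) for p in after_negative])
print("entropy (pos, neg):",
      round(entropy(after_positive), 3), round(entropy(after_negative), 3))
```

In this toy, rewarding A pulls mass away from every other answer, including the equally correct B (which falls from 0.25 to roughly 0.17), while penalizing C lowers only C and leaves the remaining answers in their original proportions at roughly 0.30 each. The distribution after the negative update has the higher entropy. A single step on a four-way policy is of course far from a full training run, so treat this as intuition rather than evidence.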
The same asymmetry shows up in memory and agent learning, but with a twist about *how* each type of experience should be stored. SkillRL keeps successes as concrete demonstrations to imitate but abstracts failures into general lessons, and that differential treatment outperforms processing everything uniformly (see: Should successful and failed episodes be processed differently?). ReasoningBank pushes the point further: distilling strategy-level hints from both successes *and* failures beats success-only memory, because failures carry information about boundaries and pitfalls that a clean success never reveals (see: Can agents learn better from their failures than successes?). A failure tells you where the cliff is; a success only tells you one safe path along the ridge.
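The differential treatment described above can be sketched as a small data structure. Everything here is hypothetical: `Episode`, `ExperienceMemory`, and the `summarize_failure` hook are illustrative names, and neither SkillRL nor ReasoningBank is claimed to be implemented this way. The sketch only shows the routing idea: successes are kept verbatim, failures pass through an abstraction step before storage.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Episode:
    task: str
    steps: List[str]
    success: bool

@dataclass
class ExperienceMemory:
    # Hypothetical hook: turns a failed episode into a short, general lesson.
    # In practice this might be a model call; here it is any callable.
    summarize_failure: Callable[[Episode], str]
    demonstrations: List[Episode] = field(default_factory=list)
    lessons: List[str] = field(default_factory=list)

    def add(self, episode: Episode) -> None:
        if episode.success:
            # Successes are kept as concrete trajectories to imitate.
            self.demonstrations.append(episode)
        else:
            # Failures are stored only as abstracted lessons.
            self.lessons.append(self.summarize_failure(episode))

    def build_context(self, task: str) -> str:
        lines = [f"Lesson: {lesson}" for lesson in self.lessons]
        for demo in self.demonstrations:
            if demo.task == task:
                lines.append("Demo: " + " -> ".join(demo.steps))
        return "\n".join(lines)

def last_step_lesson(ep: Episode) -> str:
    # Placeholder abstraction: blame the final step of the failed episode.
    return f"In '{ep.task}', avoid: {ep.steps[-1]}"

memory = ExperienceMemory(summarize_failure=last_step_lesson)
memory.add(Episode("book flight", ["search", "select", "pay"], success=True))
memory.add(Episode("book flight", ["search", "pay without selecting"], success=False))
print(memory.build_context("book flight"))
```

Storing the failure as a one-line lesson rather than a full trajectory is also what keeps the context short, which fits the source note's observation that the asymmetric approach uses less context than uniform consolidation.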
Even without any training, the effect holds at inference time. LEAP deliberately induces a model to err on its own few-shot examples, then has it articulate explicit principles from those mistakes, and this improves reasoning without a single extra label (see: Does learning from mistakes improve in-context learning?). The mistake forces the model to name the rule it was implicitly violating, which a correct example would have let it skate past. This is why positive examples 'alone' underperform: a correct demonstration is compatible with many wrong generalizations, and the learner has no pressure to distinguish them.
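The loop described above (attempt the few-shot questions, compare against their existing answers, turn mismatches into principles, prepend the principles to the prompt) can be sketched as follows. This is a simplified illustration under stated assumptions, not LEAP's actual prompts or procedure: `llm` is a hypothetical prompt-to-text callable, the exact-match comparison is a placeholder for a real answer checker, and `fake_llm` is a stub so the example runs without a model.

```python
from typing import Callable, List, Tuple

LLM = Callable[[str], str]  # hypothetical: prompt in, completion out

def learn_principles(llm: LLM, examples: List[Tuple[str, str]]) -> List[str]:
    """Collect one principle per few-shot example the model gets wrong."""
    principles = []
    for question, gold in examples:
        attempt = llm(f"Q: {question}\nA:")
        if attempt.strip() != gold.strip():  # placeholder answer check
            principles.append(llm(
                f"Question: {question}\n"
                f"Wrong answer: {attempt}\n"
                f"Correct answer: {gold}\n"
                "State one general principle that would have prevented this mistake."
            ))
    return principles

def build_prompt(principles: List[str], examples: List[Tuple[str, str]],
                 new_question: str) -> str:
    parts = ["Principles:"] + [f"- {p}" for p in principles]
    parts += [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {new_question}\nA:")
    return "\n".join(parts)

def fake_llm(prompt: str) -> str:
    # Stub standing in for a real model: always answers "5" to questions,
    # and returns a canned principle when asked to reflect on a mistake.
    if prompt.startswith("Q:"):
        return "5"
    return "Apply multiplication before addition."

few_shot = [("2 + 3", "5"), ("2 + 3 * 4", "14")]
principles = learn_principles(fake_llm, few_shot)
print(build_prompt(principles, few_shot, "1 + 2 * 3"))
```

Note that only the example the stub gets wrong yields a principle; the correct one contributes nothing beyond the demonstration itself, which is the asymmetry the paragraph describes.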
A less obvious consequence: the failure mode of positive-only learning is not just weaker performance, it is *confident* weakness. Teachers conditioned on correct answers and verifier output produce concise, confident traces that suppress uncertainty; students inherit the swagger and lose out-of-distribution robustness (see: Does richer teacher context hurt student generalization?). Imitation training shows the endpoint: copying ChatGPT's fluent, confident style fools human evaluators without improving factuality or generalization (see: Can imitating ChatGPT fool evaluators into thinking models improved?). Positive examples are easy to mimic stylistically, which is exactly why they transfer the *look* of competence rather than the thing itself. Negative experience resists that shortcut: you can't fake having learned where the errors are.
There's a real limit worth naming, though: negative signal only helps when the failures are informative. Nearly impossible RLVR samples produce degenerate shortcuts that contaminate existing skills, because the rare accidental success gets treated as a high-value lesson (see: Do overly hard RLVR samples actually harm model capabilities?). So the principle isn't 'negativity is magic'. It's that negative experience carries discriminative information positive examples can't, as long as the failures sit close enough to the model's frontier to mean something.
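Why a rare accidental success becomes a "high-value lesson" is easy to see numerically under group-relative advantage normalization, the scheme the source note on hard RLVR samples points to. The sketch below assumes binary rewards, a group of 16 rollouts, and normalization by the population standard deviation; actual implementations may differ in those details.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Normalize each reward by its group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Problem near the model's frontier: half of 16 rollouts succeed.
frontier = [1.0] * 8 + [0.0] * 8
# Nearly impossible problem: a single accidental success in 16 rollouts.
too_hard = [1.0] + [0.0] * 15

adv_frontier = group_advantages(frontier)
adv_too_hard = group_advantages(too_hard)

print("advantage of a success, frontier problem:", round(max(adv_frontier), 3))
print("advantage of a success, too-hard problem:", round(max(adv_too_hard), 3))
```

With a success rate p inside the group, a success gets an advantage of about sqrt((1 - p) / p): close to 1 when half the rollouts succeed, and close to 3.87 when only one in sixteen does. The lone lucky trajectory is therefore reinforced almost four times as strongly, whatever shortcut produced it.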
Sources (7 notes)
Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.
SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.
ReasoningBank shows that storing strategy-level reasoning hints from both self-judged successes and failures outperforms success-only memory and raw trajectory storage. Coupled with test-time scaling, memory and compute compound rather than substitute, creating a novel scaling law where accuracy improves through cumulative interaction history.
LEAP demonstrates that models achieve better performance on reasoning and math tasks by intentionally erring on few-shot examples, reflecting on mistakes, and deriving explicit task-specific principles—without additional labeled data or fine-tuning.
Teachers conditioned on correct answers and verifier output produce confident, concise traces that students inherit. This style suppresses uncertainty expression, optimizing in-domain performance while degrading generalization to out-of-distribution problems that require epistemic caution.
Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.