Why do correct code trajectories teach models to tolerate errors?
Explores why standard outcome-based RL fails for code tool use: when models receive reward for correct final answers despite intermediate code errors, they learn that mistakes are acceptable, producing poor reasoning quality.
When language models learn to use coding tools during RL training, the code environment introduces a specific form of noise that standard outcome-based RL cannot handle. The model inevitably generates syntactically or logically incorrect code during reasoning, which produces error messages and wastes tokens on correction. Under standard GRPO (which uses only outcome rewards), a trajectory with failed intermediate tool calls still receives positive reward if its final answer is correct. The model therefore learns that code errors are acceptable, and it produces lengthy, low-quality reasoning trajectories with unnecessary error-correction loops.
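To make the problem concrete, here is a minimal sketch of a group-normalized, outcome-only advantage of the kind GRPO uses. The group contents, the binary 0/1 reward, and the use of the population standard deviation are illustrative assumptions, not details taken from the rStar2-Agent report. The point is only that the number of tool errors never enters the reward, so a clean success and a messy success receive the same learning signal.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps).

    eps is a small constant (an assumption here) that avoids division
    by zero when every rollout in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group of four rollouts: (tool_errors, answer_correct).
group = [(0, True), (5, True), (2, False), (0, False)]

# Outcome-only reward: 1 for a correct final answer, 0 otherwise.
# tool_errors is deliberately ignored, as in outcome-based RL.
rewards = [1.0 if correct else 0.0 for _, correct in group]
advantages = grpo_advantages(rewards)

for (errors, correct), adv in zip(group, advantages):
    print(f"tool_errors={errors} correct={correct} advantage={adv:+.3f}")
```

The rollout with five tool errors is reinforced exactly as strongly as the one with none, which is the noise the note describes.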
rStar2-Agent (2025) proposes GRPO-RoC (Resample-on-Correct), a rollout strategy that applies asymmetric filtering in three steps:
- Oversample: generate a larger group of rollouts than the standard group size
- Filter positive trajectories: from correct-answer rollouts, retain only those with minimal tool-induced errors or formatting issues (the cleanest successes)
- Downsample negative trajectories uniformly: preserve diverse failure modes as informative negative signal
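The three steps above can be sketched roughly as follows. This is a simplified illustration, not the rStar2-Agent implementation: the `Rollout` fields, the `error_score` heuristic, the hard top-k selection of positives, and the rule for splitting the group between positives and negatives are all assumptions made for the example.

```python
import random
from dataclasses import dataclass

@dataclass
class Rollout:
    correct: bool          # final answer matched the reference
    tool_calls: int        # number of code executions in the trajectory
    tool_errors: int       # how many of those executions failed
    format_ok: bool = True

def error_score(r):
    """Hypothetical quality penalty: tool error ratio plus a format penalty."""
    ratio = r.tool_errors / r.tool_calls if r.tool_calls else 0.0
    return ratio + (0.0 if r.format_ok else 1.0)

def resample_on_correct(rollouts, group_size, rng=None):
    """Reduce an oversampled pool of rollouts to at most group_size."""
    if not rollouts:
        return []
    rng = rng or random.Random()
    positives = [r for r in rollouts if r.correct]
    negatives = [r for r in rollouts if not r.correct]

    # Assumption: keep roughly the positive/negative ratio of the pool.
    n_pos = min(round(group_size * len(positives) / len(rollouts)), len(positives))
    n_neg = min(group_size - n_pos, len(negatives))

    # Positives: keep the cleanest successes (lowest error score).
    kept_pos = sorted(positives, key=error_score)[:n_pos]
    # Negatives: uniform sampling, so varied failure modes survive.
    kept_neg = rng.sample(negatives, n_neg)
    return kept_pos + kept_neg
```

For example, given eight oversampled rollouts (four correct, four incorrect) and `group_size=4`, this sketch keeps the two correct rollouts with the fewest failed tool calls and two incorrect rollouts chosen at random. A softer variant could sample positives with probability inversely related to `error_score` instead of taking a hard top-k.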
The asymmetry is deliberate. Positive trajectories need quality filtering because the model should learn from clean reasoning, not from "stumbled to the right answer despite multiple code crashes." Negative trajectories need diversity preservation because understanding many ways to fail is more informative than understanding one failure mode well.
This connects to "Does step-level confidence outperform global averaging for trace filtering?": both approaches recognize that not all correct trajectories are equally valuable for learning. It also extends "Does RL training follow a predictable two-phase learning sequence?": tool use is a procedural capability that must consolidate (clean tool usage) before strategic reasoning can effectively build on it.
The reported results: a 14B model reaches frontier-level math reasoning in only 510 RL steps within one week on 64 MI300X GPUs, with average pass@1 scores of 80.6% on AIME24 and 69.8% on AIME25, surpassing DeepSeek-R1 (671B) with significantly shorter responses. The training recipe starts with non-reasoning SFT (instruction following, code tool usage, and formatting only, with no reasoning enhancement) to avoid SFT overfitting, then applies multi-stage RL with increasing difficulty and maximum length.
Inquiring lines that use this note as a source 18
This note is a source for these synthesized inquiries.
- How does baseline capability level affect RL improvement ceiling?
- What makes some tasks bounded enough for reliable RL?
- How should learning environments balance error prevention with pedagogical value?
- Can removing failed branches from edited traces improve previous mistakes?
- What makes software engineering environments better suited for RL than other interactive domains?
- How does trajectory filtering handle noise when language models use code execution tools?
- How does training on correct answer form differ mechanistically from training on failure analysis?
- What failure modes do imitation and outcome methods each address?
- Why do sparse outcome rewards fail to credit correct tool calls in failed trajectories?
- Can model training address failures that really originate in harness gaps?
- Why do successful and failed trajectories need different memory processing?
- Can skill validation through testing prevent unreliable programs from accumulating?
- Why does evaluating errors teach more than imitating correct responses?
- Why does step-level expert alignment work when outcome-only RL fails?
- How do failure examples improve distillation compared to successful trajectories alone?
- How does error accumulation in workflows scale across multiple model calls?
- How do past research mistakes prevent future pivot loops from repeating them?
- Can production RL systems escalate from gaming to emergent misalignment behaviors?
Related concepts in this collection 5
- Does step-level confidence outperform global averaging for trace filtering?
  Explores whether measuring confidence at individual reasoning steps, rather than averaging across entire traces, better identifies and filters out low-quality reasoning. Matters because it could dramatically improve both accuracy and compute efficiency in multi-trace reasoning.
  Relation: related quality-filtering principle applied at step level
- Does RL training follow a predictable two-phase learning sequence?
  This explores whether reinforcement learning exhibits consistent phases where basic execution skills must consolidate before strategic reasoning emerges. Understanding this sequence could reveal bottlenecks in scaling reasoning capabilities.
  Relation: tool use as procedural capability that must consolidate before strategic reasoning
- Why do correct reasoning traces contain fewer tokens?
  In o1-like models, correct solutions are systematically shorter than incorrect ones for the same questions. This challenges assumptions that longer reasoning traces indicate better reasoning, and raises questions about what length actually signals.
  Relation: GRPO-RoC's filtered positive trajectories are cleaner and shorter, consistent with this finding
- Why does SFT-then-RL training follow a predictable three-phase pattern?
  When expert data diverges from a model's learned patterns, SFT-then-RL training exhibits disruption, readaptation, and overfitting phases. Understanding this progression could improve how we combine imitation and reinforcement learning.
  Relation: rStar2's non-reasoning SFT avoids the overfitting phase by not injecting reasoning patterns
- Can reinforcement learning scale beyond single-turn language tasks?
  Most RL for LLMs targets simple single-turn problems. This research asks whether RL can handle multi-turn interactive environments with sparse rewards and rich environmental feedback, like real software engineering tasks.
  Relation: complementary agentic RL approach. rStar2 solves trajectory quality in code-tool environments through asymmetric filtering, while SWE-RL solves long-horizon credit assignment in multi-turn code tasks; together they address the two key challenges (noisy intermediate steps and sparse delayed rewards) that make agentic code RL harder than single-turn reasoning RL
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- rStar2-Agent: Agentic Reasoning Technical Report
- Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
- ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
- RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents
- ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs
- Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training
- Intrinsic Credit Assignment for Long Horizon Interaction
- LSR: Reinforcement Learning with Supervised Reward Outperforms SFT in Instruction Following
Original note title
agentic rl with code tools requires asymmetric trajectory filtering because environment noise in correct trajectories teaches the model to tolerate errors