How does trajectory filtering handle noise when language models use code execution tools?

This note explores how reinforcement learning copes with messy, error-filled runs when an LLM calls code tools mid-reasoning, specifically the 'asymmetric trajectory filtering' idea that treats successful and failed attempts differently.

The corpus has a sharp, specific answer at its center. When a model uses tools, a 'trajectory' is the whole transcript of thinking, code calls, and tool outputs that leads to an answer. The problem surfaced in "Why do correct code trajectories teach models to tolerate errors?" is subtle: a trajectory can reach the *correct* final answer while being full of bad intermediate steps such as failed code calls, recovered errors, and sloppy detours. If you reward those trajectories just because the answer was right, you teach the model that messiness is fine. The GRPO-RoC method handles this asymmetrically: it filters *positive* trajectories hard for clean reasoning quality, but keeps a *diverse* set of failures intact as negative signal. The asymmetry is the whole trick: you want pristine examples of what to imitate, but rich, varied examples of what to avoid. With this filtering, a 14B model reached frontier math performance in only 510 RL steps, surpassing much larger models.
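The asymmetry described above can be sketched in a few lines. This is a minimal illustration, not the GRPO-RoC implementation: the `Trajectory` fields, the tool-error rate as the quality score, the even positive/negative split, and the `asymmetric_filter` name are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class Trajectory:
    """Hypothetical summary of one rollout (fields are assumptions)."""
    answer_correct: bool  # did the final answer match the reference?
    tool_calls: int       # total code-execution calls in the transcript
    tool_errors: int      # calls that raised or returned an error


def error_rate(t: Trajectory) -> float:
    """Fraction of tool calls that failed; 0.0 when no tool was used."""
    return t.tool_errors / t.tool_calls if t.tool_calls else 0.0


def asymmetric_filter(rollouts, keep, rng=None):
    """Downsample rollouts, treating successes and failures differently."""
    rng = rng or random.Random(0)
    positives = [t for t in rollouts if t.answer_correct]
    negatives = [t for t in rollouts if not t.answer_correct]

    # Positives: filter hard for quality by keeping only the cleanest runs.
    positives.sort(key=error_rate)
    n_pos = min(len(positives), keep // 2)  # assumed even split
    kept_pos = positives[:n_pos]

    # Negatives: sample uniformly with no quality filter, so the
    # variety of failure modes is preserved as negative signal.
    n_neg = min(len(negatives), keep - n_pos)
    kept_neg = rng.sample(negatives, n_neg)

    return kept_pos + kept_neg
```

In this sketch a correct answer reached through many failed tool calls is dropped whenever a cleaner correct run is available, while failed runs are never ranked against each other.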

Why noise is so dangerous here becomes clearer alongside "Do frontier LLMs silently corrupt documents in long workflows?", which shows errors in long multi-step workflows compounding *silently* without ever plateauing: documents degrade by roughly 25% over extended relay tasks of up to 50 round-trips. That's the failure mode trajectory filtering is fighting upstream: if you don't scrub the training signal, you bake error-tolerance into the model, and those tolerated errors compound at inference time. Keeping failures as explicit negatives, rather than just discarding them, is how you inoculate against that drift instead of pretending it won't happen.

There's a deeper reason code execution matters at all, which "Can models treat long prompts as external code environments?" makes vivid: offloading work to a real Python environment sidesteps things the model genuinely can't do in its own head. That connects to "Do large language models actually perform iterative optimization?", which found that models can't actually run iterative numerical procedures internally; they pattern-match to memorized templates and emit plausible-but-wrong numbers. A code tool is the escape hatch from that limitation, but only if training rewards the model for *using the tool well* rather than for guessing. Trajectory filtering is what enforces 'use the tool, trust its output' over 'fake it convincingly.'

The boundary on all of this is named in "What stops large language models from improving themselves?": self-improvement is formally capped by the generation-verification gap, so every reliable fix needs something external to validate it. Code execution *is* that external verifier; the tool's pass/fail is ground truth the model can't hallucinate around. Asymmetric filtering, read this way, is a method for squeezing maximum signal out of that external check: keep the verified-clean wins, keep the instructive losses, throw away the lucky-but-dirty middle. If you want one more thread to pull, "Why do trajectories matter more than individual examples for in-context learning?" argues that whole trajectories (not isolated examples) are what let models generalize across tasks, a reminder that the trajectory, not the final answer, is the real unit of learning here.
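To make 'code execution as external verifier' concrete, here is a minimal sketch of a pass/fail check driven by actually running candidate code. The `run_check` helper and its signature are hypothetical, and a bare `exec` is not a sandbox; a real training setup would assume an isolated execution environment.

```python
def run_check(candidate_src: str, func_name: str, cases) -> bool:
    """Run model-written code and report pass/fail from its real outputs.

    `cases` is a list of (args_tuple, expected_value) pairs.
    """
    namespace = {}
    try:
        # Execute the candidate source; NOT sandboxed, illustration only.
        exec(candidate_src, namespace)
        func = namespace[func_name]
        # The verdict comes from executed results, not from model claims.
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        # Syntax errors, missing names, and runtime crashes all count as fail.
        return False
```

The point of the design is that the verdict depends only on what the code does when run, which is the kind of signal the model cannot produce by asserting an answer.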

Sources 6 notes

Why do correct code trajectories teach models to tolerate errors?

GRPO-RoC filters positive trajectories for quality while preserving diverse failures as negative signal, allowing a 14B model to reach frontier math performance in 510 RL steps, surpassing much larger models with cleaner reasoning.

Do frontier LLMs silently corrupt documents in long workflows?

Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.

Can models treat long prompts as external code environments?

Recursive Language Models store long prompts in a Python REPL and query them via code execution, avoiding attention degradation. RLMs outperform base models even on shorter prompts while handling inputs two orders of magnitude beyond context windows.

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

What stops large language models from improving themselves?

Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.

Why do trajectories matter more than individual examples for in-context learning?

In-context learning for sequential decision-making requires full or partial trajectories from the same environment level, not just isolated examples. This structural property—trajectory burstiness—allows models to generalize across vastly different tasks without weight updates.
