How do complete multi-turn trajectories differ from isolated task examples?
This explores what a model gains from seeing a full start-to-finish run of a task (every step, including the dead ends and recoveries) versus a pile of disconnected single examples — and why that difference shows up in how systems learn.
The short version from the corpus: a trajectory carries the *order and structure* of decisions, and that structure is itself the lesson, not incidental packaging around the answer.
The sharpest evidence is that in-context learning of sequential decision-making only works when the model sees full or partial trajectories from the same environment level, not scattered one-off examples. This "trajectory burstiness" lets a model generalize to very different tasks without any weight updates (see "Why do trajectories matter more than individual examples for in-context learning?"). Isolated examples teach you what a good answer looks like; trajectories teach you how a good answer gets *built*, and it turns out the building process is what transfers. A striking parallel finding: instruction tuning on semantically empty or even wrong instructions works almost as well as tuning on correct ones, because what the model actually absorbs is the shape of the output space, not task understanding (see "Does instruction tuning teach task understanding or output format?"). Put those two together and a theme emerges: models are far more sensitive to structure and format than to the literal content we think we're teaching.
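To make the contrast concrete, here is a minimal sketch of assembling a "bursty" in-context prompt: whole episodes drawn from the same environment level as the query, rather than single steps sampled from anywhere. The `Step` and `Trajectory` types, the `level_id` field, and both function names are hypothetical, introduced only for illustration; the source notes do not specify a data format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    observation: str
    action: str
    reward: float


@dataclass
class Trajectory:
    level_id: str  # hypothetical tag: which environment level produced the episode
    steps: List[Step]


def build_bursty_context(pool: List[Trajectory], query_level: str,
                         max_trajectories: int = 2) -> List[Trajectory]:
    """Keep whole episodes from the query's level, preserving step order."""
    same_level = [t for t in pool if t.level_id == query_level]
    return same_level[:max_trajectories]


def render(context: List[Trajectory]) -> str:
    """Serialize the selected episodes into prompt text, one step per line."""
    lines = []
    for i, traj in enumerate(context):
        lines.append(f"# episode {i} (level {traj.level_id})")
        for s in traj.steps:
            lines.append(f"obs={s.observation} act={s.action} r={s.reward}")
    return "\n".join(lines)
```

The design choice worth noticing is that selection happens at the episode level: a step is never separated from the steps that preceded it, which is the property the isolated-example baseline lacks.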
The deeper payoff of complete trajectories is that they expose internal structure you can't see in a final answer. Within a single reasoning run, a few planning and backtracking sentences act as "thought anchors" that disproportionately steer everything downstream. These pivots are sparse and locatable, but only if you have the whole trace (see "Which sentences actually steer a reasoning trace?"). That same insight powers a whole family of training tricks: the *shape* of a trajectory (tree branches, tool-call positions, expert-aligned steps) can be mined for dense step-by-step reward signals, removing the need for hand-annotated process reward models (see "Can trajectory structure replace hand-annotated process rewards?"). An isolated input-output pair has no shape to mine.
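One way to picture "mining shape for dense rewards" is a rollout tree in which only the leaves carry an outcome score. The sketch below assigns each branch a step-level signal equal to its subtree's mean outcome minus the mean across its siblings. This is a generic illustration under that assumption, not the actual formula used by Tree-GRPO, Supervised RL, or ToolPO; the `Node` type and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)
    outcome: Optional[float] = None  # set only on leaves (final task reward)


def value(node: Node) -> float:
    """Mean outcome over the subtree: leaf reward, or average of child values."""
    if not node.children:
        return float(node.outcome)  # assumes every leaf has an outcome
    return sum(value(c) for c in node.children) / len(node.children)


def step_advantages(root: Node) -> Dict[str, float]:
    """Score each branch against the average of its siblings."""
    adv: Dict[str, float] = {}

    def visit(node: Node) -> None:
        if not node.children:
            return
        baseline = value(node)  # equals the mean of the children's values
        for child in node.children:
            adv[child.name] = value(child) - baseline
            visit(child)

    visit(root)
    return adv
```

Every internal branch point yields a signal even though only final outcomes were observed, which is the sense in which the tree topology substitutes for annotated per-step labels.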
Trajectories also let you treat success and failure differently, a distinction that isolated examples flatten away. One approach keeps successful episodes as concrete demonstrations but distills failures into abstracted lessons, mirroring how human experts reason, and beats uniform processing while using far less context (see "Should successful and failed episodes be processed differently?"). Thinking in trajectories is also what lets reinforcement learning scale to long, stateful, multi-turn software tasks with delayed rewards, where single-turn framings simply don't apply: one reported run roughly doubled SWE-bench Verified performance, from 20% to 39% (see "Can reinforcement learning scale beyond single-turn language tasks?"). And because trajectories unfold over turns, you can manage them over time: capping reasoning *per turn* preserves context for later retrieval rounds, something that only makes sense once you think in trajectories rather than snapshots (see "Does limiting reasoning per turn improve multi-turn search quality?").
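The per-turn cap is easy to see with toy token accounting. The sketch below assumes a fixed context limit, a fixed cost for each round of retrieved evidence, and a search that stops once a turn no longer fits; all of these numbers and the `run_search` function are hypothetical, chosen only to show the mechanism.

```python
from typing import List, Optional


def run_search(reasoning_lengths: List[int], evidence_len: int,
               context_limit: int, per_turn_cap: Optional[int] = None) -> int:
    """Return how many search turns fit in the context window.

    reasoning_lengths: tokens the agent would like to spend reasoning each turn.
    per_turn_cap: if set, reasoning in any single turn is truncated to this size.
    """
    used = 0
    turns_completed = 0
    for wanted in reasoning_lengths:
        reasoning = wanted if per_turn_cap is None else min(wanted, per_turn_cap)
        if used + reasoning + evidence_len > context_limit:
            break  # no room left for this turn's reasoning plus its evidence
        used += reasoning + evidence_len
        turns_completed += 1
    return turns_completed
```

With three turns that each want 600 reasoning tokens, 200 tokens of evidence per turn, and a 2000-token limit, the uncapped agent completes two turns while an agent capped at 300 tokens per turn completes all three. A global budget alone would not help here, because the first turns could still consume it.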
The thing you might not have expected to learn: trajectory-based training has its own internal *clock*. RL on full trajectories reliably moves through two phases. First execution correctness drives the learning, then strategic planning becomes the bottleneck, and you can get real gains by concentrating optimization on planning tokens once that shift happens (see "Does RL training follow a predictable two-phase learning sequence?"). Sequencing matters at the curriculum level too: imitating reasoning trajectories first, then sharpening against verifiable rewards, beats either alone because the imitation phase creates the reasonable rollouts the reward phase needs to be informative (see "Does sequencing imitation then exploration training improve reasoning?"). Isolated examples have no before-and-after. Complete trajectories are temporal objects, and that temporality is exactly what the learning hooks onto.
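The notes say only that optimization is concentrated on planning tokens, not how. One simple way to do that, shown here purely as an assumption, is to upweight planning tokens in a per-token policy-gradient loss. The `is_planning` mask, the `planning_weight` parameter, and the function itself are hypothetical; a real system would also need some method for labeling which tokens count as planning.

```python
from typing import List


def weighted_pg_loss(logprobs: List[float], advantages: List[float],
                     is_planning: List[bool],
                     planning_weight: float = 2.0) -> float:
    """Weighted mean of -advantage * logprob, with planning tokens upweighted."""
    total = 0.0
    weight_sum = 0.0
    for lp, adv, plan in zip(logprobs, advantages, is_planning):
        w = planning_weight if plan else 1.0
        total += -w * adv * lp
        weight_sum += w
    if weight_sum == 0.0:
        return 0.0  # empty trajectory: nothing to optimize
    return total / weight_sum
```

Setting `planning_weight` to 1.0 recovers a plain per-token average, so the same function can serve the first phase and then be switched to a higher weight once planning becomes the bottleneck.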
Sources (9 notes)
In-context learning for sequential decision-making requires full or partial trajectories from the same environment level, not just isolated examples. This structural property—trajectory burstiness—allows models to generalize across vastly different tasks without weight updates.
Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions (43% versus 42.6% for a random baseline). The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.
Counterfactual resampling, attention analysis, and causal suppression all identify planning and backtracking sentences as thought anchors—sparse critical points that guide subsequent reasoning. These are functional pivots, not noise.
Tree-GRPO, Supervised RL, and ToolPO each convert sparse outcome rewards into dense step signals by exploiting different structural features—tree topology, expert-aligned actions, and tool-call positions—eliminating the need for annotated process reward models.
SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.
Modified DAPO training doubled SWE-bench Verified performance from 20% to 39% on Qwen2.5-72B, matching larger models. This demonstrates RL works in stateful multi-step environments with delayed rewards and complex feedback, beyond theoretical single-turn MDPs.
Unrestricted reasoning within single search turns consumes context needed for subsequent retrieval rounds, degrading the agent's ability to incorporate new evidence. Setting per-turn reasoning budgets, not just overall time limits, prevents this context erosion and maintains search quality across iterations.
Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.
Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.