INQUIRING LINE

How do complete multi-turn trajectories differ from isolated task examples?

This explores what a model gains from seeing a full start-to-finish run of a task (every step, including the dead ends and recoveries) versus a pile of disconnected single examples — and why that difference shows up in how systems learn.


The short version from the corpus: a trajectory carries the *order and structure* of decisions, and that structure is itself the lesson, not incidental packaging around the answer.

The sharpest evidence is that in-context learning of sequential decision-making only works when the context holds full or partial trajectories from the same environment level, not scattered one-off examples. This "trajectory burstiness" lets a model generalize to wildly different tasks without any weight updates (see "Why do trajectories matter more than individual examples for in-context learning?"). Isolated examples teach you what a good answer looks like; trajectories teach you how a good answer gets *built*, and it turns out the building process is what transfers. A striking parallel finding: instruction tuning on semantically empty or even wrong instructions works almost as well as tuning on correct ones, because what the model actually absorbs is the shape of the output space, not task understanding (see "Does instruction tuning teach task understanding or output format?"). Put those two together and a theme emerges: models are far more sensitive to structure and format than to the literal content we think we're teaching.
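To make the contrast concrete, here is a minimal sketch of the two ways a context could be assembled. Everything in it is an illustrative assumption (the `Step` and `Trajectory` types, the `level` label, the `build_context` helper); it is not the setup used in the cited work, only a picture of "runs from the same level" versus "scattered single steps".

```python
import random
from dataclasses import dataclass


@dataclass
class Step:
    observation: str
    action: str


@dataclass
class Trajectory:
    level: str   # hypothetical label for the environment level the run came from
    steps: list  # ordered Step objects


def build_context(pool, query_level, n, bursty=True, seed=0):
    """Pick n context items for a query that comes from `query_level`."""
    rng = random.Random(seed)
    if bursty:
        # Bursty: whole runs, all from the same level as the query.
        same_level = [t for t in pool if t.level == query_level]
        return rng.sample(same_level, min(n, len(same_level)))
    # Isolated: single steps cut loose from their runs, drawn from any level.
    singles = [Trajectory(t.level, [s]) for t in pool for s in t.steps]
    return rng.sample(singles, min(n, len(singles)))


pool = [
    Trajectory("maze-A", [Step("start", "left"), Step("wall", "back"), Step("door", "open")]),
    Trajectory("maze-A", [Step("start", "right"), Step("key", "take")]),
    Trajectory("maze-B", [Step("start", "up"), Step("pit", "jump")]),
]
bursty = build_context(pool, "maze-A", n=2)
isolated = build_context(pool, "maze-A", n=2, bursty=False)
```

The bursty context keeps each run's order intact, including the dead end and recovery in the first maze-A run; the isolated context keeps only single observation-action pairs with no before or after.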

The deeper payoff of complete trajectories is that they expose internal structure you can't see in a final answer. Within a single reasoning run, a few planning and backtracking sentences act as "thought anchors" that disproportionately steer everything downstream. The pivots are sparse and locatable, but only if you have the whole trace (see "Which sentences actually steer a reasoning trace?"). That same insight powers a whole family of training tricks: the *shape* of a trajectory (tree branches, tool-call positions, expert-aligned steps) can be mined for dense step-by-step reward signals, removing the need for hand-annotated process supervision (see "Can trajectory structure replace hand-annotated process rewards?"). An isolated input-output pair has no shape to mine.
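One way to picture "mining the shape" is a generic tree-value sketch. It is an assumption made for illustration, not the actual formula of Tree-GRPO, Supervised RL, or ToolPO: rollouts that share a prefix form a tree, each prefix is valued by the mean outcome of the rollouts passing through it, and each step is credited with the change in value it causes. The `step_signals` helper and the toy rollouts are hypothetical.

```python
from collections import defaultdict


def step_signals(rollouts):
    """rollouts: list of (steps, outcome) pairs, where steps is a tuple of step names.

    Returns {prefix: value(prefix) - value(parent prefix)} for every non-empty prefix.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for steps, outcome in rollouts:
        # Every prefix of a rollout (including the empty root) sees its outcome.
        for i in range(len(steps) + 1):
            prefix = tuple(steps[:i])
            totals[prefix] += outcome
            counts[prefix] += 1
    value = {p: totals[p] / counts[p] for p in totals}
    # A step's signal is how much it moved the expected outcome.
    return {p: value[p] - value[p[:-1]] for p in value if p}


rollouts = [
    (("plan", "search", "answer"), 1.0),
    (("plan", "search", "guess"), 0.0),
    (("plan", "skip", "guess"), 0.0),
    (("plan", "skip", "answer"), 0.0),
]
signals = step_signals(rollouts)
```

Only one rollout earns an outcome reward, yet the branching already gives `search` a positive signal and `skip` a negative one. A lone input-output pair has a single prefix chain and nothing to compare against.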

Trajectories also let you treat success and failure differently, which isolated examples flatten away. One approach keeps successful episodes as concrete demonstrations but distills failures into abstracted lessons, mirroring how human experts reason, and beats uniform processing while using far less context (see "Should successful and failed episodes be processed differently?"). Thinking in whole episodes is also what lets reinforcement learning scale to long, stateful, multi-turn software tasks with delayed rewards, roughly doubling SWE-bench Verified performance, where single-turn framings simply don't apply (see "Can reinforcement learning scale beyond single-turn language tasks?"). And because trajectories unfold over turns, you can manage them over time: capping reasoning *per turn* preserves context for later retrieval rounds, something that only makes sense once you think in trajectories rather than snapshots (see "Does limiting reasoning per turn improve multi-turn search quality?").
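The per-turn cap is easy to see with token arithmetic alone. The sketch below is a toy accounting model under stated assumptions (a fixed context limit, a fixed evidence size per retrieval round, and a hypothetical `turns_with_evidence` helper); it simulates no real agent and only counts how many rounds still fit.

```python
def turns_with_evidence(reasoning_tokens, evidence_tokens, context_limit, per_turn_cap=None):
    """Count search turns whose reasoning plus retrieved evidence still fit in context.

    reasoning_tokens: tokens the model would like to spend reasoning in each turn.
    per_turn_cap: optional ceiling on reasoning tokens per turn (None means uncapped).
    """
    used = 0
    completed = 0
    for wanted in reasoning_tokens:
        spend = wanted if per_turn_cap is None else min(wanted, per_turn_cap)
        if used + spend + evidence_tokens > context_limit:
            break  # no room left to take in this round's evidence
        used += spend + evidence_tokens
        completed += 1
    return completed


wanted = [3000, 3000, 3000, 3000]
uncapped = turns_with_evidence(wanted, evidence_tokens=500, context_limit=8000)
capped = turns_with_evidence(wanted, evidence_tokens=500, context_limit=8000, per_turn_cap=1000)
```

With these made-up numbers the uncapped run fits two rounds of evidence and the capped run fits all four. An overall budget would not distinguish the two cases; only a per-turn limit protects the later rounds.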

The thing you might not have expected to learn: trajectory-based training has its own internal *clock*. RL on full trajectories reliably moves through two phases. First execution correctness drives the learning, then strategic planning becomes the bottleneck, and you can get real gains by concentrating optimization on planning tokens once that shift happens (see "Does RL training follow a predictable two-phase learning sequence?"). Sequencing matters at the curriculum level too: imitating reasoning trajectories first, then sharpening against verifiable rewards, beats either alone because the imitation phase creates the reasonable rollouts the reward phase needs to be informative (see "Does sequencing imitation then exploration training improve reasoning?"). Isolated examples have no before-and-after. Complete trajectories are temporal objects, and that temporality is exactly what the learning hooks onto.
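"Concentrating optimization on planning tokens" can be sketched as a per-token weight in a policy-gradient style loss. This is a simplified illustration, not the cited method: it assumes the planning mask is already given (how planning tokens are identified is left open here), uses one advantage per trajectory, and the `weighted_pg_loss` helper and `planning_weight` parameter are hypothetical.

```python
def weighted_pg_loss(logprobs, advantage, is_planning, planning_weight=1.0):
    """Weighted average of per-token policy-gradient terms (-advantage * logprob).

    logprobs: log-probability of each sampled token in the trajectory.
    advantage: scalar advantage for the whole trajectory.
    is_planning: booleans marking planning tokens (assumed to be given).
    planning_weight: values above 1.0 shift the loss toward planning tokens.
    """
    weights = [planning_weight if p else 1.0 for p in is_planning]
    total = sum(w * -advantage * lp for w, lp in zip(weights, logprobs))
    return total / sum(weights)


logprobs = [-1.0, -2.0]          # one planning token, one execution token
mask = [True, False]
phase_one = weighted_pg_loss(logprobs, advantage=1.0, is_planning=mask)
phase_two = weighted_pg_loss(logprobs, advantage=1.0, is_planning=mask, planning_weight=3.0)
```

In the uniform case the planning token contributes one third of the weighted sum; with `planning_weight=3.0` it contributes three fifths. A training loop could keep the weight at 1.0 during the execution phase and raise it once planning becomes the bottleneck.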


Sources (9 notes)

Why do trajectories matter more than individual examples for in-context learning?

In-context learning for sequential decision-making requires full or partial trajectories from the same environment level, not just isolated examples. This structural property—trajectory burstiness—allows models to generalize across vastly different tasks without weight updates.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions; the reported gap is 43% against 42.6% for a random baseline. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Which sentences actually steer a reasoning trace?

Counterfactual resampling, attention analysis, and causal suppression all identify planning and backtracking sentences as thought anchors—sparse critical points that guide subsequent reasoning. These are functional pivots, not noise.

Can trajectory structure replace hand-annotated process rewards?

Tree-GRPO, Supervised RL, and ToolPO each convert sparse outcome rewards into dense step signals by exploiting different structural features—tree topology, expert-aligned actions, and tool-call positions—eliminating the need for annotated process reward models.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Can reinforcement learning scale beyond single-turn language tasks?

Modified DAPO training roughly doubled SWE-bench Verified performance, from 20% to 39%, on Qwen2.5-72B, matching larger models. This demonstrates that RL works in stateful multi-step environments with delayed rewards and complex feedback, not only in single-turn settings.

Does limiting reasoning per turn improve multi-turn search quality?

Unrestricted reasoning within single search turns consumes context needed for subsequent retrieval rounds, degrading the agent's ability to incorporate new evidence. Setting per-turn reasoning budgets, not just overall time limits, prevents this context erosion and maintains search quality across iterations.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Does sequencing imitation then exploration training improve reasoning?

Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about trajectory-based learning in LLMs. The question remains: what makes complete multi-turn trajectories more powerful teaching objects than isolated task examples?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2025. A trajectory carries *order and structure* that isolated examples lack:
• In-context learning of sequential decision-making only generalizes when models see full or partial trajectories from the same setting, not scattered one-off examples (~2023).
• Models absorb the *shape* of the output space and decision flow, not literal task understanding; instruction tuning works even with semantically empty instructions (~2023).
• Within a reasoning trace, a few "thought anchors" (planning and backtracking sentences) steer downstream outputs disproportionately; this structure is locatable only in full traces (~2025).
• Process supervision can be derived from structural features of agent trajectories, replacing hand-annotation (~2024–2025).
• RL on full trajectories exhibits a two-phase dynamic: execution correctness first, then strategic planning bottleneck; optimization can target planning tokens in phase two (~2025).
• RL on long-horizon software engineering tasks (multi-turn, delayed rewards) roughly doubles SWE-bench performance; single-turn framings do not apply (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2312.03801 (2023): Generalization to New Sequential Decision Making Tasks with In-Context Learning
• arXiv:2506.19143 (2025): Thought Anchors: Which LLM Reasoning Steps Matter?
• arXiv:2508.03501 (2025): Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning
• arXiv:2507.14783 (2025): Omni-Thinker: Scaling Multi-Task RL in LLMs with Hybrid Reward and Task Scheduling

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer training methods (e.g., test-time scaling, latent reasoning, hybrid reward scheduling), evaluation harnesses, or multi-agent orchestration have since relaxed or overturned it. Separate the durable question (what makes trajectory *structure* learnable?) from perishable limitations (e.g., phase-two bottlenecks, RL scaling ceilings). Cite what resolved each constraint and plainly flag where it still holds.
(2) Surface the strongest *contradicting or superseding* work from the last ~3 months. Look especially for papers claiming isolated task examples, curriculum flattening, or reward engineering can match or beat full-trajectory methods.
(3) Propose 2 research questions that *assume* the regime has moved — e.g., if thought anchors now scale to reasoning at inference time, what structure emerges in *failures*? If multi-task RL removes phase boundaries, what replaces them as learning bottlenecks?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
