Can multi-turn rewards fix models that lose track midway?
This explores whether reward signals designed for multi-step, multi-turn tasks can rescue models that drift, abandon a line of reasoning, or lose the thread partway through a long task.
This reads the question as two related failures bundled together: models that lose the thread *mechanically* (the reward arrives only at the end, so the model never learns which middle step went wrong) and models that lose it *behaviorally* (they wander, switch ideas, or run out of room to think). The corpus has surprisingly direct material on both.
The credit-assignment side has the cleanest answer. The classic problem with multi-turn tasks is that a single end-of-episode reward can't tell the model which of fifty steps mattered. "Can full episode rewards per step enable better credit assignment?" tackles this head-on by assigning the cumulative episode reward to *every* step and then normalizing across rollouts, which surfaces which action sequences actually drove success; a 3B model trained this way beat 72B baselines by 50%. "Can step-wise expert rewards help small models learn hard reasoning?" attacks the same gap from the dense end, rewarding step-by-step similarity to expert moves so the model gets signal even when every rollout fails. And "Can reinforcement learning scale beyond single-turn language tasks?" is the existence proof that this works at scale: modified DAPO training roughly doubled SWE-bench Verified performance (from 20% to 39% on Qwen2.5-72B) in genuinely stateful, delayed-reward, multi-turn environments. So yes: better-placed rewards measurably fix the "which step lost it" failure.
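The "episode reward on every step, normalized across rollouts" idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the published method: the function name `step_advantages`, the use of a population standard deviation, and the `eps` guard are all choices made here for the example.

```python
from statistics import mean, pstdev


def step_advantages(rollouts, eps=1e-8):
    """Broadcast a group-normalized episode return to every step.

    rollouts: list of rollouts, each a list of per-step rewards
              (typically zero until the final step).
    Returns one advantage list per rollout, with one value per step.
    """
    # Cumulative episode reward for each rollout in the group.
    returns = [sum(steps) for steps in rollouts]

    # Group-relative normalization: compare each rollout to its siblings.
    mu = mean(returns)
    sigma = pstdev(returns)
    normalized = [(g - mu) / (sigma + eps) for g in returns]

    # Every step in a rollout receives that rollout's normalized return.
    return [[adv] * len(steps) for adv, steps in zip(normalized, rollouts)]


# Four rollouts of the same task; two succeed (final reward 1.0), two fail.
group = [
    [0.0, 0.0, 1.0],
    [0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0],
]
advantages = step_advantages(group)
```

In this toy group, every step of a successful rollout gets a positive advantage and every step of a failed one gets a negative advantage. When all rollouts score the same, the advantages collapse to zero, which is the situation the dense step-wise expert reward is meant to cover.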
But here's the twist the corpus surfaces: rewarding *steps* is hard precisely because the steps are messy. "Why do standard process reward models fail on thinking traces?" shows that standard process reward models break on real thinking traces: the traces branch, backtrack, and revisit, and a naive grader reads those detours as errors rather than as productive exploration. The fix is to treat a model "losing track" not as noise to penalize but as information to supervise. That reframes the whole question: a reward that punishes wandering may train a worse model than one that learns *from* the wandering.
The behavioral side adds a sharper wrinkle. "Do reasoning models switch between ideas too frequently?" finds that o1-style models often lose track by abandoning reasoning paths too early, and that you can fix it *without any reward training at all*, using just a decoding-time penalty on thought-switching. Meanwhile, "Does limiting reasoning per turn improve multi-turn search quality?" shows another non-reward cause: sometimes the model loses track because it literally burned its context budget thinking too hard on an early turn, leaving no room to absorb later evidence. Capping reasoning *per turn* preserves the thread. So multi-turn rewards aren't the only lever, and sometimes not the right one.
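A decoding-time penalty on thought-switching can be pictured as a small adjustment to the logits before sampling. The sketch below is an assumption-laden illustration rather than the TIP implementation: the function name, the `alpha` (penalty strength) and `beta` (how many steps the penalty lasts) parameters, and the token ids are all hypothetical.

```python
def penalize_thought_switch(logits, switch_token_ids, steps_in_thought,
                            alpha=3.0, beta=600):
    """Return a copy of `logits` with thought-switch tokens penalized.

    logits:            list of floats, one per vocabulary id.
    switch_token_ids:  ids of tokens that open a new line of thought
                       (for example the id of "Alternatively"); hypothetical.
    steps_in_thought:  tokens generated since the current thought began.
    alpha:             amount subtracted from each switch-token logit.
    beta:              the penalty applies only while steps_in_thought < beta.
    """
    adjusted = list(logits)  # leave the caller's logits untouched
    if steps_in_thought < beta:
        for token_id in switch_token_ids:
            adjusted[token_id] -= alpha
    return adjusted


# Toy vocabulary of three tokens; id 1 stands in for a switch token.
early = penalize_thought_switch([2.0, 1.5, 0.5], {1},
                                steps_in_thought=2, alpha=3.0, beta=5)
late = penalize_thought_switch([2.0, 1.5, 0.5], {1},
                               steps_in_thought=5, alpha=3.0, beta=5)
```

Here `early` lowers the switch token's logit while the thought is still young, and `late` leaves the logits unchanged once the thought has run for `beta` steps. No weights are updated, which is the point of the decoding-only approach.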
The most interesting thing for a curious reader is what scalar rewards *can't* do. "Can scalar rewards capture all the information in agent feedback?" argues that feedback actually carries two separable things, *evaluative* (how good was that?) and *directive* (here's how to fix it), and a numeric reward only captures the first. A model losing track mid-task may need the directive half, which lives in the feedback text, not the score. That points past reward shaping entirely: the richest correction signal might be the language of the feedback, not its magnitude. It is a thread worth pulling in "Can reward models benefit from reasoning before scoring?", where reward models get better by reasoning before they score.
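The evaluative/directive split is easy to see in a data structure. The sketch below only illustrates what a scalar reward throws away; it is not the token-level distillation approach from the note, and the `Feedback` class and both helper functions are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Feedback:
    score: float    # evaluative: how good was the action?
    critique: str   # directive: how should it change?


def to_scalar_reward(feedback):
    """Keep only the evaluative half; the critique text is discarded."""
    return feedback.score


def to_revision_prompt(task, attempt, feedback):
    """Keep the directive half by feeding the critique back as text."""
    return (
        f"Task: {task}\n"
        f"Previous attempt: {attempt}\n"
        f"Feedback: {feedback.critique}\n"
        "Revise the attempt accordingly."
    )


fb = Feedback(score=0.2, critique="Step 3 reuses a stale file path; re-read the directory first.")
reward = to_scalar_reward(fb)
prompt = to_revision_prompt("fix the failing test", "patched utils.py", fb)
```

Two pieces of feedback with the same `score` but different `critique` strings produce identical rewards and different revision prompts, which is the gap the paragraph above describes.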
Sources (8 notes)
MS-GRPO assigns cumulative episode reward to each step, and group-relative normalization across rollouts surfaces which action sequences succeed. A 3B model post-trained this way outperforms 72B baselines by 50%, showing the training method matters more than scale for multi-step tasks.
Supervised Reinforcement Learning rewards models by measuring alignment with expert actions at each step, providing dense learning signals even when all rollouts fail. This approach bridges the gap between rigid token-by-token imitation (SFT) and sparse outcome-only rewards (RLVR), and works best as a curriculum foundation before outcome-based refinement.
Modified DAPO training roughly doubled SWE-bench Verified performance, from 20% to 39%, on Qwen2.5-72B, matching larger models. This demonstrates that RL works in stateful multi-step environments with delayed rewards and complex feedback, not only in the single-turn MDP setting usually assumed in theory.
Standard PRMs degrade on trajectory format because thinking traces include branching, backtracking, and weaker coherence than polished responses. ReasonFlux-PRM addresses this by supervising both trajectories and responses, treating failed steps as informative exploration rather than errors.
o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.
Unrestricted reasoning within single search turns consumes context needed for subsequent retrieval rounds, degrading the agent's ability to incorporate new evidence. Setting per-turn reasoning budgets, not just overall time limits, prevents this context erosion and maintains search quality across iterations.
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.