INQUIRING LINE

Can multi-turn rewards fix models that lose track midway?

This explores whether reward signals designed for multi-step, multi-turn tasks can rescue models that drift, abandon a line of reasoning, or lose the thread partway through a long task.


This reads the question as two related failures bundled together: models that lose the thread *mechanically* (the reward arrives only at the end, so the model never learns which middle step went wrong) and models that lose it *behaviorally* (they wander, switch ideas, or run out of room to think). The corpus has surprisingly direct material on both.

The credit-assignment side has the cleanest answer. The classic problem with multi-turn tasks is that a single end-of-episode reward can't tell the model which of fifty steps mattered. "Can full episode rewards per step enable better credit assignment?" tackles this head-on: it assigns the cumulative episode reward to *every* step and then normalizes across rollouts, which surfaces which action sequences actually drove success. A 3B model trained this way beat 72B baselines by 50%. "Can step-wise expert rewards help small models learn hard reasoning?" attacks the same gap from the dense end, rewarding step-by-step similarity to expert moves so the model gets signal even when every rollout fails. And "Can reinforcement learning scale beyond single-turn language tasks?" is the existence proof that this works at scale: modified DAPO training nearly doubled SWE-bench Verified performance (from 20% to 39%) in genuinely stateful, delayed-reward, multi-turn environments. So yes: better-placed rewards measurably fix the "which step lost it" failure.
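To make the "episode reward on every step, normalized across rollouts" idea concrete, here is a minimal sketch. It is an illustration of the description in the source note, not the MS-GRPO authors' implementation: the function names, the undiscounted sum, the use of a population standard deviation, and the `eps` constant are all assumptions.

```python
from statistics import mean, pstdev


def episode_return(step_rewards):
    """Cumulative (undiscounted) reward of one rollout. Assumption: no discounting."""
    return sum(step_rewards)


def per_step_advantages(group, eps=1e-8):
    """Give every step of a rollout that rollout's group-normalized episode return.

    `group` is a list of rollouts for the same task; each rollout is a list of
    per-step rewards. The result has the same shape as `group`.
    """
    returns = [episode_return(rollout) for rollout in group]
    mu = mean(returns)
    sigma = pstdev(returns)  # population std over the group (assumption)
    advantages = []
    for rollout, ret in zip(group, returns):
        adv = (ret - mu) / (sigma + eps)
        # Every step in the rollout shares the same normalized episode-level signal.
        advantages.append([adv] * len(rollout))
    return advantages


# Four hypothetical rollouts of the same three-step task.
group = [[0, 0, 1], [0, 0, 0], [0, 1, 1], [0, 0, 0]]
for rollout, adv in zip(group, per_step_advantages(group)):
    print(rollout, [round(a, 3) for a in adv])
```

Rollouts that finish above the group mean push all of their steps up, and rollouts below it push all of theirs down, so the comparison across rollouts, rather than any per-step label, is what separates useful action sequences from useless ones.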

But here's the twist the corpus surfaces: rewarding *steps* is hard precisely because the steps are messy. "Why do standard process reward models fail on thinking traces?" shows that such models degrade on real thinking traces, which branch, backtrack, and revisit; a naive grader reads those detours as errors rather than as productive exploration. The fix is to treat a model "losing track" not as noise to penalize but as information to supervise. That reframes the whole question: a reward that punishes wandering may train a worse model than one that learns *from* the wandering.

The behavioral side adds a sharper wrinkle. "Do reasoning models switch between ideas too frequently?" finds that o1-style models often lose track by abandoning reasoning paths too early, and that this can be fixed *without any reward training at all*, using only a decoding-time penalty on thought-switching. Meanwhile, "Does limiting reasoning per turn improve multi-turn search quality?" shows another non-reward cause: sometimes the model loses track because it burned its context budget reasoning at length on an early turn, leaving no room to absorb later evidence. Capping reasoning *per turn* preserves the thread. So multi-turn rewards aren't the only lever, and sometimes not the right one.
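Both non-reward levers can be sketched in a few lines. The code below is a hedged illustration, not the implementation from either source: `penalize_switch_tokens`, `remaining_context`, the token ids, and the default `penalty` and `window` values are hypothetical, and the logits are a plain Python list rather than a real model's output.

```python
def penalize_switch_tokens(logits, switch_token_ids, tokens_since_switch,
                           penalty=3.0, window=600):
    """Down-weight thought-switching tokens while the current thought is young.

    `logits` is a list of floats indexed by token id. `switch_token_ids` holds
    the ids of tokens that open a new line of thought (hypothetical ids).
    The penalty applies only during the first `window` tokens of a thought.
    """
    adjusted = list(logits)  # copy so the caller's logits stay untouched
    if tokens_since_switch < window:
        for token_id in switch_token_ids:
            adjusted[token_id] -= penalty
    return adjusted


def remaining_context(context_limit, turns, per_turn_budget=None):
    """Context left after all turns; each turn is (reasoning_tokens, evidence_tokens).

    With `per_turn_budget` set, reasoning on any single turn is capped at that
    many tokens, so an early turn cannot crowd out later evidence.
    """
    used = 0
    for reasoning, evidence in turns:
        if per_turn_budget is not None:
            reasoning = min(reasoning, per_turn_budget)
        used += reasoning + evidence
    return context_limit - used


# A young thought: the switch token (id 1) is made less likely.
print(penalize_switch_tokens([1.0, 2.0, 0.5], {1}, tokens_since_switch=10))

# An early turn that reasons for 700 tokens overruns a 1000-token context
# unless reasoning is capped per turn.
turns = [(700, 200), (100, 200)]
print(remaining_context(1000, turns))
print(remaining_context(1000, turns, per_turn_budget=200))
```

Neither function touches training: the first changes only how the next token is sampled, and the second changes only how much of the window each turn may spend.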

The most interesting thing for a curious reader is what scalar rewards *can't* do. "Can scalar rewards capture all the information in agent feedback?" argues that feedback carries two separable things, *evaluative* (how good was that?) and *directive* (here's how to fix it), and that a numeric reward captures only the first. A model losing track mid-task may need the directive half, which lives in the feedback text, not the score. That points past reward shaping entirely: the richest correction signal might be the language of the feedback, not its magnitude. It is a thread worth pulling in "Can reward models benefit from reasoning before scoring?", where reward models get better by reasoning before they score.


Sources (8 notes)

Can full episode rewards per step enable better credit assignment?

MS-GRPO assigns cumulative episode reward to each step, and group-relative normalization across rollouts surfaces which action sequences succeed. A 3B model post-trained this way outperforms 72B baselines by 50%, showing the training method matters more than scale for multi-step tasks.

Can step-wise expert rewards help small models learn hard reasoning?

Supervised Reinforcement Learning rewards models by measuring alignment with expert actions at each step, providing dense learning signals even when all rollouts fail. This approach bridges the gap between rigid token-by-token imitation (SFT) and sparse outcome-only rewards (RLVR), and works best as a curriculum foundation before outcome-based refinement.

Can reinforcement learning scale beyond single-turn language tasks?

Modified DAPO training doubled SWE-bench Verified performance from 20% to 39% on Qwen2.5-72B, matching larger models. This demonstrates RL works in stateful multi-step environments with delayed rewards and complex feedback, beyond theoretical single-turn MDPs.

Why do standard process reward models fail on thinking traces?

Standard PRMs degrade on trajectory format because thinking traces include branching, backtracking, and weaker coherence than polished responses. ReasonFlux-PRM addresses this by supervising both trajectories and responses, treating failed steps as informative exploration rather than errors.

Do reasoning models switch between ideas too frequently?

o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.

Does limiting reasoning per turn improve multi-turn search quality?

Unrestricted reasoning within single search turns consumes context needed for subsequent retrieval rounds, degrading the agent's ability to incorporate new evidence. Setting per-turn reasoning budgets, not just overall time limits, prevents this context erosion and maintains search quality across iterations.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Can reward models benefit from reasoning before scoring?

Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher re-testing claims about multi-turn reward training in LLMs. The question: *Can multi-turn rewards actually fix models that lose track midway — or have newer methods, model scales, or feedback architectures since made scalar rewards partially obsolete?*

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; treat as perishable snapshots.
- Assigning the cumulative episode reward to every step improved credit assignment: a 3B model trained this way beat 72B baselines by 50% (2025–26).
- Process reward models break on real branching/backtracking traces; treating "wandering" as productive exploration, not error, improves learning (2025–26).
- Decoding-time thought-switching penalties fix underthinking *without* reward training; o1-style models abandon reasoning paths too early (2025).
- Context-budget exhaustion (reasoning per turn, not per episode) causes mid-task loss-of-track independent of rewards (2025).
- Reward signals decompose into evaluative (score) and directive (corrective language) components; numeric rewards capture only the first (2025–26).

Anchor papers (verify; mind their dates):
- arXiv:2505.14674 (Reward Reasoning Model, 2025-05)
- arXiv:2506.18896 (ReasonFlux-PRM trajectory-aware PRMs, 2025-06)
- arXiv:2501.18585 (Underthinking in o1-like LLMs, 2025-01)
- arXiv:2602.12342 (Intrinsic Credit Assignment, 2026-02)

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (o1 successors?), orchestration (multi-agent memory/caching?), or richer feedback architectures have since relaxed or overturned it. Separate the durable question (multi-step credit assignment) from perishable limitations (scalar rewards sufficient; process RMs robust to branching). Cite what resolved each constraint; flag where it still holds.
(2) Surface the strongest contradicting or superseding work from the last ~6 months — anything suggesting scalar rewards *are* sufficient, or that non-reward levers (architecture, decoding, routing) render reward shaping secondary.
(3) Propose 2 research questions assuming the regime has moved: (a) If directive feedback (text, not scores) is the real bottleneck, how do you scale + supervise *reasoning-about-feedback* at inference time? (b) If per-turn context budgeting is the mechanical culprit, how do you design multi-turn tasks that neither starve later turns nor let early turns bloat?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
