INQUIRING LINE

Why do next-turn reward objectives fail to encourage multi-turn goal progress?

This explores credit assignment in multi-turn RL — why rewarding each turn on its immediate quality (a myopic, next-turn objective) doesn't add up to progress toward a goal that only resolves several turns later.


The corpus's short answer is that a next-turn objective is *myopic*: it measures local correctness, whereas multi-turn success is a property of the whole trajectory. A signal that only ever looks one step ahead has no clean way to credit a turn for helping the agent succeed three turns later.
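To make the myopia concrete, here is a minimal toy sketch. Everything in it is hypothetical (the `Trajectory` type, both scoring functions, and all numbers are invented for illustration and come from none of the cited notes). It shows how averaging per-turn scores can prefer a trajectory that never reaches the goal.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """A hypothetical multi-turn episode (illustrative values only)."""
    name: str
    turn_scores: list[float]  # immediate quality of each turn, judged in isolation
    goal_reached: bool        # outcome, only known once the episode ends


def next_turn_objective(traj: Trajectory) -> float:
    # Myopic view: average of the per-turn scores, ignoring the outcome.
    return sum(traj.turn_scores) / len(traj.turn_scores)


def episode_objective(traj: Trajectory) -> float:
    # Trajectory-level view: 1 if the goal resolved, else 0.
    return 1.0 if traj.goal_reached else 0.0


# Every turn looks polished, but the conversation never resolves the goal.
polished = Trajectory("polished_but_stalls", [0.9, 0.9, 0.9], False)
# The first turn looks weak in isolation (a clarifying question) but sets up success.
clarifying = Trajectory("asks_clarifying_question", [0.4, 0.8, 0.9], True)

candidates = [polished, clarifying]
best_myopic = max(candidates, key=next_turn_objective)
best_episode = max(candidates, key=episode_objective)

print("next-turn objective prefers:", best_myopic.name)
print("episode objective prefers:  ", best_episode.name)
```

Under these made-up numbers the two objectives rank the trajectories in opposite order, which is the gap the rest of this line of inquiry is about.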

The most direct rebuttal to next-turn rewards is to stop using them. MS-GRPO assigns the *cumulative episode reward* to every step and then normalizes across rollouts, so the training signal surfaces which whole action sequences succeeded rather than which individual moves looked good in isolation. A 3B model trained this way outperformed 72B baselines by 50%, which suggests the credit-assignment scheme mattered more than scale (see "Can full episode rewards per step enable better credit assignment?"). The flip side is that pure outcome rewards are *sparse*: when every rollout fails, there is no gradient at all. Supervised Reinforcement Learning threads this needle by giving dense step-wise rewards based on similarity to expert actions, so the model still learns from failed trajectories. It sits between rigid token imitation and outcome-only rewards (see "Can step-wise expert rewards help small models learn hard reasoning?").
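The episode-level credit idea can be sketched roughly as follows. This is a minimal illustration of "one cumulative reward per rollout, normalized across the group, copied to every step", not MS-GRPO's actual implementation. The function name, the `eps` constant, and the use of the population standard deviation are all assumptions.

```python
import statistics


def episode_level_advantages(episode_rewards, steps_per_episode, eps=1e-8):
    """Give every step of a rollout the same group-normalized episode advantage.

    episode_rewards:   cumulative reward of each rollout in the group
    steps_per_episode: number of steps (turns) in each rollout
    eps:               assumed small constant to avoid division by zero
    """
    mean = statistics.fmean(episode_rewards)
    std = statistics.pstdev(episode_rewards)
    advantages = []
    for reward, n_steps in zip(episode_rewards, steps_per_episode):
        adv = (reward - mean) / (std + eps)
        # Every step in the rollout inherits the whole-episode signal.
        advantages.append([adv] * n_steps)
    return advantages


# Mixed group: two rollouts succeed, two fail.
mixed = episode_level_advantages([1.0, 0.0, 0.0, 1.0], [3, 2, 4, 3])
print(mixed[0], mixed[1])

# Sparse case: every rollout fails, so every advantage collapses to zero.
all_failed = episode_level_advantages([0.0, 0.0, 0.0], [2, 2, 2])
print(all_failed)
```

The second call illustrates the sparsity problem: when all rollouts in the group receive the same reward, the normalized advantage is zero everywhere and nothing is learned from that group.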

There's a deeper reason a scalar next-turn reward is structurally lossy. Agent feedback carries two orthogonal kinds of information: an *evaluative* signal (how good was that action) and a *directive* one (how should it change). A scalar reward captures the first and throws away the second, so even a well-shaped per-turn number can't tell the model which way to move next, only whether it did okay (see "Can scalar rewards capture all the information in agent feedback?"). That missing directional content is exactly what multi-turn progress needs.

The two-phase dynamic of RL training gives this a sharper edge. Across eight models, learning first masters *execution* correctness and only later hits a *strategic planning* bottleneck: planning-token entropy keeps rising while execution entropy stabilizes (see "Does RL training follow a predictable two-phase learning sequence?"). A next-turn reward is well suited to the first phase (was this step done right?) and nearly blind to the second (was this step part of a good plan?), so it plateaus precisely where multi-turn goal progress lives. None of this means multi-turn RL is hopeless: modified DAPO nearly doubled SWE-bench Verified performance, from 20% to 39%, in exactly these stateful, delayed-reward settings (see "Can reinforcement learning scale beyond single-turn language tasks?"). It got there by handling delayed credit, not by leaning harder on per-turn scoring.

Two adjacent failure modes are worth following if you want to go further. One is upstream: if the *user* or environment signal drifts across turns, the reward is corrupted before credit assignment even begins. Goal-state tracking addresses this by decomposing a goal into trackable sub-components to keep the signal coherent (see "Why do LLM user simulators fail to track their own goals?"). The other is the reward's own clarity: RL gains track how verifiable the reward is, so a fuzzy per-turn judgment barely moves the needle no matter how the horizon is framed (see "Why does RL succeed more on some tasks than others?"). Taken together, the corpus reframes the question: the problem isn't the *turn*, it's asking a single local scalar to carry information that is inherently global, directional, and strategic.


Sources (7 notes)

Can full episode rewards per step enable better credit assignment?

MS-GRPO assigns cumulative episode reward to each step, and group-relative normalization across rollouts surfaces which action sequences succeed. A 3B model post-trained this way outperforms 72B baselines by 50%, showing the training method matters more than scale for multi-step tasks.

Can step-wise expert rewards help small models learn hard reasoning?

Supervised Reinforcement Learning rewards models by measuring alignment with expert actions at each step, providing dense learning signals even when all rollouts fail. This approach bridges the gap between rigid token-by-token imitation (SFT) and sparse outcome-only rewards (RLVR), and works best as a curriculum foundation before outcome-based refinement.
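A dense step-wise reward of this kind might look like the sketch below. The note does not say how alignment with expert actions is measured, so the string-similarity metric here (`difflib.SequenceMatcher` from the Python standard library) is purely an assumption, as are the function name and the example actions.

```python
from difflib import SequenceMatcher


def stepwise_expert_rewards(model_actions, expert_actions):
    """Score each model action by its similarity to the expert action at that step.

    The similarity metric is an assumption for illustration; each score
    lies in [0.0, 1.0], with 1.0 meaning an exact match.
    """
    rewards = []
    for model_action, expert_action in zip(model_actions, expert_actions):
        similarity = SequenceMatcher(None, model_action, expert_action).ratio()
        rewards.append(similarity)
    return rewards


# Hypothetical two-step trajectory: the first step matches the expert and the
# second does not. The matching step still earns reward even if the episode
# as a whole would fail.
rewards = stepwise_expert_rewards(
    ["open file", "xyz"],
    ["open file", "run tests"],
)
print(rewards)
```

Because each step is scored on its own, a failed rollout still yields a non-trivial learning signal wherever it agreed with the expert.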

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Can reinforcement learning scale beyond single-turn language tasks?

Modified DAPO training nearly doubled SWE-bench Verified performance, from 20% to 39%, on Qwen2.5-72B, matching larger models. This demonstrates that RL works in stateful multi-step environments with delayed rewards and complex feedback, beyond theoretical single-turn MDPs.

Why do LLM user simulators fail to track their own goals?

The UGST framework breaks user goals into profile, policy, task, requirements, and preferences—each with explicit status tracking. A three-stage method (steering, SFT, GRPO) progressively internalizes goal alignment, reducing the misalignment that corrupts RL training signals.

Why does RL succeed more on some tasks than others?

Binary verifiable rewards enable dramatic RL gains (0.15% to 73.98%), while judgment-based evaluation yields modest improvements (55% reduction). Clear reward signals unlock suppressed capabilities; fuzzy signals barely move the needle.
