Can scalar rewards capture all the information in agent feedback?

Exploring whether numerical rewards alone can preserve both the evaluative judgment and directional guidance embedded in natural feedback—or if something crucial gets lost in the conversion.

Synthesis note · 2026-04-07 · sourced from Autonomous Agents

The OpenClaw-RL framework makes a decomposition that was implicit in prior agentic RL work but never formalized: when an agent acts and the environment responds, the response carries two distinct kinds of information. The evaluative signal scores the action — how well did it perform — and can be extracted as a scalar reward via a PRM judge. The directive signal specifies how the action should have been different — not just that it was wrong, but in what direction. These are orthogonal: high-quality directive information can accompany any evaluation, and scalar rewards systematically lose the directive component.

Consider a user who says "you should have checked the file first." The evaluative content is roughly -1 (the response was inadequate). The directive content is far more specific: check the file first, which identifies the part of the action that should have been different. A PRM judge can convert the sentiment into a scalar, but the correction itself collapses into a single number. Similarly, a detailed SWE error trace often implies a concrete correction direction that a scalar outcome reward cannot convey. Current RLVR methods operate on scalar rewards (Does RLVR actually expand what models can reason about?) and so have no way to turn directive information into a directional policy gradient. Distillation methods can process structured corrections, but they require pre-curated feedback-response pairs rather than live signals.
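
The loss of information can be made concrete with a small sketch. Everything here is illustrative: the `Feedback` container, the `to_scalar` projection, and the second correction ("run the tests before editing") are hypothetical and not taken from OpenClaw-RL. The point is only that two different corrections can map to the same reward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    """Hypothetical container for the two components of one piece of feedback."""
    evaluative: float  # scalar score, e.g. as a PRM judge might assign
    directive: str     # how the action should have been different


def to_scalar(feedback: Feedback) -> float:
    """Project feedback onto its evaluative component; the directive is dropped."""
    return feedback.evaluative


# The first correction is the example from the text; the second is invented.
check_file = Feedback(evaluative=-1.0, directive="check the file first")
run_tests = Feedback(evaluative=-1.0, directive="run the tests before editing")

# Both corrections collapse to the same reward, so a learner that only sees
# the scalar cannot tell them apart.
same_reward = to_scalar(check_file) == to_scalar(run_tests)
same_directive = check_file.directive == run_tests.directive
print(same_reward, same_directive)
```

Because `to_scalar` is many-to-one, no downstream optimizer can recover which correction was given, however the scalar is later used.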

OpenClaw-RL recovers the directive signal through Hindsight-Guided On-Policy Distillation (OPD): extract textual hints from the next state, construct an enhanced teacher context by injecting those hints, and distill token-level directional advantage back into the student policy. This is richer than any scalar reward because it teaches the model not just "that was wrong" but "here is what right looks like in these specific tokens." The empirical result — combining binary PRM-based RL with OPD via weighted loss yields significant gains over either alone — confirms the two signals are complementary, not redundant.
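
The three steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the framework's implementation: the per-token advantage is assumed to be the log-probability gap between a hint-conditioned teacher and the student on the student's own sampled tokens, the RL term is a plain REINFORCE-style loss, and the function names and the weights `w_rl` and `w_opd` are hypothetical. Log-probabilities are passed in as plain lists so the example runs without a model.

```python
import math
from typing import List, Sequence


def build_teacher_context(prompt: str, hint: str) -> str:
    """Inject a textual hint extracted from the next state (format is assumed)."""
    return f"{prompt}\n[hint from next state] {hint}"


def token_advantages(teacher_logprobs: Sequence[float],
                     student_logprobs: Sequence[float]) -> List[float]:
    """Assumed form of the token-level signal: teacher minus student log-prob."""
    if len(teacher_logprobs) != len(student_logprobs):
        raise ValueError("teacher and student must score the same tokens")
    return [t - s for t, s in zip(teacher_logprobs, student_logprobs)]


def combined_loss(student_logprobs: Sequence[float],
                  scalar_reward: float,
                  advantages: Sequence[float],
                  w_rl: float = 1.0,
                  w_opd: float = 1.0) -> float:
    """Weighted sum of a scalar-reward term and a token-level distillation term."""
    # Evaluative term: one reward spread uniformly over every token.
    rl = -scalar_reward * sum(student_logprobs)
    # Directive term: each token gets its own advantage (treated as a constant).
    opd = -sum(a * lp for a, lp in zip(advantages, student_logprobs))
    return w_rl * rl + w_opd * opd


# Toy two-token response: the hinted teacher agrees on token 0 and
# strongly prefers token 1 relative to the student.
student = [math.log(0.5), math.log(0.1)]
teacher = [math.log(0.5), math.log(0.8)]
adv = token_advantages(teacher, student)
loss = combined_loss(student, scalar_reward=-1.0, advantages=adv)
```

In this toy case the scalar term pushes both tokens down equally, while the distillation term leaves token 0 alone and acts only on token 1. That difference in credit assignment is what the text means by the two signals being complementary.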

This decomposition matters beyond OpenClaw-RL because it clarifies a conceptual muddle in agentic RL. When people debate "should we use outcome rewards or process rewards, scalar or verbal," the answer is usually "both, decomposed properly." The outcome-vs-process trade-off (Why do outcome-based reward models fail at intermediate step evaluation?) assumes a single signal type. The scalar-vs-verbal distinction is treated as architectural (Can natural language feedback overcome numerical reward plateaus?). OpenClaw-RL reframes both debates: feedback is one signal with two projections, evaluative (dense scalar) and directive (token-level), and a training loop can use each for what it carries.

The generalization: any learning loop that reduces natural feedback to scalars is discarding the fraction of training signal that most resembles supervised learning. A corrective sentence contains its own teacher.

Related concepts in this collection 6

Can agent deployment itself generate training signals automatically? Can we extract learning signals from the natural next-states that agents encounter during real deployment—user replies, tool outputs, test verdicts—rather than relying on separate annotation pipelines? This reframes how agents improve continuously.
the framing this decomposition operates within
Can natural language feedback overcome numerical reward plateaus? Exploring whether chain-of-thought critiques can push past performance ceilings that scaling data alone cannot break in reinforcement learning for reasoning tasks.
establishes that verbal feedback contains information scalars cannot reach
Does binary reward training hurt model calibration? Explores whether the standard correctness-based reward in RL training creates incentives for overconfident predictions, and what structural problem causes calibration to degrade during optimization.
another case where single-scalar objectives miss structure
Why do outcome-based reward models fail at intermediate step evaluation? Outcome-based reward models (ORMs) evaluate only final results, creating a mismatch with the need to assess reasoning quality at intermediate steps. Understanding this failure mode matters for building better AI reasoning systems.
the outcome/process axis is the wrong cut; evaluative/directive is closer to the information structure
Does critiquing errors teach deeper understanding than imitating correct answers? Can training models to critique flawed responses build better structural understanding than standard supervised fine-tuning on correct answers? This matters because it reveals whether deep reasoning requires engaging with failure modes rather than pattern matching.
critique-based training as a cousin: teaching the model the directive structure behind errors
Does RLVR actually expand what models can reason about? Explores whether reinforcement learning from verifiable rewards teaches models genuinely new reasoning skills or simply makes existing capabilities more reliable. Pass@k analysis suggests the latter.
scalar RLVR's structural ceiling that directive signals may penetrate
