Can scalar rewards capture all the information in agent feedback?
Exploring whether numerical rewards alone can preserve both the evaluative judgment and directional guidance embedded in natural feedback—or if something crucial gets lost in the conversion.
The OpenClaw-RL framework makes a decomposition that was implicit in prior agentic RL work but never formalized: when an agent acts and the environment responds, the response carries two distinct kinds of information. The evaluative signal scores the action — how well did it perform — and can be extracted as a scalar reward via a PRM judge. The directive signal specifies how the action should have been different — not just that it was wrong, but in what direction. These are orthogonal: high-quality directive information can accompany any evaluation, and scalar rewards systematically lose the directive component.
Consider a user who says "you should have checked the file first." The evaluative content is approximately -1 (the response was inadequate). The directive content is specific down to the tokens that should have been generated: check the file first. A PRM judge can convert the sentiment into a scalar, but the correction itself collapses into a single number. Similarly, a detailed SWE error trace often implies a concrete correction direction that scalar outcome rewards cannot convey. Current RLVR methods operate on scalar rewards (see: Does RLVR actually expand what models can reason about?) and cannot convert directive information into a directional policy gradient. Distillation methods can process structured corrections but require pre-curated feedback-response pairs rather than live signals.
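The loss described above can be made concrete with a small sketch. It is illustrative only: `NextStateSignal` and `to_scalar_reward` are hypothetical names, not part of OpenClaw-RL, and the second feedback string is an invented example. Two pieces of feedback that differ in their directive become indistinguishable once projected onto a scalar.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NextStateSignal:
    """Hypothetical container for the two components of one piece of feedback."""
    evaluative: float          # scalar score, e.g. as a PRM judge might assign
    directive: Optional[str]   # textual correction, if the feedback contains one


def to_scalar_reward(signal: NextStateSignal) -> float:
    """Project feedback onto the scalar axis; the directive is dropped."""
    return signal.evaluative


# Same evaluation (-1), different corrections. The second string is invented.
check_file = NextStateSignal(-1.0, "check the file first")
run_tests = NextStateSignal(-1.0, "run the tests before committing")

same_reward = to_scalar_reward(check_file) == to_scalar_reward(run_tests)
same_feedback = check_file == run_tests
print(same_reward, same_feedback)  # True False
```

A learner that only ever sees `to_scalar_reward(...)` receives identical training signal for both cases, even though the two corrections point in different directions.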
OpenClaw-RL recovers the directive signal through Hindsight-Guided On-Policy Distillation (OPD): extract textual hints from the next state, construct an enhanced teacher context by injecting those hints, and distill token-level directional advantage back into the student policy. This is richer than any scalar reward because it teaches the model not just "that was wrong" but "here is what right looks like in these specific tokens." The empirical result — combining binary PRM-based RL with OPD via weighted loss yields significant gains over either alone — confirms the two signals are complementary, not redundant.
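A minimal sketch of the three steps above, under stated assumptions. The function names, the hint format, and the log-probability values are all hypothetical. In particular, the per-token advantage is assumed here to be the log-probability gap between the hint-conditioned teacher and the student, which is one plausible reading of "token-level directional advantage" and may not match the paper's exact definition.

```python
import math
from typing import List


def build_teacher_context(prompt: str, hint: str) -> str:
    """Inject a hint extracted from the next state (format is an assumption)."""
    return f"{prompt}\n[hint] {hint}"


def token_advantages(teacher_logp: List[float], student_logp: List[float]) -> List[float]:
    """Per-token gap between the hint-conditioned teacher and the student.

    Positive: the teacher, having seen the hint, prefers this token more than
    the student did. Negative: the teacher prefers it less.
    """
    if len(teacher_logp) != len(student_logp):
        raise ValueError("log-prob sequences must align token by token")
    return [t - s for t, s in zip(teacher_logp, student_logp)]


def combined_loss(rl_loss: float, opd_loss: float,
                  w_rl: float = 1.0, w_opd: float = 1.0) -> float:
    """Weighted sum of the scalar-reward RL loss and the distillation loss."""
    return w_rl * rl_loss + w_opd * opd_loss


context = build_teacher_context("Fix the failing build.", "check the file first")

# Toy log-probs for two tokens of the student's original response.
teacher = [math.log(0.9), math.log(0.2)]
student = [math.log(0.3), math.log(0.6)]
adv = token_advantages(teacher, student)

total = combined_loss(rl_loss=2.0, opd_loss=4.0, w_rl=0.5, w_opd=0.25)
```

In this toy case the first token gets a positive advantage and the second a negative one, whereas a single scalar reward would push both tokens in the same direction. The weights `w_rl` and `w_opd` stand in for whatever weighting the combined objective uses; their values here are arbitrary.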
This decomposition matters beyond OpenClaw-RL because it clarifies a conceptual muddle in agentic RL. When people debate "should we use outcome rewards or process rewards, scalar or verbal," the answer is usually "both, decomposed properly." The outcome-vs-process trade-off (Why do outcome-based reward models fail at intermediate step evaluation?) assumes a single signal type. The scalar-vs-verbal distinction (Can natural language feedback overcome numerical reward plateaus?) is treated as architectural. OpenClaw-RL reframes scalar and verbal feedback as two projections of one signal: evaluative (dense scalar) and directive (token-level).
The generalization: any learning loop that reduces natural feedback to scalars is discarding the fraction of training signal that most resembles supervised learning. A corrective sentence contains its own teacher.
Inquiring lines that use this note as a source (145)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Do explicit reward structures enable AI agent cooperation that open-ended interaction cannot?
- What cognitive capabilities do agents need to internalize social feedback?
- Can unified policies handle negative feedback and critique transformation simultaneously?
- Can systems recognize and abstain on judgments rather than hallucinating preferences?
- How does unidimensionality in assessments affect measurement validity?
- How does credit assignment drive agents to write information into environments?
- Why does binary reward forcing degrade model calibration?
- Do spurious rewards activate reasoning without teaching new skills?
- How does RLHF reward structure incentivize agreement over accuracy?
- Why do weak belief tracking and conservative actions trap agents in low-information states?
- How do outcome and process rewards differ in their treatment of intermediate steps?
- Can distillation methods extract directional guidance that scalar RL cannot access?
- How can consistency across measurement conditions identify genuine versus constructed preferences?
- Does in-distribution reward model performance hide failures from context shift?
- How does partial information exposure create feedback loops that deepen knowledge gaps?
- How do reward model ensembles improve robustness to miscalibration?
- Can importance sampling reduce variance in off-policy reward estimation?
- Can reward engineering and information-theoretic architecture solve partner-awareness separately?
- What makes trajectory more actionable than absolute scores for human moderators?
- Can multi-turn rewards fix models that lose track midway?
- Why does natural language feedback break performance plateaus that numerical rewards alone cannot?
- Can solution traces substitute for process-level reward signals in math reasoning?
- Can reward model training be automated without changing feedback mechanisms?
- What information do next-state signals contain beyond what scalar rewards capture?
- Do outcome-only reward signals miss step-level errors that compound later?
- Do agents prefer raw experience over condensed summaries of past actions?
- How do implicit signals like clicks capture preference more reliably than explicit ratings?
- How do graduated phase rewards emerge complex dialogue behavior from simple objectives?
- What makes process-level supervision better than outcome-only reward signals?
- What are the ten intrinsic motivation heuristics that drive participation decisions?
- Can curiosity rewards about user type complement general social motivation frameworks?
- Can subjective tasks be delegated without human feedback loops?
- How does process-focused feedback compare to outcome-focused feedback in skill training?
- Does social scaffolding outperform purely intrinsic motivation for agent exploration?
- How does modularity in reward and policy design enable goal generalization?
- At what capability level does the generation-verification gap make intrinsic rewards insufficient?
- How does reward model training permit spurious correlations in scoring?
- Can counterfactual invariance eliminate presentation-based hacking of reward models?
- Can reward models trained for engagement fix the informativeness problem?
- What distinguishes verifiable rewards from preference-based rewards in unified training?
- How does implicit feedback structure differ from explicit ratings mathematically?
- How do semantic reward shaping approaches compare to full critique models?
- What information do numerical rewards fail to provide for reasoning tasks?
- How does negative reinforcement redistribute probability without guiding toward correct answers?
- Is elaborate reward shaping necessary if the pretrained prior already contains good solutions?
- How do evaluative versus directive signals differ in next-state training?
- How do process-level rewards compare to environment-extracted next-state signals?
- Can architectural changes like decoupling intent understanding help overcome next-turn reward limitations?
- Why do generative reward models produce more interpretable evaluations than scalar scores?
- Can UCB-style bonuses over outcome space prevent policy entropy collapse?
- How does textual-only feedback limit what a persona can learn about users?
- Can model confidence signals replace explicit external reward functions?
- Why do reward models fail when they ignore the prompt context?
- Can negative feedback through critiques achieve the same steering flexibility as positive preferences?
- How do reward model biases cascade into downstream optimization failures?
- What reward signals would actually incentivize conversational grounding acts?
- Why do next-turn reward objectives fail to encourage multi-turn goal progress?
- How does asymmetric information between users and agents relate to proactivity?
- How can reward structures teach models when to speak and when to stay silent?
- Can programmatic meta-reasoning rewards operationalize agentic process supervision?
- What information-theoretic framework explains why process rewards beat outcome only?
- Can agents revise their beliefs predictably when presented with interventions?
- What preference dimensions do base reward functions typically capture?
- Can we distinguish between genuine alignment and response quality bias in reward signals?
- What distinguishes generative reward models from outcome-based and process-based approaches?
- How do task-type perceptions like chat versus reasoning guide different reward strategies?
- What design changes if we separate behavior description from adoption justification goals?
- How do reward models benefit from extended thinking during evaluation scoring?
- How does next-turn reward optimization contribute to agent passivity?
- Can structured natural language feedback outperform scalar rewards in RL?
- Why do agents fail to internalize value from informative observations?
- Can alternative reward functions shift LLMs from problem-solving to genuinely empathic responses?
- Why do spurious rewards work nearly as well as correct ones?
- How do confidence signals differ between implicit feedback and explicit ratings?
- How much actionable detail does condensation strip from raw experience?
- How do outcome-based and process-based reward models differ in supervision cost?
- Can multi-turn aware rewards improve alignment beyond single-turn helpfulness?
- What makes Effective Rank Acceleration a stable training signal for dual-channel incentives?
- Can decomposing rewards into prompt-free and prompt-related components fix this blindspot?
- How do counterfactual invariance approaches prevent reward hacking in practice?
- What separates bootstrapping gains from sustained self-improvement gains?
- Can reward design fix the conflict between reasoning accuracy and abstention calibration?
- Can behavior-level emotion rewards maintain factual reliability in emotional contexts?
- What deployment modes work best for trajectory-aware reward signals?
- Can reward factorization represent trade-offs between conflicting moral values?
- Why do completion-mode strengths not transfer to agentic settings?
- How do delayed effects complicate causal attribution in agent systems?
- How does reinforcement learning on outcomes reinforce template-matching rather than computation?
- How do reward features learned from group data generalize to new users?
- Can environmental rewards directly refine natural language descriptions of actions?
- How does information asymmetry between teacher and student create the learning signal?
- How do composite rewards attribute curation outcomes to specific skill library changes?
- Why do sparse outcome rewards fail to credit correct tool calls in failed trajectories?
- What reward mechanisms make thinking-based compression budget-controllable and reliable?
- How does belief-shift reward compare to curiosity-driven and process reward approaches?
- Can an agent's internal probabilities serve as value signals across domains?
- How do checklists prevent reward models from exploiting superficial response artifacts?
- How do human-agent systems incorporate diverse feedback into model behavior?
- How do you prevent stale reward signals when skills evolve during deployment?
- Does self-play feedback improve skills created from the agent's own experience?
- How do reward models as policy discriminators differ from labeled preferences?
- What makes exploration and reflection rewards verifiable in agentic environments?
- Why does scalarization of rewards fail for multi-objective GRPO training?
- How does credit assignment across objectives differ from credit assignment across time?
- Can vector-valued rewards preserve specialization better than variance-weighted advantages?
- How should multi-objective post-training balance competing behavioral goals?
- How does tree-search topology convert outcome rewards into intermediate supervision?
- What other downstream metrics could serve as RL reward sources?
- Can user preferences be represented as linear reward combinations?
- How do you extract reward signals when all rollouts fail?
- Can reward models distinguish between personal preference and community consensus?
- Can verifiable rewards during pretraining replace costly human preference labeling?
- How do relational reward signals compare to absolute preference encodings in RL?
- Do personalized reward models work better than one-size-fits-all approaches?
- How does in-context feedback integration differ from learned reward signals?
- Can early experience replace external rewards as a learning signal?
- Why do veto mechanisms on critical dimensions prevent collapse into exploitable reward modes?
- How do token-level rewards and rubric gates serve different statistical functions?
- Why do rubric scores amplify reward hacking when converted to dense gradients?
- Why does information asymmetry between teacher and student enable effective feedback learning?
- What other adaptive internal phenomena could signal system behavior improvements?
- Can structured rewards still teach models when spurious rewards also work?
- What role does task structure play in rewarding delayed thinking?
- Can tree-GRPO work with extremely noisy or sparse outcome reward signals?
- What makes reward models fundamentally different from policy discriminators?
- What makes binary rewards more effective than richer reward signals?
- How does DVAO balance reward components differently than VPO spreads them?
- When does a task lack a meaningful multi-dimensional reward structure?
- Can rich environment feedback replace human preference labels entirely?
- How does belief-shift credit assignment compare to process reward models?
- What alignment properties emerge when the reward model disappears?
- Does pairwise self-judgment avoid reward model scaling problems?
- What makes reward signal sources substitutable across verifier-free RL patterns?
- How do aggregate reward models systematically exclude minority perspectives?
- Why does externalizing bookkeeping raise effective feedback compute?
- What makes user-decision rewards better than model-confidence rewards?
- What hidden signals in agent logs reveal about frontier capability beyond pass-fail outcomes?
- Do process reward models need different supervision strategies by domain?
- Can trajectory structure replace hand-annotated process reward models entirely?
- Does the generation-verification gap define where self-rewarding actually works?
- Can agents escape weak belief tracking and conservative action selection traps?
- How does process-based reward differ from outcome-only reward in training?
- Do information gathering and task execution require different incentive structures?
- What makes advantage shaping more stable than reward shaping for tool training?
- How do aggregate reward models systematically exclude minority preferences?
Related concepts in this collection (6)
- Can agent deployment itself generate training signals automatically?
  Can we extract learning signals from the natural next-states that agents encounter during real deployment (user replies, tool outputs, test verdicts) rather than relying on separate annotation pipelines? This reframes how agents improve continuously.
  the framing this decomposition operates within
- Can natural language feedback overcome numerical reward plateaus?
  Exploring whether chain-of-thought critiques can push past performance ceilings that scaling data alone cannot break in reinforcement learning for reasoning tasks.
  establishes that verbal feedback contains information scalars cannot reach
- Does binary reward training hurt model calibration?
  Explores whether the standard correctness-based reward in RL training creates incentives for overconfident predictions, and what structural problem causes calibration to degrade during optimization.
  another case where single-scalar objectives miss structure
- Why do outcome-based reward models fail at intermediate step evaluation?
  Outcome-based reward models (ORMs) evaluate only final results, creating a mismatch with the need to assess reasoning quality at intermediate steps. Understanding this failure mode matters for building better AI reasoning systems.
  the outcome/process axis is the wrong cut; evaluative/directive is closer to the information structure
- Does critiquing errors teach deeper understanding than imitating correct answers?
  Can training models to critique flawed responses build better structural understanding than standard supervised fine-tuning on correct answers? This matters because it reveals whether deep reasoning requires engaging with failure modes rather than pattern matching.
  critique-based training as a cousin: teaching the model the directive structure behind errors
- Does RLVR actually expand what models can reason about?
  Explores whether reinforcement learning from verifiable rewards teaches models genuinely new reasoning skills or simply makes existing capabilities more reliable. Pass@k analysis suggests the latter.
  scalar RLVR's structural ceiling that directive signals may penetrate
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Reward Reasoning Model
- OpenClaw-RL: Train Any Agent Simply by Talking
- A Survey of Reinforcement Learning from Human Feedback
- Reinforcement Learning via Self-Distillation
- Information-Theoretic Reward Decomposition for Generalizable RLHF
- Can Large Language Models Reason and Optimize Under Constraints?
- Foundations of Large Language Models
- Online Intrinsic Rewards for Decision Making Agents from Large Language Model Feedback
Original note title
agent next-state signals decompose into evaluative and directive information that scalar rewards cannot jointly capture