INQUIRING LINE

What other downstream metrics could serve as RL reward sources?

This explores what signals beyond plain right-or-wrong correctness can drive reinforcement learning — the corpus turns out to hold a surprising range of alternative reward sources, each fixing a different blind spot of binary rewards.


This reads the question as: if 'did the model get it right?' is the default RL reward, what *other* measurable signals could we reward instead — and the corpus has a richer menu than you'd expect. The throughline across these notes is that binary correctness is not just incomplete, it's actively harmful in specific ways, which opens the door to reward sources that target what it misses.

The sharpest example: rewarding correctness alone teaches models to guess confidently, because a confident wrong answer is penalized no more than a hesitant one. Adding the **Brier score** (a measure of how well-calibrated the model's confidence is) as a second reward term provably fixes this without trading away accuracy (see "Does binary reward training hurt model calibration?"). So calibration itself is a downstream metric you can reward. In a similar spirit of reusing signals already lying around, **cross-rollout variance** (how much a model's multiple attempts at the same prompt disagree) can serve simultaneously as a token-level reward and as a filter for throwing out degenerate prompts, all without any human labels (see "Can one statistical measure serve dual purposes in RL training?").
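To make the calibration idea concrete, here is a minimal sketch of a reward that combines a correctness term with a Brier penalty on the model's stated confidence. The notes do not give the exact formula, so the form used here (correctness minus the squared gap between confidence and outcome) and the function name `calibrated_reward` are assumptions for illustration only.

```python
def calibrated_reward(is_correct: bool, confidence: float) -> float:
    """Correctness reward minus a Brier penalty on the stated confidence.

    Assumed form (not taken from the notes): R = y - (q - y)^2, where
    y is 1 for a correct answer and 0 otherwise, and q is the model's
    self-reported confidence in [0, 1].
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    y = 1.0 if is_correct else 0.0
    brier_penalty = (confidence - y) ** 2
    return y - brier_penalty


# A confident wrong answer now scores worse than a hesitant wrong answer.
confident_wrong = calibrated_reward(False, 1.0)
hesitant_wrong = calibrated_reward(False, 0.5)
```

Under a plain binary reward both wrong answers would receive the same score; with the penalty term, overconfidence on a wrong answer costs more, which is the blind spot the paragraph above describes.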

A second family abandons the scalar number entirely. A plain reward captures *how well* an action did but discards *how it should change*. Feedback actually carries two orthogonal channels, evaluative and directive, and the directive part is recoverable through token-level distillation rather than a single score (see "Can scalar rewards capture all the information in agent feedback?"). **Natural-language critiques** push this further: models stuck on a numerical-reward plateau start solving problems once given chain-of-thought feedback about *why* they failed, a signal that no scalar can encode (see "Can natural language feedback overcome numerical reward plateaus?"). And **reward models that reason before scoring** raise the evaluation ceiling by spending test-time compute on a critique trace rather than emitting an outcome judgment directly (see "Can reward models benefit from reasoning before scoring?").

A third source is the *structure of the trajectory itself*. Instead of a single end-of-episode reward, you can mine dense step-level signals from how the rollout is shaped (tree-branching topology, tool-call positions, expert-aligned actions), converting sparse outcome rewards into process supervision with no annotated process reward model at all (see "Can trajectory structure replace hand-annotated process rewards?" and "Can tree structure alone convert outcome rewards into process supervision?"). Rubrics offer yet another twist: they work better as *gates* that accept or reject whole rollout groups than as dense scores to optimize, which sidesteps reward hacking (see "Can rubrics and dense rewards work together without hacking?"). And you can even have an **LLM design the reward function**, by first solving a simplified deterministic version of the task and converting that plan into shaping rewards (see "Can LLMs design reward functions for reinforcement learning?").
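As a rough illustration of how tree structure alone can yield step-level signals, the sketch below scores each branch by the mean outcome reward of the leaves beneath it and compares it against its siblings. This is a simplified stand-in, not the Tree-GRPO algorithm itself; the `Node` class, the helper names, and the mean-of-siblings baseline are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class Node:
    """One step in a rollout tree; only leaves carry an outcome reward."""
    name: str
    children: List["Node"] = field(default_factory=list)
    outcome: Optional[float] = None


def leaf_outcomes(node: Node) -> List[float]:
    """Collect the outcome rewards of every leaf under this node."""
    if not node.children:
        if node.outcome is None:
            raise ValueError(f"leaf {node.name!r} has no outcome reward")
        return [node.outcome]
    collected: List[float] = []
    for child in node.children:
        collected.extend(leaf_outcomes(child))
    return collected


def step_advantages(root: Node) -> Dict[str, float]:
    """Score each child relative to the mean value of its sibling group."""
    advantages: Dict[str, float] = {}
    stack = [root]
    while stack:
        node = stack.pop()
        if not node.children:
            continue
        values = [mean(leaf_outcomes(child)) for child in node.children]
        baseline = mean(values)
        for child, value in zip(node.children, values):
            advantages[child.name] = value - baseline
        stack.extend(node.children)
    return advantages


# Hypothetical tree: branch "a" always succeeds, branch "b" succeeds half the time.
tree = Node("root", [
    Node("a", [Node("a1", outcome=1.0), Node("a2", outcome=1.0)]),
    Node("b", [Node("b1", outcome=1.0), Node("b2", outcome=0.0)]),
])
adv = step_advantages(tree)
```

Only end-of-episode outcomes are supplied, yet every intermediate step receives a signed score from the comparison with its siblings, which is the sense in which sparse outcome rewards become process supervision without step annotations.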

The thing worth carrying away: the choice of reward source quietly decides what RL can even do. Several notes find that verifiable-reward RL mostly sharpens sampling of abilities the base model already had rather than expanding its reasoning frontier (see "Does RLVR actually expand what models can reason about?" and "What does reward learning actually do to model reasoning?"). So if you want RL to teach genuinely new behavior rather than just re-weight old behavior, the richer, more informative reward sources above (directive feedback, critiques, calibration) aren't a luxury, they're the lever.
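The "sharpening versus expanding" claim rests on the pass@k analysis cited in the source notes: the probability that at least one of k samples solves a problem. The notes do not say which estimator was used; the sketch below assumes the standard unbiased estimator computed from n samples of which c are correct, and the function name `pass_at_k` is chosen here for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples with c correct ones.

    Computes 1 - C(n - c, k) / C(n, k): one minus the chance that a
    random size-k subset of the n samples contains no correct answer.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# One correct answer in ten samples: rarely found at k=1, found half the time at k=5.
low_k = pass_at_k(10, 1, 1)
high_k = pass_at_k(10, 1, 5)
```

Comparing a base model and its RL-tuned version at both small and large k is what lets the cited notes separate "samples good answers more often" from "can solve more problems at all".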


Sources (11 notes)

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Can one statistical measure serve dual purposes in RL training?

DRO reuses a single self-supervised statistic at two aggregation levels: token-level weighting in dense rewards and query-level filtering to discard degenerate comparisons. This dual use achieves 2–3× faster training with better stability on unverifiable tasks.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Can reward models benefit from reasoning before scoring?

Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.

Can trajectory structure replace hand-annotated process rewards?

Tree-GRPO, Supervised RL, and ToolPO each convert sparse outcome rewards into dense step signals by exploiting different structural features—tree topology, expert-aligned actions, and tool-call positions—eliminating the need for annotated process reward models.

Can tree structure alone convert outcome rewards into process supervision?

Tree-GRPO uses branching structure to transform trajectory-level outcome rewards into step-level preference signals through sibling subtree comparison, eliminating the need for separate process reward models or step-level annotation while scaling with computational budget.

Can rubrics and dense rewards work together without hacking?

DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.

Can LLMs design reward functions for reinforcement learning?

MEDIC shows that LLMs can generate effective reward shaping functions by first solving a deterministic, simplified version of the RL problem, then converting the resulting plan into shaping rewards for the original stochastic task. A model-based critic validates LLM outputs before deployment.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an RL-for-LLMs researcher auditing reward-source innovation. The question remains: what downstream metrics beyond binary correctness can ground RL training?

What a curated library found — and when (dated claims, not current truth):
Findings span May 2024–October 2025. The library identified:
- Binary correctness alone degrades calibration; Brier score as a second reward term recovers it without accuracy loss (~2024).
- Cross-rollout variance serves simultaneously as token-level reward and degenerate-prompt filter, no human labels needed (~2024).
- Natural-language critiques break RL plateaus where numerical rewards stall; chain-of-thought feedback encodes directives that scalars cannot (~2025).
- Reward models that reason before scoring extend test-time compute scaling to evaluation itself (~2025).
- Process supervision can be derived from trajectory structure (tree topology, tool calls, expert alignment) without annotated process reward models (~2025).
- Verifiable-reward RL mostly sharpens existing base-model capabilities rather than expanding reasoning boundaries; richer reward sources (directive feedback, calibration) are necessary to teach genuinely new behavior (~2025).

Anchor papers (verify; mind their dates):
- arXiv:2504.13837 (Apr 2025) – Does RL expand reasoning beyond base model?
- arXiv:2505.14674 (May 2025) – Reward Reasoning Model
- arXiv:2506.13351 (Jun 2025) – Direct Reasoning Optimization: rubric gates + token-level reflection
- arXiv:2507.14843 (Jul 2025) – RLVR capability boundaries

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above, judge whether models trained in late 2025 or 2026, new scaling paradigms (mixture-of-rewards, hierarchical reward composition), or emerging tooling (reward model ensembles, dynamic reward weighting) have relaxed or overturned it. Separate the durable question (what *kinds* of signals teach new reasoning?) from perishable limits (e.g., 'reward models cannot reason' — now falsified by reasoning-based RMs). Cite what resolved each.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — especially any showing that simple numerical rewards *do* scale reasoning, or that directive feedback is unnecessary.
(3) Propose 2 research questions that ASSUME the regime has shifted: one on multi-objective reward composition, one on whether reward expressiveness (not just source) is the true bottleneck.

Cite arXiv IDs; flag anything you cannot ground in a real paper.
