SYNTHESIS NOTE

Can we reward reasoning steps without human annotation?

Existing RL for reasoning uses only final-answer rewards, causing models to produce wastefully long chains. Can information theory provide dense, automatic feedback for individual reasoning steps?

Synthesis note · 2026-02-22 · sourced from Reasoning o1 o3 Search

"Learning to Think" (L2T) addresses the dense process reward problem — how to evaluate the contribution of individual reasoning steps without human annotation or task-specific evaluators — through information theory.

The key problem: existing RL methods for reasoning use only final outcome rewards. Under this sparse feedback, extending the chain incurs no cost. Even a tiny accuracy gain from many extra steps registers as a positive signal. Models develop a "one more thought" bias, consuming more than double the tokens actually needed for correct answers. On simple tasks (e.g., "12 + 5"), overly long chains can reduce accuracy — the redundant computation is not just wasteful but actively harmful.
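A toy calculation makes the missing length cost concrete. This is a minimal sketch: the two policies, their token counts, and their accuracies are hypothetical illustration values, not figures from the L2T paper.

```python
def outcome_reward(is_correct: bool) -> float:
    """Sparse outcome-only reward: depends on the final answer, never on chain length."""
    return 1.0 if is_correct else 0.0


# Hypothetical policies (illustrative numbers only): mean tokens per answer and accuracy.
short_policy = {"tokens": 200, "accuracy": 0.80}
long_policy = {"tokens": 450, "accuracy": 0.81}


def expected_reward(policy: dict) -> float:
    # The expected outcome reward reduces to accuracy; the token count never enters.
    acc = policy["accuracy"]
    return acc * outcome_reward(True) + (1.0 - acc) * outcome_reward(False)


advantage = expected_reward(long_policy) - expected_reward(short_policy)
extra_tokens = long_policy["tokens"] - short_policy["tokens"]
print(f"extra tokens: {extra_tokens}, reward advantage: {advantage:+.3f}")
```

Under this reward the longer policy is strictly preferred despite using more than twice the tokens, because nothing in the objective charges for the extra steps.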

L2T proposes a universal dense process reward with two components:

Fitting information gain: quantifies how much each reasoning episode contributes to capturing correctness-critical information in the model's parameters
Compression penalty: discourages excessive optimization, preserving efficiency

The reward is estimated via PAC-Bayes bounds and the Fisher information matrix, providing a tractable approximation with theoretical guarantees. Each query-response interaction is treated as a hierarchical session of multiple episodes. Upon each episode's completion, the reward is immediately computed — no waiting for the final answer.
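The shape of such a per-episode reward can be sketched as follows. This is not the paper's estimator: the note does not give the formula, so the sketch assumes (1) the gain is the change in log-probability of the correct answer across an episode, and (2) the penalty is the standard second-order approximation of a KL divergence, 0.5 · δᵀFδ, with a diagonal Fisher matrix. The function name, the inputs, and the weight `beta` are all hypothetical.

```python
def episode_rewards(log_p_correct, param_deltas, fisher_diag, beta=0.1):
    """Toy dense reward: information gain minus a Fisher-weighted penalty.

    log_p_correct: K+1 values, the log-probability of the correct answer before
        the first episode and after each of K episodes (assumed measurable).
    param_deltas: K lists of length D, the parameter change attributed to each
        episode (an assumption made for illustration).
    fisher_diag: D values, a diagonal approximation of the Fisher information.
    beta: hypothetical weight on the compression penalty.
    """
    rewards = []
    for k, delta in enumerate(param_deltas):
        # Gain: how much this episode raised the log-probability of the correct answer.
        gain = log_p_correct[k + 1] - log_p_correct[k]
        # Penalty: 0.5 * delta^T F delta with diagonal F.
        penalty = 0.5 * sum(f * d * d for f, d in zip(fisher_diag, delta))
        rewards.append(gain - beta * penalty)
    return rewards


# Three episodes; the last one changes the parameters without adding information.
rewards = episode_rewards(
    log_p_correct=[-2.0, -1.0, -0.5, -0.5],
    param_deltas=[[0.1, 0.0], [0.1, 0.1], [0.2, 0.2]],
    fisher_diag=[1.0, 4.0],
    beta=0.1,
)
print(rewards)
```

Each reward is available as soon as its episode ends, and the third episode scores below zero because it pays the penalty without earning any gain.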

This positions L2T as a third option in the ORM/PRM taxonomy. ORMs provide sparse, outcome-only feedback (cheap but uninformative about intermediate steps). PRMs provide dense step-level feedback (informative but dependent on expensive annotation). L2T provides dense information-theoretic feedback (informative and annotation-free), at the cost of computational overhead for Fisher information estimation.

The principle that dense process rewards outperform outcome-only signals extends beyond reasoning chains to agentic systems. The note "Does supervising retrieval steps outperform final answer rewards?" reports the same finding in agentic RAG, where step-level retrieval rewards substantially improve search agent training over final-answer-only rewards.

The task-dependence finding matters: moderate chain extensions improve coverage of critical steps on hard problems (Tier 4 multi-stage math), while the same extensions reduce accuracy on simple problems (Tier 1). No fixed chain length is optimal across tasks. This reinforces the note "Can we allocate inference compute based on prompt difficulty?": the budget must be adaptive, and L2T provides the per-episode signal that enables the adaptation.
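One way to picture how a per-episode signal supports an adaptive budget is a simple stopping rule. L2T uses the signal as a training reward; the rule below is a hypothetical illustration, and the function name, the threshold, the `patience` parameter, and the reward values are assumptions rather than anything taken from the paper.

```python
def episodes_to_keep(rewards, min_reward=0.0, patience=1):
    """Count the episodes worth running before the dense reward stops paying off.

    Stops once `patience` consecutive episodes score at or below `min_reward`
    and returns the number of episodes before that unproductive streak.
    """
    low_streak = 0
    for k, r in enumerate(rewards):
        # Extend the streak on an unproductive episode, reset it otherwise.
        low_streak = low_streak + 1 if r <= min_reward else 0
        if low_streak >= patience:
            return k + 1 - low_streak
    return len(rewards)


# Hypothetical per-episode rewards for an easy and a hard prompt.
easy_prompt = [0.9, -0.01, -0.02]
hard_prompt = [0.3, 0.25, 0.2, 0.1, -0.05]

easy_budget = episodes_to_keep(easy_prompt)
hard_budget = episodes_to_keep(hard_prompt)
print(easy_budget, hard_budget)
```

The same rule yields a short budget for the easy prompt and a longer one for the hard prompt, which is the behavior a fixed chain length cannot provide.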

Related concepts in this collection 4

Why do outcome-based reward models fail at intermediate step evaluation? Outcome-based reward models (ORMs) evaluate only final results, creating a mismatch with the need to assess reasoning quality at intermediate steps. Understanding this failure mode matters for building better AI reasoning systems.
L2T is a third option: dense + annotation-free
Does more thinking time always improve reasoning accuracy? Explores whether extending a model's thinking tokens linearly improves performance, or if there's a point beyond which additional reasoning becomes counterproductive.
L2T's compression penalty directly addresses the degradation mechanism: penalizing tokens that don't contribute information gain
Can we allocate inference compute based on prompt difficulty? Does adjusting how much compute each prompt receives—rather than using a fixed budget—improve model performance? Could smarter allocation let smaller models compete with larger ones?
L2T provides the per-episode signal that makes adaptive allocation possible
Does supervising retrieval steps outperform final answer rewards? Can intermediate feedback on retrieval decisions—which documents to fetch, when to stop—train agentic RAG systems more effectively than rewarding only the final answer? This matters because poor retrieval paths can accidentally succeed or good ones can fail on noisy metrics.
Converging evidence from the agentic retrieval domain: RAG-Gym shows empirically that dense step-level rewards outperform outcome-only rewards for training search agents, and L2T provides the information-theoretic framework that explains why: per-episode information gain quantifies what outcome-only reward cannot.
