Can we reward reasoning steps without human annotation?
Existing RL for reasoning uses only final-answer rewards, causing models to produce wastefully long chains. Can information theory provide dense, automatic feedback for individual reasoning steps?
"Learning to Think" (L2T) addresses the dense process reward problem — how to evaluate the contribution of individual reasoning steps without human annotation or task-specific evaluators — through information theory.
The key problem: existing RL methods for reasoning use only final outcome rewards. Under this sparse feedback, extending the chain incurs no cost. Even a tiny accuracy gain from many extra steps registers as a positive signal. Models develop a "one more thought" bias, consuming more than double the tokens actually needed for correct answers. On simple tasks (e.g., "12 + 5"), overly long chains can reduce accuracy — the redundant computation is not just wasteful but actively harmful.
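The incentive described above can be illustrated with a toy calculation. This is only a sketch: the saturating accuracy curve `accuracy_after` and the per-step cost are hypothetical stand-ins, not values or formulas from the L2T paper. With zero cost per step, the longest chain always wins, because any accuracy gain, however small, counts as positive. Once each step carries a small cost, the preferred length becomes finite.

```python
def accuracy_after(steps: int) -> float:
    """Hypothetical accuracy curve: rises quickly, then saturates near 1.0."""
    return 1.0 - 0.5 * (0.5 ** steps)


def best_chain_length(max_steps: int, cost_per_step: float) -> int:
    """Return the chain length that maximizes expected reward minus step cost.

    With cost_per_step == 0.0 this is the outcome-only objective:
    expected reward equals the probability of a correct final answer.
    """
    def objective(steps: int) -> float:
        return accuracy_after(steps) - cost_per_step * steps

    return max(range(1, max_steps + 1), key=objective)


if __name__ == "__main__":
    # Outcome-only reward: longer is always (slightly) better.
    print(best_chain_length(20, 0.0))   # 20
    # A small per-step cost makes a short chain optimal.
    print(best_chain_length(20, 0.01))  # 5
```

The toy curve never decreases, so it shows only the "no cost" half of the problem; the accuracy drop on simple tasks reported in the note is a separate effect that this sketch does not model.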
L2T proposes a universal dense process reward with two components:
- Fitting information gain: quantifies how much each reasoning episode contributes to capturing correctness-critical information in the model's parameters
- Compression penalty: discourages excessive optimization, preserving efficiency
The reward is estimated via PAC-Bayes bounds and the Fisher information matrix, providing a tractable approximation with theoretical guarantees. Each query-response interaction is treated as a hierarchical session of multiple episodes. Upon each episode's completion, the reward is immediately computed — no waiting for the final answer.
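A minimal sketch of how a per-episode reward with these two components might be assembled. It is an illustration under stated assumptions, not the paper's estimator: the information gain is taken as a given input, the penalty uses the standard second-order approximation of a KL divergence, 0.5 · δᵀFδ, with a diagonal empirical Fisher matrix, and the names `diagonal_fisher`, `quadratic_kl`, `episode_reward`, and the weight `beta` are hypothetical.

```python
from typing import List, Sequence


def diagonal_fisher(per_sample_grads: Sequence[Sequence[float]]) -> List[float]:
    """Empirical diagonal Fisher: mean of squared per-sample gradients."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]


def quadratic_kl(delta: Sequence[float], fisher_diag: Sequence[float]) -> float:
    """Second-order KL approximation 0.5 * delta^T F delta with diagonal F."""
    return 0.5 * sum(f * d * d for f, d in zip(fisher_diag, delta))


def episode_reward(info_gain: float,
                   delta: Sequence[float],
                   fisher_diag: Sequence[float],
                   beta: float = 1.0) -> float:
    """Assumed combination: information gain minus a weighted compression penalty."""
    return info_gain - beta * quadratic_kl(delta, fisher_diag)


if __name__ == "__main__":
    grads = [[1.0, 2.0], [3.0, 0.0]]      # hypothetical per-sample gradients
    fisher = diagonal_fisher(grads)        # [5.0, 2.0]
    delta = [0.1, 0.2]                     # hypothetical parameter change for one episode
    print(episode_reward(0.3, delta, fisher))
```

The diagonal approximation is chosen here only because it keeps the cost linear in the number of parameters; the full Fisher matrix is quadratic in size, which is one reason the estimation overhead matters.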
This positions L2T as a third option in the ORM/PRM taxonomy. ORMs provide sparse outcome-only feedback (cheap but uninformative for intermediate steps). PRMs provide dense step-level feedback (informative but reliant on expensive annotation). L2T provides dense information-theoretic feedback (informative and annotation-free), with the trade-off being computational overhead for Fisher information estimation. The principle that dense process rewards outperform outcome-only signals extends beyond reasoning chains to agentic systems: the note "Does supervising retrieval steps outperform final answer rewards?" reports the same finding in agentic RAG, where step-level retrieval rewards substantially improve search agent training over final-answer-only reward.
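The difference between sparse and dense feedback can be made concrete by looking at the reward vector each scheme assigns to one response. The sketch below is an assumption-laden illustration: `outcome_rewards` and `dense_rewards` are hypothetical helpers, and the dense signal is approximated as the change in the log-probability the model assigns to the correct answer after each episode, which is a simple proxy rather than L2T's actual estimator.

```python
from typing import List, Sequence


def outcome_rewards(num_episodes: int, correct: bool) -> List[float]:
    """ORM-style signal: zero everywhere except the final episode."""
    rewards = [0.0] * num_episodes
    rewards[-1] = 1.0 if correct else 0.0
    return rewards


def dense_rewards(answer_logps: Sequence[float]) -> List[float]:
    """Dense proxy: per-episode change in log-probability of the correct answer.

    answer_logps[0] is the value before any reasoning; answer_logps[k] is the
    value after episode k, so the result has len(answer_logps) - 1 entries.
    """
    return [after - before
            for before, after in zip(answer_logps, answer_logps[1:])]


if __name__ == "__main__":
    logps = [-2.0, -1.0, -1.0, -0.25]       # hypothetical values for 3 episodes
    print(outcome_rewards(3, correct=True))  # [0.0, 0.0, 1.0]
    print(dense_rewards(logps))              # [1.0, 0.0, 0.75]
```

In this example the outcome-only vector cannot distinguish the second episode from the first, while the dense vector shows that the second episode contributed nothing.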
The task-dependence finding matters: moderate chain extensions improve coverage of critical steps on hard problems (Tier 4 multi-stage math), while the same extensions reduce accuracy on simple problems (Tier 1). No fixed chain length is optimal across tasks. This reinforces the note "Can we allocate inference compute based on prompt difficulty?": the budget must be adaptive, and L2T provides the per-episode signal to enable that adaptation.
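One way a per-episode signal could drive an adaptive budget is a simple stopping rule: keep reasoning while each episode still earns a reward above some threshold. This is a hypothetical sketch, not a procedure from the paper (which uses the reward for training); the function `episodes_before_stop`, the threshold value, and the reward sequences are all illustrative assumptions.

```python
from typing import Sequence


def episodes_before_stop(rewards: Sequence[float], threshold: float = 0.1) -> int:
    """Count the episodes run before the first one whose reward is below threshold."""
    used = 0
    for reward in rewards:
        if reward < threshold:
            break
        used += 1
    return used


if __name__ == "__main__":
    hard_task = [0.4, 0.3, 0.2, 0.05]    # hypothetical: gains persist for several episodes
    simple_task = [0.9, 0.0, -0.1]       # hypothetical: first episode already suffices
    print(episodes_before_stop(hard_task))    # 3
    print(episodes_before_stop(simple_task))  # 1
```

The same rule yields different chain lengths for the two tasks, which is the behavior a fixed budget cannot provide.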
Inquiring lines that use this note as a source 18
- Can subjective tasks be delegated without human feedback loops?
- Why do process reward models need human annotation while MCTS intermediate nodes don't?
- What information do numerical rewards fail to provide for reasoning tasks?
- Can self-supervised methods replace human annotations for process reward models?
- Are RLVR models worse than non-reasoning models for subjective annotation?
- What role does self-learning play in improving agent reasoning without annotation?
- What multi-turn reward structures would encourage active intent discovery?
- Can trajectory structure alone provide process supervision without human annotation?
- How can process reward models handle branching and revisiting in reasoning traces?
- Can process supervision improve agentic RL through meta-reasoning rewards?
- Why do standard process reward models struggle with branching reasoning traces?
- Do self-supervised process reward models scale better than human annotation?
- How do verifier-free and adversarial approaches compare in extending reasoning RL?
- Can verifier-free RL work without manual preference labels or task-specific training?
- Why does prompting discover capabilities that need reward-driven refinement?
- How can verifier-free reinforcement learning handle reasoning without task-specific checks?
- What makes step-wise rewards denser than final-answer correctness signals?
- Can approximate or noisy reference answers work for RL-based reasoning training?
Related concepts in this collection 4
- Why do outcome-based reward models fail at intermediate step evaluation?
Outcome-based reward models (ORMs) evaluate only final results, creating a mismatch with the need to assess reasoning quality at intermediate steps. Understanding this failure mode matters for building better AI reasoning systems.
L2T is a third option: dense + annotation-free
- Does more thinking time always improve reasoning accuracy?
Explores whether extending a model's thinking tokens linearly improves performance, or if there's a point beyond which additional reasoning becomes counterproductive.
L2T's compression penalty directly addresses the degradation mechanism: penalizing tokens that don't contribute information gain
- Can we allocate inference compute based on prompt difficulty?
Does adjusting how much compute each prompt receives—rather than using a fixed budget—improve model performance? Could smarter allocation let smaller models compete with larger ones?
L2T provides the per-episode signal that makes adaptive allocation possible
- Does supervising retrieval steps outperform final answer rewards?
Can intermediate feedback on retrieval decisions (which documents to fetch, when to stop) train agentic RAG systems more effectively than rewarding only the final answer? This matters because poor retrieval paths can accidentally succeed or good ones can fail on noisy metrics.
Converging evidence from the agentic retrieval domain: RAG-Gym shows empirically that dense step-level rewards outperform outcome-only rewards for training search agents. L2T provides the information-theoretic framework that explains why: per-episode information gain quantifies what outcome-only reward cannot.
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Reasoning Language Models: A Blueprint
- Learning to Think: Information-Theoretic Reinforcement Fine-Tuning for LLMs
- Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning
- StepWiser: Stepwise Generative Judges for Wiser Reasoning
- Understanding and Mitigating Premature Confidence for Better LLM Reasoning
- Intrinsic Credit Assignment for Long Horizon Interaction
- Learning to Reason without External Rewards
- Sycophancy Mitigation Through Reinforcement Learning with Uncertainty-Aware Adaptive Reasoning Trajectories
Original note title
information-theoretic dense process rewards quantify episode-wise contribution to answer correctness — outcome-only rl produces more than double the needed tokens