SYNTHESIS NOTE
Training, RL, and Test-Time Scaling

Can we reward reasoning steps without human annotation?

Existing RL for reasoning uses only final-answer rewards, causing models to produce wastefully long chains. Can information theory provide dense, automatic feedback for individual reasoning steps?

Synthesis note · 2026-02-22 · sourced from Reasoning o1 o3 Search

"Learning to Think" (L2T) addresses the dense process reward problem — how to evaluate the contribution of individual reasoning steps without human annotation or task-specific evaluators — through information theory.

The key problem: existing RL methods for reasoning use only final outcome rewards. Under this sparse feedback, extending the chain incurs no cost. Even a tiny accuracy gain from many extra steps registers as a positive signal. Models develop a "one more thought" bias, consuming more than double the tokens actually needed for correct answers. On simple tasks (e.g., "12 + 5"), overly long chains can reduce accuracy — the redundant computation is not just wasteful but actively harmful.
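The incentive described above can be made concrete with a toy calculation. This is a minimal sketch, not L2T code: the function names and the accuracy and step counts are hypothetical values chosen only to show that an outcome-only reward never charges for length.

```python
def outcome_reward(steps_used: int, is_correct: bool) -> float:
    """Sparse outcome-only reward: depends on correctness, never on chain length."""
    return 1.0 if is_correct else 0.0


def expected_return(accuracy: float, steps_used: int) -> float:
    """Expected outcome-only reward for a policy with the given accuracy."""
    return (accuracy * outcome_reward(steps_used, True)
            + (1.0 - accuracy) * outcome_reward(steps_used, False))


# Hypothetical numbers: 15 extra steps buy one point of accuracy.
short_chain = expected_return(accuracy=0.80, steps_used=10)
long_chain = expected_return(accuracy=0.81, steps_used=25)

# The longer chain scores higher, so the policy is pushed toward it
# even though it spends 2.5x the steps for a marginal gain.
print(short_chain, long_chain)
```

Because `steps_used` never enters the reward, any accuracy gain, however small, makes the longer chain strictly preferable under this signal.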

L2T proposes a universal dense process reward with two components:

The reward is estimated via PAC-Bayes bounds and the Fisher information matrix, providing a tractable approximation with theoretical guarantees. Each query-response interaction is treated as a hierarchical session of multiple episodes. Upon each episode's completion, the reward is immediately computed — no waiting for the final answer.
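To give a feel for why a Fisher information matrix makes a per-episode quantity cheap to compute, here is a generic sketch. It is not the L2T estimator: it assumes hypothetical parameter snapshots taken at episode boundaries, uses a diagonal empirical Fisher (mean of squared per-sample gradients), and scores each episode with the standard second-order approximation KL ≈ ½ Δθᵀ F Δθ.

```python
from typing import List, Sequence


def diagonal_fisher(per_sample_grads: Sequence[Sequence[float]]) -> List[float]:
    """Diagonal empirical Fisher: mean of squared per-sample gradients."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]


def fisher_quadratic(delta: Sequence[float], fisher_diag: Sequence[float]) -> float:
    """Second-order KL approximation: 0.5 * delta^T F delta with diagonal F."""
    return 0.5 * sum(f * d * d for f, d in zip(fisher_diag, delta))


def episode_scores(snapshots: Sequence[Sequence[float]],
                   fisher_diag: Sequence[float]) -> List[float]:
    """One score per episode, computed from the parameter change across it.

    Each score needs only the two snapshots around the episode, so it is
    available as soon as that episode finishes.
    """
    scores = []
    for prev, curr in zip(snapshots, snapshots[1:]):
        delta = [c - p for p, c in zip(prev, curr)]
        scores.append(fisher_quadratic(delta, fisher_diag))
    return scores


# Hypothetical two-parameter example with two episodes.
grads = [[1.0, 0.0], [1.0, 2.0]]
fisher = diagonal_fisher(grads)
snapshots = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
scores = episode_scores(snapshots, fisher)
print(fisher, scores)
```

The diagonal approximation is a design shortcut assumed here to keep the cost linear in the number of parameters; the source does not say which approximation L2T uses.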

This positions L2T as a third option in the ORM/PRM taxonomy. Outcome reward models (ORMs) provide sparse, outcome-only feedback: cheap, but uninformative about intermediate steps. Process reward models (PRMs) provide dense step-level feedback: informative, but dependent on expensive annotation. L2T provides dense information-theoretic feedback that is both informative and annotation-free, at the cost of the computational overhead of Fisher information estimation.

The principle that dense process rewards outperform outcome-only signals extends beyond reasoning chains to agentic systems. The note "Does supervising retrieval steps outperform final answer rewards?" reports the same finding in agentic RAG, where step-level retrieval rewards substantially improve search agent training over final-answer-only rewards.

The task-dependence finding matters: moderate chain extensions improve coverage of critical steps on hard problems (Tier 4 multi-stage math), while the same extensions reduce accuracy on simple problems (Tier 1). No fixed chain length is optimal across tasks. This reinforces the note "Can we allocate inference compute based on prompt difficulty?": the budget must be adaptive, and L2T provides the per-episode signal that makes such adaptation possible.
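One way to picture how a per-episode signal could drive an adaptive budget is a simple stopping rule. This is an illustrative sketch only: the function `episodes_to_keep`, the `min_gain` threshold, the `patience` parameter, and the reward values are all hypothetical and do not come from L2T.

```python
from typing import Sequence


def episodes_to_keep(episode_rewards: Sequence[float],
                     min_gain: float = 0.05,
                     patience: int = 1) -> int:
    """Return how many episodes to keep before rewards stay below min_gain.

    Stops once `patience` consecutive episodes each score below `min_gain`,
    and excludes that low-gain streak from the kept prefix.
    """
    low_streak = 0
    for i, reward in enumerate(episode_rewards):
        low_streak = low_streak + 1 if reward < min_gain else 0
        if low_streak >= patience:
            return i + 1 - low_streak
    return len(episode_rewards)


# Hypothetical per-episode rewards for an easy and a hard prompt.
easy_prompt = [0.60, 0.02, 0.01]
hard_prompt = [0.30, 0.25, 0.20, 0.10, 0.01]

print(episodes_to_keep(easy_prompt))  # short budget for the easy prompt
print(episodes_to_keep(hard_prompt))  # longer budget for the hard prompt
```

The same rule yields different chain lengths for the two prompts, which is the sense in which no fixed length fits every task.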

Original note title

information-theoretic dense process rewards quantify episode-wise contribution to answer correctness — outcome-only rl produces more than double the needed tokens