SYNTHESIS NOTE
Training, RL, and Test-Time Scaling

Can curriculum learning approximate expensive process supervision?

Can a reverse curriculum that slides backward from task completion provide step-level insight comparable to human process annotations, but at outcome supervision cost?

Synthesis note · 2026-02-22 · sourced from Reasoning Architectures

The core challenge of applying RL to complex reasoning: how do you provide meaningful supervision when the reasoning chain is long, errors compound across steps, and step-level annotation is expensive? R3 (Reverse Curriculum Reinforcement Learning) addresses this using only outcome supervision, with no human-annotated process labels.

The mechanism: Instead of having the model reason from scratch (which leads to sparse rewards and an exponentially large search space), R3 starts the model from a state sampled near the end of a correct demonstration. The demonstration prefix already supplies most of the chain, so the model only needs to generate the final few steps. Outcome supervision (correct or not) then provides informative feedback because the probability of success is high.
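A minimal sketch of the start-state construction described above, assuming a demonstration is stored as a list of step strings. The names make_start_state and outcome_reward are hypothetical and are not taken from the R3 codebase, and the suffix-matching reward is a deliberately crude stand-in for a real answer checker.

```python
from typing import List


def make_start_state(question: str, demo_steps: List[str], k: int) -> str:
    """Build a prompt from the question plus the first k demonstration steps.

    k = len(demo_steps) - 1 leaves only the last step for the model to write;
    k = 0 is the ordinary from-scratch setting.
    """
    if not 0 <= k <= len(demo_steps):
        raise ValueError("k must lie in [0, len(demo_steps)]")
    return "\n".join([question, *demo_steps[:k]])


def outcome_reward(completion: str, gold_answer: str) -> float:
    """Binary outcome reward (assumption: the answer is the completion's suffix)."""
    return 1.0 if completion.strip().endswith(gold_answer) else 0.0


question = "Q: What is 2 + 3 * 4?"
steps = ["3 * 4 = 12", "2 + 12 = 14", "Answer: 14"]
prompt = make_start_state(question, steps, k=2)  # the model only has to finish
```

Only the final outcome is scored here; no intermediate step receives a label.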

The start state then progressively slides backward toward the beginning of the demonstration. At each stage of the curriculum, the model is reasonably likely to succeed (because it has already learned to solve everything ahead of it), and failure is informative (because the model was already competent on the downstream steps). This creates a curriculum of gradually increasing exploration difficulty.
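The sliding schedule can be sketched as a list of start indices that moves from near the end of a demonstration back to its beginning. This is an illustration of the staged description in this note, not a claim about how R3 samples or mixes stages in practice; reverse_curriculum_starts and its stride parameter are hypothetical.

```python
from typing import List


def reverse_curriculum_starts(num_steps: int, stride: int = 1) -> List[int]:
    """Return start indices from near the end of a demonstration back to 0.

    A start index k means the first k demonstration steps are given to the
    model. The final entry is always 0, i.e. reasoning from scratch.
    """
    if num_steps < 1 or stride < 1:
        raise ValueError("num_steps and stride must be positive")
    starts = list(range(num_steps - 1, 0, -stride))
    starts.append(0)
    return starts


# One curriculum stage per start index, easiest (latest start) first.
stages = reverse_curriculum_starts(num_steps=5)
coarse_stages = reverse_curriculum_starts(num_steps=5, stride=2)
```

A larger stride gives fewer, harder stages; a stride of 1 visits every step boundary.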

Why this approximates process supervision: Each position in the sliding curriculum implicitly tests the model on that specific step's difficulty. A model that succeeds at start position k but fails at start position k-1 has revealed that step k-1 is where its reasoning breaks down, even though only outcome supervision is used. The resolution of this implicit signal increases as start positions are sampled at finer granularity.
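To make the localization argument concrete, here is a small sketch that reads outcome-only success rates per start position and reports where they drop most sharply. The function locate_breakdown, the min_drop threshold, and the example rates are all hypothetical; the rates are made up for illustration and are not measured results.

```python
from typing import List, Optional, Tuple


def locate_breakdown(
    success_by_start: List[float], min_drop: float = 0.2
) -> Optional[Tuple[int, float]]:
    """Find the start position where the outcome success rate drops the most.

    success_by_start[k] is the empirical success rate when the model is given
    the first k demonstration steps. Returns (k, drop) for the largest drop
    observed when moving from start k + 1 back to start k, or None if no drop
    reaches min_drop.
    """
    best: Optional[Tuple[int, float]] = None
    for k in range(len(success_by_start) - 1):
        drop = success_by_start[k + 1] - success_by_start[k]
        if drop >= min_drop and (best is None or drop > best[1]):
            best = (k, drop)
    return best


# Illustrative rates only: success collapses once the model must also
# produce the very first step itself.
rates = [0.25, 0.75, 0.75, 1.0]
weak_spot = locate_breakdown(rates)
```

The returned index is the earlier start position of the pair, corresponding to k-1 in the paragraph above. No step labels are used; the signal comes entirely from comparing outcome rewards across adjacent start positions.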

The two-mode comparison:

This is a practical solution to the trade-off documented in "Why do outcome-based reward models fail at intermediate step evaluation?"

Original note title

reverse curriculum rl approximates process supervision by progressively sliding the reasoning start state backward from near-completion