SYNTHESIS NOTE
Training, RL, and Test-Time Scaling

Can step-wise expert rewards help small models learn hard reasoning?

When small models fail on hard multi-step problems, can training them to match expert reasoning steps rather than final answers provide a useful learning signal? This note explores whether intermediate-step alignment can overcome the limitations of both supervised fine-tuning and outcome-based reinforcement learning.

Synthesis note · 2026-05-18 · sourced from Training Fine Tuning

Small open-source models hit a wall on hard multi-step reasoning problems. RLVR (Reinforcement Learning with Verifiable Rewards) fails when the model's success rate is effectively zero: no rollout produces the correct answer, so outcome-only supervision provides no positive signal. SFT (Supervised Fine-Tuning) overfits long demonstrations through rigid token-by-token imitation, particularly on small models where complex teacher traces exceed the student's representational capacity. Both methods fail in the same regime: small model, hard problem, and no path to correctness through their standard supervision.

Supervised Reinforcement Learning (SRL) fills the gap. The framework reformulates problem-solving as generating a sequence of logical actions, with the model trained to produce an internal reasoning monologue before committing to each action. Rewards come not from final-answer correctness but from similarity between the model's actions and expert actions extracted from an SFT dataset, computed step-wise as the rollout proceeds.
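The step-wise reward described above can be sketched in a few lines. This is a minimal illustration, not the method's actual implementation: the note does not name the similarity metric, so the sketch assumes a character-level ratio from Python's standard `difflib`, and it assumes the monologue is closed by a hypothetical `</think>` tag. The function names `extract_action` and `srl_step_reward` are made up for this example.

```python
from difflib import SequenceMatcher

THINK_CLOSE = "</think>"  # assumed delimiter between monologue and action


def extract_action(model_output: str) -> str:
    """Drop the reasoning monologue and keep only the committed action."""
    _, sep, action = model_output.partition(THINK_CLOSE)
    # If no delimiter is present, treat the whole output as the action.
    return action.strip() if sep else model_output.strip()


def srl_step_reward(model_output: str, expert_action: str) -> float:
    """Similarity in [0, 1] between the model's action and the expert's.

    Only the action is scored; the monologue is left unconstrained.
    """
    action = extract_action(model_output)
    return SequenceMatcher(None, action, expert_action).ratio()


if __name__ == "__main__":
    out = "<think>isolate x by subtracting 1</think> 2x = 4"
    print(srl_step_reward(out, "2x = 4"))  # exact match on the action
    print(srl_step_reward(out, "x = 2"))   # partial credit only
```

Scoring only the action leaves the monologue free, so under this assumption the model is rewarded for arriving at the expert's step, not for copying how the expert reasoned toward it.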

The reward structure is the key shift. Outcome rewards are sparse and binary — correct or not. Step-wise similarity rewards are dense and smooth — partial credit for partial alignment with expert steps. The model receives useful signal even on problems where it never reaches the correct answer, because the gradient flows from incremental alignment with the demonstrated reasoning path rather than from final-answer matching.
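A small numeric sketch makes the contrast concrete. It assumes a group-normalized advantage of the kind used in GRPO-style training, which the note itself does not specify; the reward values are invented for illustration and are not results from any experiment.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Outcome-only rewards on a problem the model never solves: all zero.
outcome_rewards = [0.0, 0.0, 0.0, 0.0]

# Step-wise similarity rewards for the same failed rollouts
# (illustrative values): each rollout earns partial credit.
stepwise_rewards = [0.15, 0.40, 0.62, 0.30]

if __name__ == "__main__":
    print(group_advantages(outcome_rewards))   # every advantage is 0.0
    print(group_advantages(stepwise_rewards))  # rollouts are ranked
```

With identical zero rewards every advantage is zero, so the policy update has nothing to push on. The dense rewards differ across rollouts, so the same failed attempts still yield a ranking the optimizer can use.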

This also addresses the SFT failure mode. SFT forces token-by-token imitation, which makes long expert traces brittle teaching examples for small models — one wrong predicted token derails the imitation. SRL operates at the action level, decomposing expert demonstrations into manageable steps. The model can be wrong about specific tokens while still receiving credit for action-level alignment.
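One way to picture the action-level decomposition is to turn a single expert trace into several shorter training examples, one per step. The sketch below is an assumption about how that could look: the note does not say how steps are delimited or how the context is built, so here each example conditions on the problem plus the expert steps that precede the target. `StepExample` and `decompose` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class StepExample:
    context: str        # problem plus the expert steps taken so far
    target_action: str  # the next expert action the model should match


def decompose(problem: str, expert_steps: list[str]) -> list[StepExample]:
    """Split one expert demonstration into one example per action."""
    examples = []
    for k, step in enumerate(expert_steps):
        # Context for step k is the problem followed by steps 0..k-1.
        context = "\n".join([problem, *expert_steps[:k]])
        examples.append(StepExample(context=context, target_action=step))
    return examples


if __name__ == "__main__":
    for ex in decompose("Solve 2x + 1 = 5", ["2x = 4", "x = 2"]):
        print(repr(ex.context), "->", repr(ex.target_action))
```

Each example asks for only one action, so a long trace that would be brittle as a single imitation target becomes a set of manageable sub-problems.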

The empirical result: SRL enables small models to learn problems that were previously unlearnable with SFT or RLVR. The method is most powerful as a curriculum component: initializing with SRL and then refining with RLVR outperforms either method alone, with SRL building the foundation that RLVR then sharpens.

Original note title

supervised RL provides step-wise expert-similarity rewards that yield learning signal even when all rollouts fail — bridges the SFT-RLVR gap for small models on hard reasoning