SYNTHESIS NOTE
Training, RL, and Test-Time Scaling

Can next-token prediction become a reasoning task with RL?

Does reinforcement learning applied to next-token prediction during pretraining encourage genuine reasoning rather than surface memorization? This matters because it could unlock reasoning capability without requiring labeled data or human feedback.

Synthesis note · 2026-02-22 · sourced from RLVR

Reinforcement Pre-Training (RPT) bridges self-supervised pretraining and reinforcement learning by reframing next-token prediction as next-token reasoning. For any context in a pretraining corpus, the model is incentivized to reason about the subsequent token before predicting it, receiving a verifiable reward based on prediction correctness against the ground-truth next token.
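
The reward described above can be sketched in a few lines. This is a minimal illustration, assuming an exact-match rule on the final predicted token. The `Rollout` type, the `next_token_reward` function, and the sample strings are hypothetical names invented for this sketch, not taken from the RPT work itself.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """One sampled attempt: free-form reasoning followed by a final prediction."""
    reasoning: str
    prediction: str


def next_token_reward(rollout: Rollout, ground_truth: str) -> float:
    """Rule-based reward: 1.0 if the prediction equals the corpus token, else 0.0.

    Assumption: exact string match. The reasoning text is never scored directly;
    it only matters through the prediction it leads to.
    """
    return 1.0 if rollout.prediction == ground_truth else 0.0


# Hypothetical context and rollouts, for illustration only.
context = "The capital of France is"
ground_truth = " Paris"
rollouts = [
    Rollout("France's capital city is well known.", " Paris"),
    Rollout("Maybe the sentence names another large city.", " Lyon"),
]
rewards = [next_token_reward(r, ground_truth) for r in rollouts]
```

Because the check is a plain comparison against the corpus, no learned reward model is involved at any point in this sketch.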

This changes the scalability bottleneck of RL for LLMs. Standard RLHF requires costly human preference data. RLVR requires domain-specific tasks with verifiable answers. RPT requires nothing beyond the pretraining corpus: the ground-truth next token serves as the verification target for the reward. In principle, any text corpus, up to web scale, becomes RL training data.
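
To make the data claim concrete, the sketch below turns a piece of raw text into (context, target) pairs, one per position, using no labels beyond the text itself. It assumes whitespace tokenization purely for readability; a real pipeline would use the model's own tokenizer. `corpus_to_examples` and `min_context` are hypothetical names.

```python
from typing import Iterator, List, Tuple


def corpus_to_examples(
    text: str, min_context: int = 1
) -> Iterator[Tuple[List[str], str]]:
    """Yield (context_tokens, next_token) pairs for every position in the text.

    Assumption: whitespace tokenization, chosen only to keep the sketch short.
    """
    tokens = text.split()
    for i in range(min_context, len(tokens)):
        # Everything before position i is the context; token i is the target
        # that a prediction would later be verified against.
        yield tokens[:i], tokens[i]


examples = list(corpus_to_examples("the cat sat on the mat"))
```

Each pair supplies both the prompt and its verification target, which is why no human preferences or curated answer sets are needed.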

Three structural advantages emerge. First, the reward signal is rule-based (correct/incorrect next-token prediction), which limits reward hacking because there is no learned reward model to exploit. Second, by encouraging reasoning before each prediction, RPT is intended to promote deeper understanding rather than surface memorization of token sequences. Third, the internal reasoning process allocates more computational effort per prediction step, a form of inference-time scaling applied at training time.

Relative to the approach covered in the note "Can models learn reasoning from predicting any text?", RPT operates at the same token-level granularity but with a fundamentally different mechanism. Quiet-STaR learns to generate useful rationales between tokens via a reinforcement signal. RPT learns to reason about what comes next via next-token verification. Both suggest that token-level reasoning during pretraining is a viable path to general reasoning capability.

The scaling curves show consistent improvement with increased training compute: next-token prediction accuracy rises as more compute is spent on RPT training. RPT also provides a strong foundation for subsequent reinforcement fine-tuning, suggesting that the reasoning patterns learned during pretraining compose with downstream RL rather than conflict with it.

Original note title

reinforcement pre-training reframes next-token prediction as a reasoning task trained with rl — using the corpus itself as verifiable reward