SYNTHESIS NOTE

Can next-token prediction become a reasoning task with RL?

Does reinforcement learning applied to next-token prediction during pretraining encourage genuine reasoning rather than surface memorization? This matters because it could unlock reasoning capability without requiring labeled data or human feedback.

Synthesis note · 2026-02-22 · sourced from RLVR

Reinforcement Pre-Training (RPT) bridges self-supervised pretraining and reinforcement learning by reframing next-token prediction as next-token reasoning. For any context in a pretraining corpus, the model is incentivized to reason about the subsequent token before predicting it, receiving a verifiable reward based on prediction correctness against the ground-truth next token.
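
The reward described above can be sketched in a few lines. This is a minimal illustration, not the RPT implementation: it assumes each sampled rollout is a reasoning trace that ends with a final prediction after a marker, and it treats tokens as plain strings compared after whitespace stripping. The `ANSWER:` marker and the names `extract_prediction` and `next_token_reward` are hypothetical.

```python
from typing import Optional

ANSWER_MARKER = "ANSWER:"  # hypothetical delimiter, assumed for illustration only


def extract_prediction(rollout: str) -> Optional[str]:
    """Return the text after the last answer marker, or None if there is none."""
    _, sep, tail = rollout.rpartition(ANSWER_MARKER)
    if not sep:
        return None  # the rollout never committed to a prediction
    prediction = tail.strip()
    return prediction or None


def next_token_reward(rollout: str, ground_truth: str) -> float:
    """Rule-based reward: 1.0 if the prediction matches the corpus token, else 0.0."""
    prediction = extract_prediction(rollout)
    return 1.0 if prediction == ground_truth else 0.0
```

Because the check is an exact comparison against the corpus, no learned reward model is involved; a rollout that reasons at length but never states a prediction simply scores 0.0 in this sketch.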

This removes the main scalability bottleneck of RL for LLMs: the supply of reward signal. Standard RLHF requires costly human preference labels. RLVR requires domain-specific problems with verifiable answers. RPT requires nothing beyond the pretraining corpus, because the ground-truth next token itself serves as the verifiable reward. In principle, any web-scale text corpus becomes RL training data.
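
To make the data argument concrete, the sketch below turns a raw token sequence into (context, ground-truth next token) pairs, one per position. It is an illustration under simplifying assumptions: whitespace splitting stands in for a real tokenizer, and the function name `corpus_to_episodes` and its `min_context` parameter are hypothetical.

```python
from typing import Iterator, List, Tuple


def corpus_to_episodes(
    tokens: List[str], min_context: int = 1
) -> Iterator[Tuple[List[str], str]]:
    """Yield (context, ground_truth_next_token) for each position in the sequence."""
    for i in range(min_context, len(tokens)):
        # Everything before position i is the context; token i is the target.
        yield tokens[:i], tokens[i]


# Whitespace splitting is a stand-in for a real tokenizer (assumption).
episodes = list(corpus_to_episodes("the cat sat on the mat".split()))
```

Every position yields one verifiable target with no labeling step, which is the sense in which unlabeled text doubles as RL training data.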

Three structural advantages emerge. First, the reward signal is rule-based (the predicted next token is either correct or incorrect), which limits reward hacking because there is no learned reward model to exploit. Second, by encouraging reasoning before each prediction, RPT aims to promote understanding of the context rather than surface memorization of token sequences. Third, the internal reasoning process spends more computation per prediction step, a form of inference-time scaling applied at training time.

RPT operates at the same granularity as Quiet-STaR (see "Can models learn reasoning from predicting any text?") but with a different mechanism. Quiet-STaR learns to generate useful rationales between tokens via a reinforcement signal. RPT learns to reason about what comes next and is rewarded through next-token verification. Both suggest that token-level reasoning during pretraining is a viable path to general reasoning capability.

The scaling curves show consistent improvement with increased training compute — more RPT training means better next-token prediction accuracy. RPT also provides a strong foundation for subsequent reinforcement fine-tuning, suggesting the reasoning patterns learned during pretraining compose with downstream RL rather than conflicting with it.

Related concepts in this collection 5

Can models learn reasoning from predicting any text?
Does training rationale generation at every token position on arbitrary internet text enable general reasoning without task-specific supervision? This challenges the assumption that reasoning requires curated QA datasets.
parallel token-level reasoning integration during pretraining

Do base models already contain hidden reasoning ability?
Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
RPT may create stronger latent capabilities than standard pretraining

Can chain-of-thought reasoning be learned during pretraining itself?
Explores whether reasoning emerges more effectively when models treat thinking as an exploratory action during next-token prediction, rather than only after pretraining through reinforcement learning.
RPT is the RL-native version of this bridge

Does RL teach reasoning or just when to use it?
Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
RPT strengthens what RL post-training later activates: if pretraining embeds RL-trained reasoning patterns, the latent capability that post-training teaches "when" to deploy is richer than standard pretraining would produce

Can reinforcement learning improve models during general pretraining?
Can RL work during standard pretraining on unverified text like Wikipedia, without reward models or labeled data? This matters because it would remove the data bottleneck that currently limits RL-based training to small verified domains.
sibling reinforcement-pretraining method: RPT makes every next-token a verifiable reward; PretrainZero adds active selection of informative, not-yet-mastered content to reinforce
