Can next-token prediction become a reasoning task with RL?
Does reinforcement learning applied to next-token prediction during pretraining encourage genuine reasoning rather than surface memorization? This matters because it could unlock reasoning capability without requiring labeled data or human feedback.
Reinforcement Pre-Training (RPT) bridges self-supervised pretraining and reinforcement learning by reframing next-token prediction as next-token reasoning. For any context in a pretraining corpus, the model is incentivized to reason about the subsequent token before predicting it, receiving a verifiable reward based on prediction correctness against the ground-truth next token.
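The reward described above can be sketched as a small rule-based function. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `ANSWER_MARKER`, `extract_prediction`, and `next_token_reward` are hypothetical, as is the convention that a reasoning rollout ends with its prediction after a fixed marker. Comparing whitespace-stripped strings is also a simplification, since real tokenizers often attach leading spaces to tokens.

```python
from typing import Optional

# Hypothetical delimiter separating the reasoning from the final prediction.
# It is an assumption for this sketch, not something the source specifies.
ANSWER_MARKER = "PREDICTION:"


def extract_prediction(rollout: str) -> Optional[str]:
    """Return the text after the last ANSWER_MARKER, or None if it is absent."""
    _, sep, tail = rollout.rpartition(ANSWER_MARKER)
    if not sep:
        return None
    return tail.strip()


def next_token_reward(rollout: str, ground_truth_token: str) -> float:
    """Rule-based reward: 1.0 if the predicted token matches the corpus, else 0.0."""
    prediction = extract_prediction(rollout)
    if prediction is None:
        # No parsable prediction counts as an incorrect prediction.
        return 0.0
    return 1.0 if prediction == ground_truth_token.strip() else 0.0
```

Because the function only compares the prediction with the token that actually follows in the corpus, it needs no learned reward model and no human label.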
This relaxes the data bottleneck that limits RL for LLMs. Standard RLHF requires costly human preference data. RLVR requires domain-specific tasks with verifiable answers. RPT requires nothing beyond the pretraining corpus: the ground-truth next token supplies the verifiable reward. In principle, any text corpus, up to web scale, becomes RL training data.
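To make concrete how unlabeled text yields RL training data, here is a minimal sketch that turns one tokenized document into (context, ground-truth next token) pairs. The function name `corpus_to_rl_examples`, the `min_context` parameter, and the toy token list are assumptions for illustration. A real pipeline would use an actual tokenizer and might select only some positions rather than every one.

```python
from typing import Iterator, List, Tuple


def corpus_to_rl_examples(
    tokens: List[str], min_context: int = 1
) -> Iterator[Tuple[List[str], str]]:
    """Yield (context, ground_truth_next_token) pairs from one tokenized document."""
    for t in range(min_context, len(tokens)):
        # Everything before position t is the context; tokens[t] is the
        # verifiable target that the reward is checked against.
        yield tokens[:t], tokens[t]


# Toy document; the token strings are illustrative only.
doc = ["The", " cat", " sat", " down"]
examples = list(corpus_to_rl_examples(doc))
```

Each position in the document yields one example, so no annotation step is needed beyond the text itself.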
Three structural advantages emerge. First, the reward signal is rule-based (the next-token prediction is either correct or incorrect), which limits reward hacking because there is no learned reward model to exploit. Second, by encouraging reasoning before each prediction, RPT promotes deeper understanding rather than surface memorization of token sequences. Third, the internal reasoning process allocates more computation to each prediction step, which amounts to inference-time scaling applied during training.
Compared with the approach in "Can models learn reasoning from predicting any text?", RPT operates at the same token-level granularity but through a different mechanism. Quiet-STaR learns to generate useful rationales between tokens via a reinforcement signal. RPT learns to reason about what comes next via next-token verification. Both suggest that token-level reasoning during pretraining is a viable path to general reasoning capability.
The reported scaling curves show next-token prediction accuracy improving consistently as RPT training compute increases. RPT also provides a strong foundation for subsequent reinforcement fine-tuning, suggesting that the reasoning patterns learned during pretraining compose with downstream RL rather than conflicting with it.
Inquiring lines that use this note as a source (14)
This note is a source for these synthesized inquiries.
- Does next-token prediction alone produce genuine functional language competence?
- Does reinforcement learning learn optimal per-turn reasoning discipline?
- Can next-token prediction train models to optimize for communication efficiency?
- Do thought anchors correspond mechanistically to planning tokens in RL?
- How do high-entropy tokens concentrate reinforcement learning's effect?
- Does reinforcement learning teach models how to reason or when to reason?
- Does next-token prediction actually explain how human thought works?
- How does predictive accuracy on future tokens differ from correctness on labeled answers?
- What makes token-level reasoning during pretraining different from test-time chain-of-thought?
- Does token-level reasoning during pretraining improve general reasoning without task-specific supervision?
- Does the token prediction framing actually capture what human reasoning does?
- Can standard next-token prediction capture complex multi-step human reasoning directly?
- Does targeting the edge of competence during RL pretraining unlock true reasoning gains?
- Why do standard next-token prediction models struggle with conversational initiative?
Related concepts in this collection (5)
- Can models learn reasoning from predicting any text?
  Does training rationale generation at every token position on arbitrary internet text enable general reasoning without task-specific supervision? This challenges the assumption that reasoning requires curated QA datasets.
  parallel token-level reasoning integration during pretraining
- Do base models already contain hidden reasoning ability?
  Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
  RPT may create stronger latent capabilities than standard pretraining
- Can chain-of-thought reasoning be learned during pretraining itself?
  Explores whether reasoning emerges more effectively when models treat thinking as an exploratory action during next-token prediction, rather than only after pretraining through reinforcement learning.
  RPT is the RL-native version of this bridge
- Does RL teach reasoning or just when to use it?
  Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
  RPT strengthens what RL post-training later activates: if pretraining embeds RL-trained reasoning patterns, the latent capability that post-training teaches "when" to deploy is richer than standard pretraining would produce
- Can reinforcement learning improve models during general pretraining?
  Can RL work during standard pretraining on unverified text like Wikipedia, without reward models or labeled data? This matters because it would remove the data bottleneck that currently limits RL-based training to small verified domains.
  sibling reinforcement-pretraining method: RPT makes every next-token a verifiable reward; PretrainZero adds *active selection* of informative, not-yet-mastered content to reinforce
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- RLP: Reinforcement as a Pretraining Objective
- Reinforcement Pre-Training
- On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models
- Eliciting Reasoning in Language Models with Cognitive Tools
- Base Models Know How to Reason, Thinking Models Learn When
- Do Theory of Mind Benchmarks Need Explicit Human-like Reasoning in Language Models?
- Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains
- ProtoReasoning: Prototypes as the Foundation for Generalizable Reasoning in LLMs
Original note title
reinforcement pre-training reframes next-token prediction as a reasoning task trained with rl — using the corpus itself as verifiable reward