Can models learn reasoning from predicting any text?

Does training a model to generate rationales at every token position of arbitrary internet text enable general reasoning without task-specific supervision? If it does, that challenges the assumption that reasoning requires curated QA datasets.

Synthesis note · 2026-02-22 · sourced from Reasoning by Reflection

STaR showed that LMs can bootstrap reasoning by training on rationales that led to correct answers on curated QA datasets. Quiet-STaR generalizes this in one critical way: rather than generating a rationale per problem, it generates a rationale at every token position to explain future text. The training corpus is arbitrary internet text, not curated reasoning tasks.
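
To make the "rationale at every token position" idea concrete, here is a minimal sketch of how a token sequence could be expanded into thought sites, each pairing a prefix with the future text a thought at that position would have to explain. The function name `thought_sites` and the `n_ahead` parameter are hypothetical, chosen for illustration; they are not taken from the source.

```python
def thought_sites(tokens, n_ahead=2):
    """Return one (prefix, future) pair per token position.

    Assumption for illustration: a thought generated after `prefix`
    is scored on how well it helps predict the next `n_ahead` tokens.
    """
    sites = []
    for i in range(1, len(tokens)):
        prefix = tokens[:i]              # text seen so far
        future = tokens[i:i + n_ahead]   # text the thought should help predict
        sites.append((prefix, future))
    return sites


sites = thought_sites(["The", "sum", "is", "4"], n_ahead=2)
for prefix, future in sites:
    print(prefix, "->", future)
```

A per-problem method would produce one rationale for the whole example; here a four-token text already yields three thought sites, which is why no curated problem boundaries are needed.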

The mechanism: at each token, the model generates a thought, mixes the thought-conditioned next-token prediction with the raw next-token prediction via a learned mixing head, and uses REINFORCE to improve thought quality. Learned start-of-thought and end-of-thought meta-tokens mark thought boundaries, and the mixing weight lets the model learn how much each thought should influence the prediction it commits to.
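
The mixing and reward steps described above can be sketched in a few lines of plain Python. This is an illustrative toy under stated assumptions, not the paper's implementation: mixing is assumed to be a linear interpolation of logits, the mixing weight is passed in as a number rather than produced by a learned head, and the reward baseline is assumed to be the mean over thoughts sampled at the same position. All function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def mix_logits(base_logits, thought_logits, mix_weight):
    """Interpolate raw and thought-conditioned logits (mix_weight in [0, 1])."""
    return [(1.0 - mix_weight) * b + mix_weight * t
            for b, t in zip(base_logits, thought_logits)]


def thought_rewards(base_logits, thought_logits_per_sample, mix_weights, target_id):
    """Reward each sampled thought by the log-probability it assigns to the
    true next token, relative to the mean over all sampled thoughts."""
    log_probs = []
    for t_logits, w in zip(thought_logits_per_sample, mix_weights):
        mixed = mix_logits(base_logits, t_logits, w)
        log_probs.append(log_softmax(mixed)[target_id])
    baseline = sum(log_probs) / len(log_probs)
    return [lp - baseline for lp in log_probs]


# Two sampled thoughts at one position; the true next token has id 0.
rewards = thought_rewards(
    base_logits=[0.0, 0.0, 0.0],
    thought_logits_per_sample=[[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]],
    mix_weights=[0.5, 0.5],
    target_id=0,
)
print(rewards)
```

In a REINFORCE update, each reward would scale the gradient of the log-probability of the corresponding thought's tokens, so thoughts that beat the baseline become more likely and the others less so.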

The key shift: from task-specific reasoning ("do this type of math problem") to text-general reasoning ("what reasoning helps predict what comes next in any text?"). STaR's ceiling was its dependence on curated QA datasets, which are high-quality but inherently narrow. Quiet-STaR's ceiling is the diversity of the pretraining corpus.

Because rationale quality is judged by predictive accuracy on future text rather than correctness on labeled answers, the method generalizes across the tasks present in language rather than the tasks present in annotation pipelines. The "task" is prediction itself.
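
The contrast between the two training signals can be shown side by side. The sketch below is a simplification with hypothetical function names: the label-based signal needs a gold answer, while the prediction-based signal needs only the text that actually follows. Comparing against the no-thought likelihood is an assumption made here for clarity; other baselines are possible.

```python
import math


def answer_reward(predicted_answer, gold_answer):
    """Label-based signal: 1.0 only if the rationale led to the labeled answer."""
    return 1.0 if predicted_answer.strip() == gold_answer.strip() else 0.0


def prediction_reward(p_future_with_thought, p_future_without_thought):
    """Label-free signal: gain in log-likelihood of the tokens that follow.

    Each argument is a list of probabilities the model assigned to the
    true future tokens, with and without the thought (illustrative inputs).
    """
    with_thought = sum(math.log(p) for p in p_future_with_thought)
    without_thought = sum(math.log(p) for p in p_future_without_thought)
    return with_thought - without_thought


# Requires an annotated answer:
print(answer_reward("42 ", "42"))
# Requires only the continuation of the text:
print(prediction_reward([0.5, 0.5], [0.25, 0.25]))
```

The second function never touches a label, which is the sense in which any text with a continuation can supply a training signal.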

The approach remains constrained by the training distribution: rationales that help predict common internet text patterns may not generalize to hard reasoning that requires novel inference rarely seen in the corpus. But it suggests that general reasoning competence may be trainable as a side effect of improved language modeling, rather than as a separate supervised objective.

Related concepts in this collection 7

Do base models already contain hidden reasoning ability? Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
complements: Quiet-STaR offers a pretraining-time mechanism for the same underlying capability
Does RL teach reasoning or just when to use it? Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
contrasts: Quiet-STaR bakes reasoning into the forward pass at every token; RL teaches when to engage reasoning mechanisms at deployment
Why doesn't mathematical reasoning transfer to medicine? Can models trained to reason well about math apply those skills to medical domains through fine-tuning? This explores whether reasoning ability is truly domain-agnostic or constrained by domain-specific knowledge requirements.
extends: Quiet-STaR's ceiling is training distribution diversity; this note explains why general reasoning competence, however trained, hits a floor when domain-specific knowledge is absent
Can training data augmentation match test-time compute scaling benefits? Can generating thinking trajectories during pretraining unlock the same efficiency gains that test-time scaling provides at inference? This explores whether the compute-allocation principle works across the training-inference boundary.
parallel token-level reasoning during pretraining: Quiet-STaR modifies the training objective to learn rationales at each token, while TPT augments the training data with externally-generated thinking trajectories; different intervention points (objective vs. data) targeting the same problem of making pretraining reasoning-aware
Can models learn to internalize search algorithms through training? Can chain-of-thought reasoning be taught as an explicit search process that models learn to implement internally? This matters because it could unlock algorithmic optimization rather than just output optimization.
complementary internalization: Quiet-STaR trains token-level rationale generation via predictive accuracy, while Meta-CoT trains trace-level search strategies via linearized MCTS/A* — together they suggest reasoning internalization is possible at multiple granularities from individual predictions to complete search procedures
Can next-token prediction become a reasoning task with RL? Does reinforcement learning applied to next-token prediction during pretraining encourage genuine reasoning rather than surface memorization? This matters because it could unlock reasoning capability without requiring labeled data or human feedback.
parallel approach: RPT uses next-token verification as RL reward signal at the same token-level granularity; Quiet-STaR generates rationales via REINFORCE while RPT reasons about predictions via RL, both treating the pretraining corpus as the training signal for reasoning
Can models learn to evaluate their own work during training? Explores whether language models can internalize reward function computation as part of training, transforming external feedback into internal self-assessment capability without slowing inference.
complementary training-time reasoning augmentation: Quiet-STaR generates rationales at every token position, PCL generates self-evaluations in post-EOS space; both add auxiliary reasoning during training that shapes the model, but at different positions (pre-token vs. post-answer)
