SYNTHESIS NOTE

Can models learn reasoning from predicting any text?

Does training a model to generate rationales at every token position of arbitrary internet text enable general reasoning without task-specific supervision? The question challenges the assumption that reasoning requires curated QA datasets.

Synthesis note · 2026-02-22 · sourced from Reasoning by Reflection

STaR showed that LMs can bootstrap reasoning by training on rationales that led to correct answers on curated QA datasets. Quiet-STaR generalizes this in one critical way: rather than generating a rationale per problem, it generates a rationale at every token position to explain future text. The training corpus is arbitrary internet text, not curated reasoning tasks.

The mechanism: at each token, the model generates a thought, mixes the thought-conditioned next-token prediction with the raw next-token prediction via a learned mixing head, and uses REINFORCE to improve thought quality. Custom meta-tokens signal thought boundaries, allowing the model to learn when to generate rationales and when to commit predictions.
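The mixing step described above can be sketched in a few lines. This is a minimal illustration, not the method's actual implementation: the function names `softmax` and `mix_predictions` are hypothetical, the mixing weight is passed in as a plain number rather than produced by a learned head, and interpolating in probability space (rather than on logits) is an assumption, since the note does not say which is used.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_predictions(base_logits, thought_logits, mix_weight):
    """Blend the raw and thought-conditioned next-token distributions.

    mix_weight stands in for the output of a learned mixing head
    (assumption: a scalar in [0, 1]; 0 ignores the thought entirely,
    1 trusts the thought-conditioned prediction entirely).
    """
    if not 0.0 <= mix_weight <= 1.0:
        raise ValueError("mix_weight must be in [0, 1]")
    p_base = softmax(base_logits)
    p_thought = softmax(thought_logits)
    return [
        (1.0 - mix_weight) * b + mix_weight * t
        for b, t in zip(p_base, p_thought)
    ]


# Toy vocabulary of three tokens.
base = [2.0, 1.0, 0.0]       # prediction without a thought
thought = [0.0, 1.0, 3.0]    # prediction after generating a thought
mixed = mix_predictions(base, thought, 0.25)
```

A weight near 0 leaves the base prediction almost untouched, so a poor thought does little damage to the language-modeling loss. That is one plausible reading of why a mixing head is useful while thought quality is still low.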

The key shift: from task-specific reasoning ("do this type of math problem") to text-general reasoning ("what reasoning helps predict what comes next in any text?"). STaR's ceiling was its dependency on curated QA datasets, which are high-quality but inherently narrow. Quiet-STaR's ceiling is the diversity of the pretraining corpus.

Because rationale quality is judged by predictive accuracy on future text rather than correctness on labeled answers, the method generalizes across the tasks present in language rather than the tasks present in annotation pipelines. The "task" is prediction itself.
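To make "judged by predictive accuracy on future text" concrete, here is a hedged sketch of a REINFORCE-style reward. It assumes several thoughts are sampled at the same position and that each is scored by the log-probability it assigns to the true future tokens, with the mean across thoughts as a baseline. The function names and the baseline choice are illustrative assumptions, not details stated in this note.

```python
def thought_rewards(future_logprobs):
    """Score each sampled thought relative to its peers.

    future_logprobs[i] is the total log-probability of the true future
    tokens when conditioned on thought i (assumed to be computed
    elsewhere by the model). Subtracting the mean gives a positive
    reward to thoughts that predict better than average.
    """
    if not future_logprobs:
        raise ValueError("need at least one sampled thought")
    baseline = sum(future_logprobs) / len(future_logprobs)
    return [lp - baseline for lp in future_logprobs]


def reinforce_loss(rewards, thought_logprobs):
    """REINFORCE surrogate loss: -reward * log p(thought), averaged.

    thought_logprobs[i] is the log-probability the model assigned to
    generating thought i. Minimizing this loss raises the probability
    of thoughts with positive reward and lowers it for negative ones.
    """
    terms = [-r * lp for r, lp in zip(rewards, thought_logprobs)]
    return sum(terms) / len(terms)


# Three sampled thoughts at one position; no labeled answer is involved.
rewards = thought_rewards([-2.0, -1.0, -3.0])
loss = reinforce_loss(rewards, [-5.0, -4.0, -6.0])
```

The reward here depends only on the text that actually follows in the corpus, which is why no annotation pipeline is needed.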

This remains constrained by training distribution: rationales that help predict common internet text patterns may not generalize to hard reasoning requiring novel inference that rarely appears in the corpus. But it suggests that general reasoning competence may be trainable as a side effect of improved language modeling, rather than as a separate supervised objective.

Original note title

quiet-star learns rationale generation at the token level not the task level enabling general reasoning without task-specific supervision