Can training data augmentation match test-time compute scaling benefits?
Can generating thinking trajectories during pretraining unlock the same efficiency gains that test-time scaling provides at inference? This note explores whether the compute-allocation principle carries across the training-inference boundary.
Thinking Augmented Pre-Training (TPT, arXiv:2509.20186) introduces a simple insight: some valuable tokens are too hard to learn in a single next-token prediction step because they represent the output of complex multi-step human reasoning. Rather than modifying the architecture, TPT augments the training data itself: it generates thinking trajectories with open-source LLMs and interleaves them with the original text.
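The augmentation step described above can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline: the chunking into a list of strings, the `think` callable, the separator, and the `stub_think` stand-in are all hypothetical names introduced here. A real setup would replace `stub_think` with a call to an open-source LLM and a prompt of its own design.

```python
from typing import Callable, List


def augment_document(
    chunks: List[str],
    think: Callable[[str], str],
    sep: str = "\n\n",
) -> str:
    """Follow each original chunk with a generated thinking trajectory.

    `think` is a hypothetical callable standing in for an LLM that
    returns a thinking trajectory for the given chunk of text.
    """
    parts: List[str] = []
    for chunk in chunks:
        parts.append(chunk)
        trajectory = think(chunk).strip()
        if trajectory:  # skip empty trajectories rather than emit blank blocks
            parts.append(trajectory)
    return sep.join(parts)


def stub_think(chunk: str) -> str:
    # Placeholder only: a real generator would produce an expert-style analysis.
    return f"[thinking] Restating the claim: {chunk}"


chunks = [
    "The sum of the first n odd numbers is n squared.",
    "Hence 1 + 3 + 5 + 7 = 16.",
]
augmented = augment_document(chunks, stub_think)
print(augmented)
```

The resulting string would then be tokenized and trained on with the ordinary next-token prediction loss, which is why no architecture change is needed under this reading.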
The key finding: a 3x improvement in data efficiency, with gains of more than 10% on reasoning benchmarks for a 3B-parameter model. No architecture changes. No human annotation. The thinking trajectories simulate an expert's analysis of the text, decomposing hard tokens into learnable intermediate steps.
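One way to read a data-efficiency multiplier is as the ratio of training tokens two runs need to reach the same benchmark score. The sketch below assumes that reading, which may differ from the paper's exact definition. The function names and the learning-curve numbers are hypothetical and chosen only for illustration; they are not results from the paper.

```python
from typing import List, Optional, Tuple

Curve = List[Tuple[float, float]]  # (tokens_seen, score), sorted by tokens_seen


def tokens_to_reach(curve: Curve, target: float) -> Optional[float]:
    """Linearly interpolate the token count at which `curve` first reaches `target`."""
    prev_tokens, prev_score = curve[0]
    if prev_score >= target:
        return prev_tokens
    for tokens, score in curve[1:]:
        if score >= target:
            # Here score >= target > prev_score, so the denominator is positive.
            frac = (target - prev_score) / (score - prev_score)
            return prev_tokens + frac * (tokens - prev_tokens)
        prev_tokens, prev_score = tokens, score
    return None  # the run never reaches the target


def data_efficiency_ratio(baseline: Curve, augmented: Curve, target: float) -> Optional[float]:
    """Tokens the baseline needs divided by tokens the augmented run needs."""
    b = tokens_to_reach(baseline, target)
    a = tokens_to_reach(augmented, target)
    if b is None or a is None:
        return None
    return b / a


# Made-up curves, in billions of training tokens.
baseline = [(10.0, 20.0), (50.0, 30.0), (100.0, 40.0)]
augmented = [(10.0, 25.0), (30.0, 40.0), (100.0, 50.0)]
print(data_efficiency_ratio(baseline, augmented, target=40.0))
```

A ratio above 1 means the augmented run reaches the target score with fewer training tokens than the baseline.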
The mechanism has a natural self-organizing property. Thinking trajectories are longer for domains like mathematics where reasoning is more intensive. A positive correlation exists between reasoning intensity of the original text and thinking length. This means harder tokens automatically receive more training compute through longer trajectories — functioning as a natural up-sampling mechanism for high-value data.
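The up-sampling effect can be made concrete with a small calculation. In the sketch below, the domain names and token counts are invented for illustration, and `token_shares` is a hypothetical helper. It compares each domain's share of training tokens before and after thinking trajectories are added.

```python
from typing import Dict, Tuple


def token_shares(corpus: Dict[str, Tuple[int, int]]) -> Dict[str, Tuple[float, float]]:
    """Map each domain to (share before augmentation, share after augmentation).

    `corpus` maps a domain name to (original_tokens, thinking_tokens).
    """
    total_before = sum(orig for orig, _ in corpus.values())
    total_after = sum(orig + extra for orig, extra in corpus.values())
    return {
        domain: (orig / total_before, (orig + extra) / total_after)
        for domain, (orig, extra) in corpus.items()
    }


# Made-up counts: the reasoning-heavy domain gets longer trajectories.
corpus = {"math": (1000, 3000), "news": (1000, 500)}
for domain, (before, after) in token_shares(corpus).items():
    print(f"{domain}: {before:.2f} -> {after:.2f}")
# math: 0.50 -> 0.73
# news: 0.50 -> 0.27
```

With equal amounts of original text, the domain with longer trajectories ends up with a larger share of training tokens, without any explicit re-weighting of the data mix.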
This is the training-time analog of test-time scaling. Building on "Can inference compute replace scaling up model size?", TPT shows that the same principle operates during training: allocate more compute to harder tokens. The difference is the intervention point: training rather than inference.
The connection to "Can next-token prediction become a reasoning task with RL?" is complementary. Reinforcement Pre-Training (RPT) changes the training objective (RL instead of next-token prediction, NTP). TPT changes the training data (augmented with thinking). Both target the same problem, namely that standard NTP is insufficient for learning complex reasoning from data, but they intervene at different levels.
In light of "Do base models already contain hidden reasoning ability?", TPT provides a pretraining-time mechanism for strengthening such latent capabilities. The thinking trajectories may serve as the training-time equivalent of the "minimal signals" that activate reasoning, making reasoning patterns more available for later post-training to refine.
A notable finding: the model trained on augmented data can surpass the performance of the LLM that generated the thinking trajectories. Explanation is easier than generation from scratch, so the student benefits from the teacher's explanatory labor even when the teacher's own generation capabilities are limited.
Inquiring lines that use this note as a source 29
- How does step-level compute allocation compare to response-level thinking?
- Does test-time compute actually substitute for having larger model parameters?
- Can offline context optimization reduce test-time latency like sleep-time compute?
- How much does pretraining contribute to ToM performance versus task-specific training?
- How does the three-component definition apply to test-time scaling laws?
- How does test-time compute substitute for model parameter scaling?
- Can test-time compute on smaller models replace larger model inference?
- What capabilities actually require massive scale versus specialized training regimes?
- What mechanisms drive test-time compute allocation in reasoning tasks?
- How much can mitigation techniques like augmentation reduce priming without harming learning?
- Can test-time compute allocation shift from solutions to strategies?
- How does task structure determine optimal test-time compute allocation?
- Where does sleep-time compute fit in the taxonomy of test-time scaling?
- How do internal versus external test-time scaling approaches differ from precomputation strategies?
- Can gradient-based influence estimation make test-time training more efficient?
- Does inference-time compute improve pretraining data efficiency in practice?
- Can test-time compute budgets be allocated differently per query difficulty?
- Can memory and test-time compute scale together as a single axis?
- Can test-time scaling work through retrieval rather than reasoning?
- Can test-time scaling compound through memory consolidation into a new scaling law?
- Why does test accuracy improve after training accuracy reaches 100 percent?
- Can the exploration ceiling be raised beyond what pretraining established?
- Can test-time compute fully replace scaling model parameters on hard problems?
- How do reward models guide inference-time compute allocation decisions?
- How does spending offline compute affect wake-time prediction latency?
- Why does pre-training provide the raw material for emergent thinking?
- Why should scaling laws be understood as properties of data distribution rather than training in general?
- Can test-time compute scaling substitute for larger model parameters?
- How does model scale affect anticipatory behavior in structured training?
Related concepts in this collection 5
- Can next-token prediction become a reasoning task with RL?
  Does reinforcement learning applied to next-token prediction during pretraining encourage genuine reasoning rather than surface memorization? This matters because it could unlock reasoning capability without requiring labeled data or human feedback.
  complementary approach: changes objective vs. changes data
- Do base models already contain hidden reasoning ability?
  Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
  TPT may strengthen latent capabilities that post-training later activates
- Does training data format shape reasoning strategy more than domain?
  What explains why models trained on multiple-choice data reason differently than those trained on free-form text? The research isolates format and domain effects to measure which one matters more.
  thinking trajectories are a format intervention
- Can inference compute replace scaling up model size?
  Explores whether smaller models given more thinking time during inference can match larger models. Matters because it reshapes deployment economics and compute allocation strategies.
  same principle at training time
- Can models learn reasoning from predicting any text?
  Does training rationale generation at every token position on arbitrary internet text enable general reasoning without task-specific supervision? This challenges the assumption that reasoning requires curated QA datasets.
  parallel token-level reasoning during pretraining through different mechanisms: TPT embeds externally-generated thinking trajectories in training data, while Quiet-STaR learns to generate rationales at every token position via REINFORCE; TPT intervenes on data, Quiet-STaR on the training objective, but both make pretraining reasoning-aware at the individual token level
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Thinking Augmented Pre-training
- Learning to Think: Information-Theoretic Reinforcement Fine-Tuning for LLMs
- Does Thinking More always Help? Understanding Test-Time Scaling in Reasoning Models
- Towards Large Reasoning Models: A Survey on Scaling LLM Reasoning Capabilities
- What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT
- Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking
- Rethinking Thinking Tokens: LLMs as Improvement Operators
- Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
Original note title
thinking-augmented pre-training increases data efficiency 3x by applying test-time scaling principles at training time