Does inference-time compute improve pretraining data efficiency in practice?

This explores whether spending more compute at inference time (test-time reasoning, longer thinking traces) actually lets models learn more from less pretraining data — and the corpus suggests the real lever is moving that compute *into* training rather than reserving it for inference.

This explores whether inference-time compute improves pretraining data efficiency — and the most direct answer in the corpus flips the question: the biggest efficiency gains come from importing the *logic* of test-time scaling back into pretraining itself. The clearest evidence is thinking-augmented pretraining, where pretraining data is enriched with LLM-generated reasoning traces and harder tokens automatically attract longer traces — a built-in compute-allocation mechanism that mirrors how test-time scaling spends more on hard prompts. The result is a 3x data-efficiency gain and a 10%+ reasoning bump for a 3B model Can training data augmentation match test-time compute scaling benefits?. So in practice, the answer isn't "add inference compute to a fixed model" — it's "teach the model to reason while pretraining, and you need less data."

That reframing matters because raw inference compute alone hits a wall. Non-reasoning models don't catch up to reasoning models no matter how large the inference budget, because what makes extra tokens *productive* is a reasoning protocol instilled during training, not the tokens themselves Can non-reasoning models catch up with more compute?. Inference compute is a lever only when training has already shaped the model to use it well. The complementary move is to make that protocol part of pretraining from the start: RLP treats chain-of-thought as an exploratory action during pretraining and rewards it by how much it improves next-token prediction, lifting math and science benchmarks ~19% — reasoning planted earlier rather than bolted on after Can chain-of-thought reasoning be learned during pretraining itself?.

There's a deeper reason this works so well, and it's the thing you might not expect: a lot of "new" reasoning capability was already latent in the base model. Five independent methods — RL steering, critique fine-tuning, decoding tweaks, SAE feature steering, RLVR — all surface reasoning that already lives in base-model activations, meaning post-training *selects* rather than *creates* the ability Do base models already contain hidden reasoning ability?. If the capability is mostly there, then the efficiency question becomes one of elicitation, and inference-time or lightweight-training methods are cheap ways to unlock what your pretraining data already paid for. Adaptive test-time compute fits here too: spending more on hard prompts and less on easy ones beats uniform budgets How should we allocate compute budget at inference time? — useful, but it's harvesting latent capability, not manufacturing new data efficiency.

The honest caveat is that inference-time reasoning doesn't generalize for free. Chain-of-thought degrades predictably once you push outside the training distribution — models produce fluent but logically broken reasoning, imitating the *form* of reasoning without the underlying logic Does chain-of-thought reasoning actually generalize beyond training data?. So inference compute amplifies whatever the training distribution taught; if the data didn't cover a regime, more thinking tokens won't conjure competence there. This is exactly why the corpus leans toward building reasoning into pretraining and toward learning *when* to think versus answer directly Can models learn when to think versus respond quickly? — you get the efficiency by allocating compute intelligently across both phases, not by treating inference as a substitute for data.

The takeaway a curious reader walks away with: "inference-time compute improves data efficiency" is true mainly in an indirect, surprising way — the winning recipe is to *move test-time reasoning into pretraining* (generate the thinking, reward it during training), because the base model already holds latent reasoning and what's scarce is the protocol that makes extra compute pay off.

Sources 7 notes

Can training data augmentation match test-time compute scaling benefits?

Augmenting pretraining data with LLM-generated reasoning traces improves data efficiency 3x and reasoning benchmark performance 10%+ for 3B models. Harder tokens automatically receive longer traces, creating a natural compute-allocation mechanism analogous to test-time scaling.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Can chain-of-thought reasoning be learned during pretraining itself?

RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

How should we allocate compute budget at inference time?

Research shows that dynamically adjusting inference compute per prompt—rather than using fixed budgets—improves performance and efficiency. Uniform spending wastes resources on easy problems while underserving hard ones.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Research prompt for your LLMexpand ↓

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher auditing claims about inference-time compute and data efficiency in LLM pretraining. The question remains open: does inference-time compute improve pretraining data efficiency in practice?

What a curated library found — and when (dated claims, not current truth):
Findings span October 2024 to October 2025. Key constraints and findings:
- Thinking-augmented pretraining (injecting LLM-reasoned traces into data) yields 3x data-efficiency gain and 10%+ reasoning improvement for 3B models, but this *imports test-time logic into pretraining*, not raw inference compute (2025-09).
- Non-reasoning models cannot match reasoning models even with unlimited inference budget, because productive extra tokens require a reasoning *protocol instilled during training* (2025-04).
- Latent reasoning capability already exists in base-model activations; five independent methods (RL steering, critique, decoding, SAE steering, RLVR) surface it via *selection*, not creation (2025-04).
- Chain-of-thought reasoning degrades predictably outside training distribution—fluent but logically broken—meaning inference compute amplifies what data taught, it doesn't substitute for coverage (2025-08).
- Adaptive test-time compute (spending more on hard prompts) beats uniform budgets but harvests latent capability, not generating new data efficiency (2025-06).

Anchor papers (verify; mind their dates):
- arXiv:2509.20186 (2025-09) Thinking Augmented Pre-training
- arXiv:2504.09858 (2025-04) Reasoning Models Can Be Effective Without Thinking
- arXiv:2508.01191 (2025-08) Is Chain-of-Thought Reasoning of LLMs a Mirage?
- arXiv:2505.13379 (2025-05) Thinkless: LLM Learns When to Think

Your task:
(1) RE-TEST EACH CONSTRAINT. For the five claims above, probe whether newer models, training methods (e.g., synthetic data generation, RL objectives), tooling (inference engines with adaptive batching), or post-training (e.g., test-time scaling via Anthropic's Constitutional AI patterns) have since relaxed or overturned them. Separate the durable question—*does raw compute at test time substitute for data coverage?*—from perishable limitations. Does the latent-reasoning hypothesis still hold? Can inference compute now *create* rather than *select*?
(2) Surface the strongest work from the last 6 months that *contradicts* the synthesis's finding that reasoning must be baked into pretraining. Look for papers showing inference-only methods that genuinely improve data efficiency without retraining.
(3) Propose 2 research questions that assume the regime may have shifted: e.g., "Can retrieval-augmented test-time reasoning overcome distribution-bounded CoT?" or "Do recent scaling laws break the latent-capability model?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.

Does inference-time compute improve pretraining data efficiency in practice?

Sources 7 notes

Next inquiring lines