How do timing and search internalization interact during reasoning post-training?
This explores two threads the corpus keeps tangling together. *Timing* covers when reasoning gets installed (pretraining vs. post-training) and how long a model is allowed to think at inference. *Search internalization* covers whether a model learns to run search algorithms in its own head rather than calling out for them. The question is how those two pull on each other.
The cleanest statement of internalization is Meta-CoT, which trains on linearized traces of algorithms like MCTS and A* so the model implements the search internally rather than enumerating it step by step (see: Can models learn to internalize search algorithms through training?). The striking thing is what that buys you: once search lives inside the weights, the inference-time cost of that search collapses, because you're no longer paying token by token for exploration. That's the first place timing and internalization meet.
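To make "linearized traces" concrete, here is a minimal sketch of how an A* run could be flattened into a text sequence that a model might be trained on. The trace vocabulary (`expand`, `push`, `solution`), the toy graph, and the function name `astar_trace` are all illustrative assumptions; the notes do not specify the format Meta-CoT actually uses.

```python
import heapq


def astar_trace(graph, heuristic, start, goal):
    """Run A* and return (path, trace); trace is a flat list of text steps.

    The step format is a hypothetical choice made for illustration only.
    """
    frontier = [(heuristic[start], 0, start, [start])]  # (f, g, node, path)
    best_g = {start: 0}
    trace = []
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if g > best_g.get(node, float("inf")):
            continue  # stale queue entry, a cheaper route was found later
        trace.append(f"expand {node} g={g} f={f}")
        if node == goal:
            trace.append("solution " + " ".join(path))
            return path, trace
        for nxt, cost in graph.get(node, []):
            new_g = g + cost
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                new_f = new_g + heuristic[nxt]
                trace.append(f"push {nxt} g={new_g} f={new_f}")
                heapq.heappush(frontier, (new_f, new_g, nxt, path + [nxt]))
    trace.append("solution none")
    return None, trace


# Toy problem (hypothetical): edges carry costs, heuristic is admissible.
graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 1), ("D", 5)], "C": [("D", 1)], "D": []}
heuristic = {"A": 2, "B": 2, "C": 1, "D": 0}

path, trace = astar_trace(graph, heuristic, "A", "D")
training_text = "\n".join(trace)  # one linearized search trace as plain text
print(training_text)
```

The point of the sketch is only that every exploration step, including pushes that later get superseded, ends up in the target text, so a model trained on such sequences sees the search process and not just the final path.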
But internalization doesn't make timing free. Two notes show that search at inference and thinking at inference follow the same scaling pattern: deep-research agents improve with more search steps along a monotonic-then-diminishing curve that mirrors the reasoning-token curve (see: Do search steps follow the same scaling rules as reasoning tokens?; Does search budget scale like reasoning tokens for answer quality?). So even a model that has internalized *how* to search still faces a budget question about *how much*. And budget cuts both ways: past a critical token count, accuracy doesn't plateau, it drops, falling from 87.3% to 70.3% as thinking tokens climbed from ~1,100 to ~16K (see: Does more thinking time always improve reasoning accuracy?; When does thinking too much actually hurt reasoning?). More internal search, run too long, manufactures self-revision errors instead of answers.
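The budget question can be framed as a small computation over a thinking-budget sweep: find the budget with the best accuracy and measure what is lost by running past it. The sketch below uses only the two endpoints reported in the notes; the helper names `critical_budget` and `overthinking_penalty` are hypothetical, and a real sweep would need intermediate points to locate the peak.

```python
def critical_budget(sweep):
    """Return the (tokens, accuracy) pair with the highest accuracy."""
    return max(sweep, key=lambda point: point[1])


def overthinking_penalty(sweep):
    """Accuracy points lost between the best budget and the largest budget tried."""
    _, best_acc = critical_budget(sweep)
    _, largest_acc = max(sweep, key=lambda point: point[0])
    return best_acc - largest_acc


# The only two points given in the notes; the 1,100 figure is approximate.
reported = [(1_100, 87.3), (16_000, 70.3)]

best_tokens, best_acc = critical_budget(reported)
penalty = overthinking_penalty(reported)
print(best_tokens, best_acc, round(penalty, 1))
```

With just these two points the "critical" budget is simply the smaller one; the sketch shows the shape of the calculation, not where the true peak lies.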
The quality of that internal search turns out to be a training artifact, not a fixed property. Vanilla models use extended thinking *counterproductively*: the thinking mechanism induces self-doubt that degrades performance, until RL training flips the same mechanism into productive gap analysis (see: Does extended thinking help or hurt model reasoning?). So post-training doesn't just decide whether search is internalized; it decides whether spending more time helps or hurts. That reframes the overthinking findings: the token threshold where reasoning collapses is itself movable by how you trained the model to use its thinking budget.
Underneath all of this is a quieter claim about timing that the question almost dares you to assume: that post-training is where reasoning is *created*. The corpus pushes back. One line of work finds that five independent methods all merely *elicit* reasoning already latent in base-model activations, so post-training selects rather than builds (see: Do base models already contain hidden reasoning ability?). Another pushes the timing earlier still, planting chain-of-thought during pretraining itself using information-gain rewards (see: Can chain-of-thought reasoning be learned during pretraining itself?). And analysis of pretraining documents suggests what's being internalized is broad *procedural* knowledge, meaning transferable how-to patterns, not memorized facts (see: Does procedural knowledge drive reasoning more than factual retrieval?). If search-like procedure is laid down that early, post-training's job is more about timing and routing than installation.
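The information-gain reward for pretraining-time chain-of-thought can be sketched as a per-token difference in log-likelihood: how much more probable the observed next token becomes once the model has produced a thought. This is a minimal sketch assuming the improvement is measured against a no-thought baseline; the function name and the example probabilities are hypothetical, and the notes only say the reward is a verifier-free log-likelihood improvement.

```python
import math


def information_gain_reward(logp_with_thought, logp_without_thought):
    """Per-token reward: log p(token | context, thought) - log p(token | context).

    Positive values mean the thought made the observed token more likely.
    """
    if len(logp_with_thought) != len(logp_without_thought):
        raise ValueError("both sequences must score the same tokens")
    return [w - wo for w, wo in zip(logp_with_thought, logp_without_thought)]


# Hypothetical probabilities for two observed tokens.
with_thought = [math.log(0.50), math.log(0.20)]
without_thought = [math.log(0.25), math.log(0.20)]

rewards = information_gain_reward(with_thought, without_thought)
print(rewards)  # first token is helped by the thought, second is unaffected
```

No external verifier appears anywhere in the calculation, which is what lets a signal of this kind be applied to ordinary pretraining text.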
The useful surprise here: internalizing search and choosing how long to think aren't a trade-off you tune once. They're coupled knobs. Backward-reasoning training internalizes a consistency check that improves forward answers with *zero* test-time overhead (see: Can backward reasoning during training improve forward reasoning?), which is the dream case: capability moved entirely into training time, so inference timing barely matters. The cautionary case is that an internalized procedure can still be hollow: chain-of-thought trained on a distribution degrades predictably off it, producing fluent reasoning with no valid logic underneath (see: Does chain-of-thought reasoning actually generalize beyond training data?). Internalized search is only as good as the territory it was internalized over.
Sources (11 notes)
Meta-CoT demonstrates that instruction-tuning on linearized MCTS and A* traces teaches models to implement search strategies internally. This enables optimization over algorithms themselves rather than specific outputs, potentially unlocking novel reasoning strategies.
Deep research agents improve with more search steps in a pattern mirroring the reasoning-token relationship, with both exhibiting diminishing returns. This reveals a new inference-compute axis beyond model capability alone.
Agentic deep research shows monotonic-to-diminishing-returns curves for search iterations, matching reasoning token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
Empirical studies demonstrate non-monotonic scaling in test-time reasoning: accuracy peaks at a critical thinking-token count, then declines sharply (87.3% to 70.3% as tokens scale from 1,100 to 16,000). Extended thinking inflates output variance and introduces self-revision errors rather than improving solution quality.
Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.
Training models simultaneously on forward reasoning, backward question generation, and backward reasoning improves forward-only performance by 13.53% average across 12 datasets. The mechanism: generating backward questions forces models to understand the inverse relationship between problem and solution, deepening understanding that transfers to forward reasoning without test-time overhead.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.