Can models learn to internalize search algorithms through training?

Can chain-of-thought reasoning be taught as an explicit search process that models learn to implement internally? This matters because it could unlock algorithmic optimization rather than just output optimization.

Synthesis note · 2026-02-23 · sourced from Inference time scaling

Standard chain-of-thought produces a reasoning trace. Meta-CoT asks a different question: what search process generates that trace? The framework draws from dual-process theory — CoT is System 1 (pattern-completed reasoning), while Meta-CoT is System 2 (deliberate search over reasoning strategies). The claim is that state-of-the-art models like o1 and DeepSeek-R1 already exhibit behaviors consistent with in-context search: they explore multiple paths, backtrack, and select among candidate reasoning chains rather than generating a single trace sequentially.

The training pipeline makes the internalization concrete: (1) generate linearized search traces by running MCTS or A* on reasoning problems, (2) instruction-tune on these traces so the model learns the structure of search, and (3) apply RL post-training to refine the search behavior. The linearized traces are the key innovation: they convert tree-structured search into sequential token predictions that autoregressive models can learn.
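Step (1) can be illustrated with a minimal sketch. It assumes a toy arithmetic task and uses plain depth-first search as a stand-in for MCTS or A*; the names `MOVES` and `linearized_dfs` and the token vocabulary (`START`, `TRY`, `DEAD_END`, `BACKTRACK`, `GOAL`) are hypothetical and not taken from the Meta-CoT work. The point is only that dead ends and backtracks stay in the sequence, so the flattened text records the search and not just the winning path.

```python
# Hypothetical toy task: reach `target` from `start` using these moves.
MOVES = {"+3": lambda x: x + 3, "*2": lambda x: x * 2}


def linearized_dfs(start: int, target: int, max_depth: int) -> list[str]:
    """Run depth-first search and record every step, dead ends included."""
    trace: list[str] = [f"START {start}"]

    def visit(value: int, depth: int) -> bool:
        if value == target:
            trace.append(f"GOAL {value}")
            return True
        # Prune: depth budget exhausted, or overshoot (moves only increase).
        if depth == max_depth or value > target:
            trace.append(f"DEAD_END {value}")
            return False
        for name, move in MOVES.items():
            child = move(value)
            trace.append(f"TRY {name} -> {child}")
            if visit(child, depth + 1):
                return True
            # The failed branch stays in the trace as an explicit token.
            trace.append(f"BACKTRACK to {value}")
        return False

    visit(start, 0)
    return trace


if __name__ == "__main__":
    # One training example would be the trace joined into a single string.
    print("\n".join(linearized_dfs(start=1, target=8, max_depth=3)))
```

Joining the returned list with newlines gives one sequential training example. A trace that kept only the successful path (`START 1`, `TRY +3 -> 4`, `TRY *2 -> 8`, `GOAL 8`) would correspond to ordinary chain-of-thought data instead.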

The speculative but important claim: if a model can learn to implement search algorithms in-context, then RL training on such a model constitutes optimization over algorithms rather than specific outputs. This could yield novel modes of problem-solving that neither symbolic tree-search nor standard CoT can achieve, because the model is not constrained by the specific search algorithm it was trained on — it can adapt and combine strategies.

This extends "Does RL teach reasoning or just when to use it?" in a significant direction: Meta-CoT proposes that search IS trainable as the "how." The timing thesis says RL teaches WHEN to reason; Meta-CoT says the reasoning process itself can be internalized through exposure to search traces. If both are correct, RL training operates at two levels: activating reasoning (timing) and shaping the reasoning process (search internalization).

However, the tension with "Does the choice of RL algorithm actually matter for reasoning?" is notable: if the pretrained prior bounds exploration, then internalized search may still be constrained by what the model already knows. Meta-CoT would need to demonstrate that linearized search traces genuinely expand the exploration boundary rather than just reorganizing existing capability.

Related concepts in this collection 5

Does RL teach reasoning or just when to use it? Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
extends: Meta-CoT proposes that search CAN be trained as the "how" component
Do base models already contain hidden reasoning ability? Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
Meta-CoT goes further: linearized traces may teach a new capability, not just unlock an existing one
Can reinforcement learning discover reasoning strategies base models cannot? Does RL training truly expand what models can do, or does it just find solutions already hidden in base models? ProRL tests this by running RL longer and on diverse tasks beyond mathematics.
supports: algorithm optimization could be the mechanism for genuine novelty
Does the choice of RL algorithm actually matter for reasoning? Expert Iteration, PPO, and RC-RL show similar performance on reasoning tasks. The question is whether algorithm choice drives results or whether something deeper—like the pretrained model itself—sets the real limits.
tension: Meta-CoT claims search is trainable but prior-boundedness may constrain what internalized search can discover
Can models learn reasoning from predicting any text? Does training rationale generation at every token position on arbitrary internet text enable general reasoning without task-specific supervision? This challenges the assumption that reasoning requires curated QA datasets.
complementary internalization approaches: Quiet-STaR internalizes rationale generation at every token during pretraining, while Meta-CoT internalizes search algorithms via linearized traces during post-training — both aim to embed reasoning into the forward pass but at different granularities (token-level prediction vs. trace-level search strategy)

Original note title

meta-cot frames chain-of-thought production as a search problem that models can learn to internalize
