SYNTHESIS NOTE
Training, RL, and Test-Time Scaling · Reasoning, Retrieval, and Evaluation · Model Architecture and Internals

Can models learn to internalize search algorithms through training?

Can chain-of-thought reasoning be taught as an explicit search process that models learn to implement internally? This matters because it could unlock algorithmic optimization rather than just output optimization.

Synthesis note · 2026-02-23 · sourced from Inference time scaling
How should we allocate compute budget at inference time? How should researchers navigate LLM reasoning research?

Standard chain-of-thought produces a reasoning trace. Meta-CoT asks a different question: what search process generates that trace? The framework draws from dual-process theory — CoT is System 1 (pattern-completed reasoning), while Meta-CoT is System 2 (deliberate search over reasoning strategies). The claim is that state-of-the-art models like o1 and DeepSeek-R1 already exhibit behaviors consistent with in-context search: they explore multiple paths, backtrack, and select among candidate reasoning chains rather than generating a single trace sequentially.

The training pipeline makes the internalization concrete: (1) generate linearized search traces by running MCTS or A* on reasoning problems, (2) instruction-tune on these traces so the model learns the structure of search, and (3) apply RL post-training to refine the search behavior. The linearized traces are the key innovation: they convert tree-structured search into a flat token sequence that an autoregressive model can learn through next-token prediction.
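To make step (1) more tangible, here is a minimal sketch of what linearizing a search could look like. It is an illustration under stated assumptions, not the Meta-CoT authors' procedure: it swaps in a depth-limited depth-first search for MCTS or A* to stay short, the arithmetic task is a toy, and the names `OPS`, `linearize_dfs`, and the event vocabulary (`TRY`, `DEAD_END`, `BACKTRACK`, `SOLVED`) are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical toy task: reach `target` from `start` using these operations.
OPS: List[Tuple[str, Callable[[int], int]]] = [
    ("+3", lambda x: x + 3),
    ("*2", lambda x: x * 2),
]


def linearize_dfs(
    start: int, target: int, max_depth: int
) -> Tuple[List[str], Optional[List[str]]]:
    """Run depth-limited DFS and log every search event as one flat list.

    Assumption: DFS stands in for MCTS/A* purely for brevity. The pruning
    rule (value > target) is only valid because both operations increase
    a positive value.
    """
    trace: List[str] = [f"START {start}", f"GOAL {target}"]

    def visit(value: int, path: List[str], depth: int) -> Optional[List[str]]:
        if value == target:
            trace.append(f"SOLVED {' '.join(path) if path else '(empty)'}")
            return list(path)
        if depth == max_depth or value > target:
            trace.append(f"DEAD_END {value}")
            return None
        for name, op in OPS:
            nxt = op(value)
            trace.append(f"TRY {value} {name} -> {nxt}")
            found = visit(nxt, path + [name], depth + 1)
            if found is not None:
                return found
            # The failed branch stays in the trace; only the search moves on.
            trace.append(f"BACKTRACK {value}")
        return None

    solution = visit(start, [], 0)
    return trace, solution


if __name__ == "__main__":
    trace, solution = linearize_dfs(start=1, target=8, max_depth=3)
    print("\n".join(trace))  # one candidate training document
    print("solution:", solution)
```

The trace keeps the dead ends and backtracking steps alongside the successful path. A model fine-tuned on such sequences sees the exploration itself, not only the final answer, and that is the property the pipeline depends on.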

The speculative but important claim: if a model can learn to implement search algorithms in-context, then RL training on such a model constitutes optimization over algorithms rather than specific outputs. This could yield novel modes of problem-solving that neither symbolic tree-search nor standard CoT can achieve, because the model is not constrained by the specific search algorithm it was trained on — it can adapt and combine strategies.

This extends "Does RL teach reasoning or just when to use it?" in a significant direction: Meta-CoT proposes that the "how" of reasoning, the search process itself, is also trainable. The timing thesis says RL teaches WHEN to reason; Meta-CoT says the reasoning process itself can be internalized through exposure to search traces. If both are correct, RL training operates at two levels: activating reasoning (timing) and shaping the reasoning process (search internalization).

However, there is a notable tension with "Does the choice of RL algorithm actually matter for reasoning?": if the pretrained prior bounds exploration, then internalized search may still be constrained by what the model already knows. Meta-CoT would need to demonstrate that training on linearized search traces genuinely expands the exploration boundary rather than just reorganizing existing capability.

Original note title

meta-cot frames chain-of-thought production as a search problem that models can learn to internalize