Do base models already contain hidden reasoning ability?

Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.

Synthesis note · 2026-02-22 · sourced from Reasoning Architectures

Five convergent findings build a strong case that reasoning capability is primarily a pre-training phenomenon:

Finding 1 (Base Models paper): Base models already spontaneously demonstrate strong reasoning capabilities and "aha moment" self-reflection patterns when sampled sufficiently. Reasoning traces generated by RL-fine-tuned models are already present in base model outputs — they just appear with lower frequency. RL biases generation toward high-reward patterns; it doesn't create new patterns.

Finding 2 (Steering): A hybrid model using base model weights + thinking model steering vectors recovers 91% of the performance gap to thinking models while steering only 12% of tokens. The reasoning mechanisms (backtracking, uncertainty estimation, subgoal-setting) already exist as directions in the base model's activation space.
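A minimal sketch of what steering only a fraction of tokens can look like, assuming steering means adding a scaled direction to the hidden state at selected positions. The function name `steer`, the strength `alpha`, the toy dimensions, and the random vectors are illustrative assumptions, not the method or values from the steering work cited above.

```python
import numpy as np

def steer(hidden, direction, alpha, mask):
    """Add alpha * unit(direction) to the hidden states at masked positions only.

    hidden:    (num_tokens, dim) array of activations (toy stand-in for a real layer)
    direction: (dim,) steering vector (hypothetical; a real one would be extracted
               from a thinking model)
    mask:      (num_tokens,) boolean array selecting which tokens to steer
    """
    unit = direction / np.linalg.norm(direction)
    steered = hidden.copy()          # leave the original activations untouched
    steered[mask] += alpha * unit    # unmasked positions keep base-model values
    return steered

rng = np.random.default_rng(0)
hidden = rng.normal(size=(8, 16))    # 8 token positions, 16-dim hidden state
direction = rng.normal(size=16)

mask = np.zeros(8, dtype=bool)
mask[2] = True                       # steer 1 of 8 tokens (12.5%)

steered = steer(hidden, direction, alpha=4.0, mask=mask)
fraction_steered = mask.mean()
```

The point of the design is that the base weights are never modified: the intervention is a small additive nudge on a few positions, which is why a large recovered gap suggests the direction already means something to the base model.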

Finding 3 (CFT/RLVR): Critique Fine-Tuning on a single problem can unlock reasoning potential at RLVR-level effectiveness. By exposing the model to diverse critiques of varied incorrect solutions to one problem, CFT activates reasoning patterns already latent in the base model without requiring hundreds of GPU hours of RL training.

Finding 4 (CoT-Decoding): Pre-trained LLMs inherently contain CoT reasoning paths that can be elicited simply by altering the decoding procedure. Rather than greedy decoding, inspecting top-k alternative tokens reveals that CoT paths are frequently present in the model's probability distribution. A confidence metric differentiates CoT from non-CoT paths — the model shows increased confidence in its final answer when a CoT reasoning path is present. This is entirely unsupervised, requiring no prompting, tuning, or training modifications — purely a decoding change. CoT-decoding adds a fourth mechanism to the latent capability evidence: RL steering, CFT, RLVR, and now decoding all unlock reasoning already present.
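A minimal sketch of how a decoding-time confidence score could separate candidate paths. It assumes the confidence is the average gap between the top-1 and top-2 token probabilities over the answer tokens, which is one common formulation and is not spelled out in this note. The candidate paths and their probabilities are made-up toy values, and `answer_confidence` and `pick_path` are hypothetical helper names.

```python
def answer_confidence(answer_token_probs):
    """Mean gap between the top-1 and top-2 probabilities at each answer token."""
    margins = []
    for probs in answer_token_probs:
        top1, top2 = sorted(probs, reverse=True)[:2]
        margins.append(top1 - top2)
    return sum(margins) / len(margins)

def pick_path(paths):
    """Return the candidate path whose answer tokens have the highest confidence."""
    return max(paths, key=lambda p: answer_confidence(p["answer_token_probs"]))

# Toy candidates, one per alternative first token (the top-k branches).
# Each inner list is the next-token distribution at one answer position.
paths = [
    {"text": "5", "answer_token_probs": [[0.40, 0.35, 0.25]]},
    {"text": "I start with 3 apples and get 5 more, so 8",
     "answer_token_probs": [[0.90, 0.06, 0.04]]},
]

best = pick_path(paths)
```

In this toy case the path that reasons before answering is selected because its answer token is far more peaked, which mirrors the observation that confidence rises when a CoT path is present.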

Finding 5 (SAE Reasoning Steering): Sparse Autoencoders decompose model activations into interpretable features, revealing latent features causally associated with reasoning behavior. Steering a single identified reasoning feature at the first generation step matches or exceeds CoT performance across six model families up to 70B parameters, without any explicit CoT prompting. The reasoning mode triggers early in generation and is robust enough to override prompt-level \no_think instructions. This is the most direct mechanistic evidence yet: the capability is not just present (as CoT-decoding shows) but causally controllable through a single latent dimension. See Can we trigger reasoning without explicit chain-of-thought prompts? Together with CoT-decoding (Finding 4), this establishes five independent elicitation mechanisms, all converging on the same latent capability: RL steering, CFT, RLVR, decoding, and SAE feature steering.

The synthesis: post-training methods are selectors, not creators. They select which of the base model's latent capabilities to express reliably in context. The implication is that the main bottleneck for reasoning is not capability acquisition (which happens during pre-training on the world's text) but capability elicitation.

RLVR evidence deepens this: two additional findings from the RLVR literature reinforce the latent-capability thesis. First, 1-shot RLVR achieves a roughly 37-point jump on MATH500 (36% to 73.6%) from a single training example. After the model perfectly memorizes its one example, test accuracy continues improving for 1,400 more steps, a pattern called post-saturation generalization: the data is exhausted, but activation continues. See Can a single training example unlock mathematical reasoning? Second, spurious rewards (random, incorrect, or format-only) improve Qwen2.5-Math nearly as much as correct rewards (~21-25% improvement), yet the same spurious rewards fail completely for Llama3.1 and OLMo2. The differentiating variable is not reward quality but pretraining: Qwen's code-reasoning pretraining creates latent capability that any optimization pressure can activate. See Why do random rewards improve reasoning for some models but not others? Together with the pass@k finding that RLVR narrows capability scope rather than expanding it, the evidence converges: RLVR is a catalyst that triggers a phase transition from broad pretraining distribution to reliable sampling of correct answers.
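To make the pass@k argument concrete, here is a short sketch using the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The sample counts for the two models are hypothetical numbers chosen only to illustrate the shape of the comparison; they are not results from any paper cited in this note.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k from n samples with c correct ones."""
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

n = 256
base_correct = 4     # hypothetical: base model rarely samples a correct answer
tuned_correct = 128  # hypothetical: tuned model samples it half the time

base_at_1 = pass_at_k(n, base_correct, 1)
tuned_at_1 = pass_at_k(n, tuned_correct, 1)
base_at_64 = pass_at_k(n, base_correct, 64)
```

With these toy counts the base model looks far weaker at k = 1 but recovers most of the gap at k = 64. That is the signature of a capability that is present but sampled unreliably.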

This partially contradicts Can simple rewards alone teach complex domain reasoning?, which documents genuine capability emergence in domain-specialized contexts (medical, mathematical). The reconciliation: what looks like emergence may be the reliable expression of latent capability, not creation from scratch. The distinction matters for research direction: if the capability already exists, investment in RL may be better directed toward elicitation methods.

The implication for Can prompt optimization teach models knowledge they lack? is that the same principle, activating what pretraining already provided rather than injecting something new, extends to reasoning capability, not just knowledge.

Related concepts in this collection 16

Can simple rewards alone teach complex domain reasoning? Does reinforcement learning on difficult problems with basic accuracy rewards produce sophisticated reasoning strategies without explicit chain-of-thought training? This challenges assumptions about what domain AI models need to learn effectively.
partially contradicted: "emergence" may be reliable expression of latent capability, not creation
Does RL teach reasoning or just when to use it? Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
mechanism: if base models have capability, RL teaches timing of deployment
Can prompt optimization teach models knowledge they lack? Explores whether sophisticated prompting techniques can inject new domain knowledge into language models, or if they're limited to activating existing training knowledge.
extends to reasoning capability not just knowledge
Can non-reasoning models catch up with more compute? Explores whether inference-time compute budget can close the performance gap between standard models and those trained for reasoning, and what training mechanisms might enable this.
qualified: targeted activation methods can close most of the gap
Can a single training example unlock mathematical reasoning? Explores whether one example is enough to dramatically improve math problem-solving in language models, and whether learning continues after perfect memorization.
strongest evidence: one example activates 37-point gain with continued generalization
Why do random rewards improve reasoning for some models but not others? When RLVR training uses meaningless reward signals, some models gain reasoning improvements while others don't. What determines which models can benefit from optimization pressure without meaningful feedback?
pretraining determines activation potential; reward signal is the catalyst, not the teacher
Does RLVR actually expand what models can reason about? Explores whether reinforcement learning from verifiable rewards teaches models genuinely new reasoning skills or simply makes existing capabilities more reliable. Pass@k analysis suggests the latter.
pass@k confirms RLVR selects from existing capability, does not create new
Does procedural knowledge drive reasoning more than factual retrieval? Explores whether models learn reasoning through general procedures across diverse documents rather than memorizing specific facts. This matters for understanding what pretraining data actually teaches models to reason.
identifies what the latent capability consists of: procedural knowledge synthesized from diverse pretraining documents that demonstrates how to reason, not what to recall; this is what minimal training signals activate
Can models learn when to think versus respond quickly? Explores whether a single language model can adaptively choose between extended reasoning and direct responses based on task difficulty. This matters because it could make inference more efficient by allocating compute only when needed.
concrete implementation of the latent-capability thesis: Thinkless trains only a routing token via DeGRPO, not reasoning capability; the design premise is that capability is already present and what's needed is adaptive activation
Can models learn to internalize search algorithms through training? Can chain-of-thought reasoning be taught as an explicit search process that models learn to implement internally? This matters because it could unlock algorithmic optimization rather than just output optimization.
extends beyond activation: Meta-CoT claims linearized search traces can teach genuinely new search capability, not just unlock existing patterns — testing the boundary of the latent-capability thesis
Does reinforcement learning on theory of mind collapse with model scale? When RL improves social reasoning, does the quality of reasoning depend on model size? The question matters because accuracy alone may hide whether models are actually thinking or just pattern-matching.
the scale-dependent finding adds a social-reasoning dimension: 7B models have latent ToM capability that RL can activate, but smaller models lack sufficient latent capacity for social reasoning, suggesting a domain-specific threshold below which the latent-capability thesis does not hold
Does reinforcement learning update only a small fraction of parameters? Investigating whether RL algorithms consistently modify only 5–30% of model parameters across different LLMs and RL methods, and what structural properties those sparse updates possess.
parametric signature of latent capability: RL touches only 5-30% of parameters because the rest already encode adequate reasoning; the sparsity is intrinsic and consistent across 7 algorithms and 10 models, confirming capability preexists in the weights
Can next-token prediction become a reasoning task with RL? Does reinforcement learning applied to next-token prediction during pretraining encourage genuine reasoning rather than surface memorization? This matters because it could unlock reasoning capability without requiring labeled data or human feedback.
strengthens the foundation: RPT may create stronger latent capabilities than standard pretraining by embedding RL reasoning patterns during pretraining itself, making the subsequent minimal-signal activation even more effective
Can models improve themselves on tasks without verifiable answers? Most self-improvement methods require verifiable correctness signals like math or code. Can models improve on open-ended instruction tasks where right answers aren't automatically checkable? And what minimal training is needed to unlock this?
extends the minimal-signal thesis to general instruction tasks: 1000 demonstrations of reasoning enrichment are sufficient to enable iterative self-improvement, consistent with the latent capability thesis — the catalyst teaches articulation of reasoning, not reasoning itself
Can careful selection of 78 demos outperform massive training datasets? Does strategic curation of high-quality demonstrations unlock agentic capability more efficiently than scaling training data? LIMI achieved 73.5% on AgencyBench with 78 samples versus 10K+ samples for competing models, suggesting data quality may matter more than quantity.
extends the latent-capability thesis from reasoning to autonomous agency: 78 curated trajectories outperform 10K+ samples, suggesting agentic behavior is also a latent capability that minimal signals can activate
Can we trigger reasoning without explicit chain-of-thought prompts? This research asks whether models possess latent reasoning capabilities that can be activated through direct feature steering, independent of chain-of-thought instructions. Understanding this matters for making reasoning more efficient and controllable.
most direct mechanistic evidence: single latent feature causally controls reasoning activation across 6 model families up to 70B

Original note title

base models already possess latent reasoning capability that minimal training signals can unlock
