Can minimal training signals unlock reasoning already latent in pretrained representations?
This explores whether the reasoning ability LLMs show is something they already have from pretraining, waiting to be switched on by a small nudge, rather than a new skill that heavy training has to build from scratch. The corpus comes down surprisingly hard on the "already there" side. The clearest statement is that five independent mechanisms all elicit reasoning that already lives in base-model activations: RL steering, critique fine-tuning, decoding changes, sparse-autoencoder (SAE) feature steering, and reinforcement learning with verifiable rewards (RLVR). Post-training therefore *selects* reasoning rather than *creating* it (see: Do base models already contain hidden reasoning ability?). If that framing is right, the bottleneck was never capability acquisition; it was elicitation.
The most striking evidence for how *minimal* the signal can be is a single feature, identified inside the model with a sparse autoencoder, that can be steered to match or beat full chain-of-thought prompting across six model families. It fires early in generation and even overrides surface instructions (see: Can we trigger reasoning without explicit chain-of-thought prompts?). Reasoning verbosity turns out to be a similarly linear, steerable direction: one vector pulled from just 50 paired examples cuts chain-of-thought length by 67% with no retraining (see: Can we steer reasoning toward brevity without retraining?). And you don't even need to intervene on the model's internals: four modular "cognitive tools" implemented as sandboxed LLM calls lifted GPT-4.1 on AIME 2024 from 26.7% to 43.3% with zero RL, by enforcing the operation isolation that pure prompting can't guarantee (see: Can modular cognitive tools unlock reasoning without training?). The recurring pattern is that the trigger is small and the capability is pre-existing.
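To make the "one vector from 50 paired examples" idea concrete, here is a minimal sketch of activation steering. It assumes the vector is extracted as a difference of mean activations between paired verbose and concise examples, which is a common recipe but not necessarily the exact procedure the cited note uses. The arrays are synthetic stand-ins for real hidden states, and `steering_vector`, `steer`, and `alpha` are hypothetical names chosen for illustration.

```python
import numpy as np

def steering_vector(verbose_acts, concise_acts):
    """Difference of mean activations (concise minus verbose).

    Assumption: each row is a hidden state captured at the same layer
    for one example of a verbose/concise pair.
    """
    verbose_acts = np.asarray(verbose_acts, dtype=float)
    concise_acts = np.asarray(concise_acts, dtype=float)
    return concise_acts.mean(axis=0) - verbose_acts.mean(axis=0)

def steer(hidden, vector, alpha=1.0):
    """Shift a hidden state along the steering direction by strength alpha."""
    return np.asarray(hidden, dtype=float) + alpha * vector

# Synthetic data: 50 pairs whose members differ along one known direction.
rng = np.random.default_rng(0)
d_model = 16
true_direction = np.zeros(d_model)
true_direction[3] = 1.0

verbose = rng.normal(size=(50, d_model))
concise = verbose + 2.0 * true_direction  # each pair shares the same base state

v = steering_vector(verbose, concise)

# Apply the vector to a fresh hidden state at inference time; no weights change.
h = rng.normal(size=d_model)
h_steered = steer(h, v, alpha=0.5)
```

In a real model the shift would be applied inside a forward pass at a chosen layer, and `alpha` would be tuned so that outputs get shorter without losing accuracy. The point of the sketch is only that the intervention is a single vector addition.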
But "latent and waiting" raises an obvious follow-up: latent how, and put there by what? Several notes argue that this latent capacity is itself a product of pretraining choices. Reasoning generalization rides on broad, transferable *procedural* knowledge spread across many documents, unlike factual recall, which depends on narrow memorization, so the raw material for reasoning is laid down diffusely during pretraining (see: Does procedural knowledge drive reasoning more than factual retrieval?). You can also build the reasoning *into* pretraining directly: treating chain-of-thought as an exploratory action rewarded by information gain lifts math and science benchmarks by about 19% (see: Can chain-of-thought reasoning be learned during pretraining itself?), and looped architectures that iterate in latent space get 2–3× efficiency without extra capacity (see: Can reasoning happen in latent space during pretraining?). Energy-based transformers push this furthest, reaching System-2-style deliberation from unsupervised learning alone, with no domain-specific scaffolding (see: Can energy minimization unlock reasoning without domain-specific training?).
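The "rewarded by information gain" idea can be illustrated with a small sketch. It assumes the reward is the improvement in log-likelihood of the observed next token when the model conditions on a sampled thought versus when it does not; the function name and the probabilities below are hypothetical, and a real setup would read these values from model logits.

```python
import math

def information_gain_reward(p_with_thought, p_without_thought):
    """Log-likelihood improvement on the observed token.

    Positive when the thought made the true token more likely,
    zero when it made no difference, negative when it hurt.
    No external verifier is needed, only the model's own probabilities.
    """
    for p in (p_with_thought, p_without_thought):
        if not 0.0 < p <= 1.0:
            raise ValueError("probabilities must lie in (0, 1]")
    return math.log(p_with_thought) - math.log(p_without_thought)

# Illustrative values only.
helpful = information_gain_reward(0.6, 0.3)  # thought doubled the probability
useless = information_gain_reward(0.3, 0.3)  # thought changed nothing
harmful = information_gain_reward(0.1, 0.3)  # thought misled the model
```

Because the signal comes from ordinary text rather than graded answers, a reward of this shape can in principle be computed on any pretraining document, which is what lets reasoning be trained before post-training begins.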
Here's the unsettling part, and the thing you might not have known you wanted to know: if a tiny signal unlocks reasoning, maybe what's being unlocked isn't "reasoning" in the strong sense at all. Models trained on reasoning traces that are *deliberately corrupted* or irrelevant to the problem perform about as well as those trained on correct ones, sometimes generalizing better, which suggests the traces work as computational scaffolding, not as meaningful logic (see: Do reasoning traces need to be semantically correct?). Chain-of-draft matches standard chain-of-thought accuracy at 7.6% of the tokens, because most of the words were style and documentation, not computation (see: Can minimal reasoning chains match full explanations?). And when semantic content is stripped from a task, performance collapses even with correct rules in hand: LLMs reason through learned associations, not symbolic manipulation (see: Do large language models reason symbolically or semantically?), reproducing familiar schemata that degrade predictably under distribution shift (see: Does chain-of-thought reasoning reveal genuine inference or pattern matching?). So the answer is yes, minimal signals reliably unlock something latent, but the corpus quietly reframes the question: the "reasoning" you elicit so cheaply may be a pattern already compiled into the weights, which is exactly why so little is needed to switch it on.
Sources (12 notes)
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
SAE-identified reasoning features can be directly steered to match or exceed chain-of-thought performance across six model families. This reasoning mode activates early in generation and overrides surface-level instructions, suggesting latent reasoning is a fundamental capability independent of explicit prompting.
Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.
Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.
RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.
Ouro models achieve 2–3× efficiency gains by performing iterative reasoning in latent space during pretraining, not through extra capacity. Their intermediate predictions align faithfully with final outputs, making latent traces more honest than explicit chain-of-thought reasoning.
Energy-Based Transformers assign energy values to input-prediction pairs and use gradient descent minimization for inference, yielding 35% higher training scaling rates and 29% more inference-compute gains than Transformer++, while generalizing better on out-of-distribution data without domain-specific scaffolding.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.