Can targeted activation steering surface latent reasoning in base models?

This explores whether you can directly intervene in a base model's internal activations to unlock reasoning it already holds — and whether that reasoning was there all along, waiting to be switched on rather than taught.

The corpus answers with a fairly strong yes, while reframing what "steering" even means. The central claim across several notes is that post-training doesn't create reasoning; it selects and elicits reasoning the base model already contains. One synthesis note, "Do base models already contain hidden reasoning ability?", gathers five independent mechanisms, namely RL steering, critique fine-tuning, decoding changes, sparse-autoencoder (SAE) feature steering, and RLVR, that all converge on the same conclusion: the bottleneck is elicitation, not capability acquisition. So the question isn't really "can we add reasoning?" but "which knob switches on what's already there?"

The most direct evidence that activation steering works comes from SAE feature manipulation. Steering a single SAE-identified reasoning feature can match or exceed chain-of-thought performance across six model families, and this mode activates early in generation and overrides surface-level prompt instructions (see "Can we trigger reasoning without explicit chain-of-thought prompts?"). That suggests the reasoning direction is a fundamental, low-dimensional axis rather than a fragile prompt artifact. Steering isn't limited to switching reasoning on or off, either: verbose and concise chain-of-thought occupy distinct linear regions of activation space, and a single vector extracted from just 50 paired examples can cut reasoning length by 67% without losing accuracy, with no training required (see "Can we steer reasoning toward brevity without retraining?"). The reasoning lives in geometry you can push along.
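The idea of a single vector extracted from paired examples can be made concrete with a small sketch. What follows is a generic difference-of-means construction on synthetic data, not the method of the cited notes: the function names `steering_vector` and `apply_steering`, the scale `alpha`, the dimension `d`, and the random "activations" are all assumptions made for illustration. A real setup would read activations from a chosen layer of an actual model.

```python
import numpy as np

def steering_vector(source_acts, target_acts):
    """Difference of mean activations between two paired sets; shape (d,)."""
    source_acts = np.asarray(source_acts, dtype=float)
    target_acts = np.asarray(target_acts, dtype=float)
    return target_acts.mean(axis=0) - source_acts.mean(axis=0)

def apply_steering(hidden, vector, alpha=1.0):
    """Shift each hidden state along the steering vector by a factor alpha."""
    return np.asarray(hidden, dtype=float) + alpha * vector

# Synthetic stand-in for 50 paired activations at one layer (hypothetical data).
rng = np.random.default_rng(0)
d = 16
direction = np.zeros(d)
direction[0] = 1.0                      # assumed "concise" axis, for illustration
verbose = rng.normal(size=(50, d))      # activations on verbose traces
concise = verbose + 2.0 * direction     # paired traces, shifted along one axis

v = steering_vector(verbose, concise)   # recovers the shift between the two sets
steered = apply_steering(verbose, v)    # verbose states pushed toward concise
```

Because the synthetic pairs differ only by a fixed offset, the recovered vector is that offset. With real activations the difference of means is only an estimate of such a direction, and `alpha` would need tuning.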

What's striking is how many non-steering methods land in the same place from different directions. RL post-training, the corpus argues, teaches a model *when* to reason, not *how*: hybrid models recover 91% of the gains by routing tokens alone, and the activation vectors for reasoning strategies already exist before any RL touches the weights (see "Does RL post-training create reasoning or just deploy it?"). RLVR similarly improves sampling efficiency within existing capability boundaries rather than expanding them; a single training example suffices, and even spurious rewards work nearly as well as correct ones for models with appropriate pretraining (see "What does reward learning actually do to model reasoning?"). Even prompting and modular scaffolding fit the pattern: four cognitive tools implemented as isolated LLM calls lifted GPT-4.1 on AIME 2024 from 26.7% to 43.3% with no RL, by enforcing the operation isolation that pure prompting can't guarantee (see "Can modular cognitive tools unlock reasoning without training?"). These are all forms of elicitation; they just reach the latent capability through prompts, tools, or rewards instead of through the activation vector directly.
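One way to picture "sampling efficiency within existing capability boundaries" is a toy calculation, assuming each sample succeeds independently with a fixed probability p. That independence assumption, the function `pass_at_k`, and the two success rates below are hypothetical and are not taken from the cited notes.

```python
def pass_at_k(p, k):
    """Chance that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p) ** k

# Hypothetical per-sample success rates before and after reward training.
base_p, tuned_p = 0.10, 0.60

for k in (1, 8, 64):
    # Columns: sample budget, base model, tuned model.
    print(k, round(pass_at_k(base_p, k), 3), round(pass_at_k(tuned_p, k), 3))

# A problem the model can never solve stays unsolved at any budget.
unreachable = pass_at_k(0.0, 64)
```

Under these assumptions the tuned model mostly gains at small k, while the base model closes the gap as the sample budget grows, and a problem with p = 0 stays out of reach either way.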

But the corpus also draws a hard boundary around what steering *can't* do, and this is where it gets interesting. Elicitation only works on what pretraining deposited. Prompt optimization can retrieve existing knowledge but cannot inject knowledge the model never had, a hard ceiling no prompt strategy escapes (see "Can prompt optimization teach models knowledge they lack?"). And the reasoning you surface is semantic, not symbolic: when meaning is stripped from a task, LLM performance collapses even with correct rules in context, because models lean on parametric commonsense and token associations rather than formal logic (see "Do large language models reason symbolically or semantically?"). The substrate that makes steering possible is itself laid down in pretraining: reasoning generalization traces back to broad, transferable procedural knowledge spread across diverse documents (see "Does procedural knowledge drive reasoning more than factual retrieval?"). So steering is a flashlight, not a printing press: it can only illuminate rooms pretraining already built.

If you want to wander past steering itself, the corpus points to a different frontier: making the latent reasoning space richer rather than just selecting within it. Energy-based transformers reach "system 2" deliberation through gradient-descent energy minimization at inference, with no domain scaffolding (see "Can energy minimization unlock reasoning without domain-specific training?"). GRAM replaces deterministic latent updates with stochastic ones, so a model can hold a distribution over solutions and explore multiple strategies (see "Can stochastic latent reasoning help models explore multiple solutions?"), and it scales reasoning in width, by sampling parallel latent trajectories, rather than only in depth (see "Can reasoning systems scale wider instead of only deeper?"). The thread connecting all of it: reasoning is increasingly treated as a property of the activation geometry you can probe, steer, and sample, not a skill bolted on after the fact.
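Inference by energy minimization can be illustrated with a deliberately tiny example: the prediction is found by gradient descent on an energy over input-prediction pairs, rather than by a single forward pass. The quadratic energy, the matrix `W`, the step size, and the function names below are assumptions chosen so the example is easy to check. This is not the Energy-Based Transformer architecture itself.

```python
import numpy as np

def energy(x, y, W):
    """Toy energy of an (input, prediction) pair: low when y is close to W @ x."""
    return 0.5 * float(np.sum((y - W @ x) ** 2))

def energy_grad_y(x, y, W):
    """Gradient of the toy energy with respect to the prediction y."""
    return y - W @ x

def infer_by_minimization(x, W, steps=50, lr=0.5):
    """Start from a zero guess and refine it by gradient descent on the energy."""
    y = np.zeros(W.shape[0])
    trace = [energy(x, y, W)]
    for _ in range(steps):
        y = y - lr * energy_grad_y(x, y, W)
        trace.append(energy(x, y, W))
    return y, trace

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # hypothetical fixed "model" parameters
x = rng.normal(size=4)        # hypothetical input
y_hat, trace = infer_by_minimization(x, W)
```

In this sketch `steps` plays the role of inference-time compute: more steps push the energy lower and the prediction closer to the minimizer, which here is simply `W @ x`.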

Sources 12 notes

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Can we trigger reasoning without explicit chain-of-thought prompts?

SAE-identified reasoning features can be directly steered to match or exceed chain-of-thought performance across six model families. This reasoning mode activates early in generation and overrides surface-level instructions, suggesting latent reasoning is a fundamental capability independent of explicit prompting.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Can modular cognitive tools unlock reasoning without training?

Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Can energy minimization unlock reasoning without domain-specific training?

Energy-Based Transformers assign energy values to input-prediction pairs and use gradient descent minimization for inference, yielding 35% higher training scaling rates and 29% more inference-compute gains than Transformer++, while generalizing better on out-of-distribution data without domain-specific scaffolding.

Can stochastic latent reasoning help models explore multiple solutions?

GRAM replaces deterministic latent updates with stochastic sampling, enabling models to represent distributions over solutions rather than single predictions. This allows handling of ambiguous problems and multiple valid strategies that deterministic designs cannot represent.

Can reasoning systems scale wider instead of only deeper?

GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.
