Can we trigger reasoning without explicit chain-of-thought prompts?
This research asks whether models possess latent reasoning capabilities that can be activated through direct feature steering, independent of chain-of-thought instructions. Understanding this matters for making reasoning more efficient and controllable.
The method uses Sparse Autoencoders (SAEs) to decompose model activations into interpretable features, then applies a two-stage pipeline to identify latent features causally associated with reasoning behavior. First, SAEs extract sparse features from model activations, and these features are compared across CoT and non-CoT prompting conditions to find candidates. Second, targeted steering interventions modulate the candidate features and measure downstream reasoning performance.
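The first stage can be sketched as a contrast between the two prompting conditions. This is a minimal illustration, not the study's actual procedure: the ReLU encoder form, the mean-difference score, the function names, and the toy numbers are all assumptions, and plain Python lists are used to keep the sketch dependency-free.

```python
from statistics import mean


def sae_encode(x, w_enc, b_enc):
    """Toy SAE encoder: ReLU(W_enc x + b_enc), one row of w_enc per feature."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_enc, b_enc)
    ]


def rank_features_by_contrast(cot_acts, plain_acts, w_enc, b_enc):
    """Rank features by (mean activation under CoT) - (mean under non-CoT).

    Assumption: a simple mean difference stands in for whatever selection
    criterion a real pipeline would use.
    """
    cot_feats = [sae_encode(x, w_enc, b_enc) for x in cot_acts]
    plain_feats = [sae_encode(x, w_enc, b_enc) for x in plain_acts]
    n_features = len(w_enc)
    gaps = [
        mean(f[j] for f in cot_feats) - mean(f[j] for f in plain_feats)
        for j in range(n_features)
    ]
    order = sorted(range(n_features), key=lambda j: gaps[j], reverse=True)
    return order, gaps


# Hypothetical toy data: 3-dimensional activations, 3 SAE features.
W_ENC = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
B_ENC = [0.0, 0.0, 0.0]
COT_ACTS = [[0.2, 2.0, 0.1], [0.1, 3.0, 0.3]]    # activations under CoT prompts
PLAIN_ACTS = [[0.2, 0.5, 0.2], [0.1, 0.5, 0.2]]  # activations under non-CoT prompts

order, gaps = rank_features_by_contrast(COT_ACTS, PLAIN_ACTS, W_ENC, B_ENC)
print("top candidate feature:", order[0])
```

In this toy setup, feature 1 fires much more strongly under the CoT condition, so it ranks first. A contrast like this only yields candidates; the causal claim rests on the second stage, where steering is applied and performance is measured.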
The central result: steering a single reasoning-related latent feature at the first generation step substantially improves accuracy without explicit CoT prompting. For large models, latent steering achieves performance comparable to standard CoT while producing more efficient outputs: fewer tokens at comparable accuracy.
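A rough sketch of what "steering at the first generation step" could look like on a single residual-stream vector follows. The additive form (adding a scaled feature direction), the function name, and the values of `ALPHA` and `FEATURE_DIRECTION` are illustrative assumptions, not details confirmed by the source. In a real model this logic would sit inside a forward hook on the chosen layer.

```python
def steer_residual(residual, feature_direction, alpha, step):
    """Add alpha * feature_direction to the residual vector at step 0 only.

    Later generation steps pass through unchanged, mirroring the idea that
    the intervention is applied once, at the start of generation.
    """
    if step != 0:
        return list(residual)
    return [r + alpha * d for r, d in zip(residual, feature_direction)]


# Hypothetical values for illustration.
RESIDUAL = [0.5, -1.0, 2.0]
FEATURE_DIRECTION = [0.0, 1.0, 0.0]  # assumed SAE decoder direction for the feature
ALPHA = 4.0                          # assumed steering strength

steered_first = steer_residual(RESIDUAL, FEATURE_DIRECTION, ALPHA, step=0)
untouched_later = steer_residual(RESIDUAL, FEATURE_DIRECTION, ALPHA, step=1)
print(steered_first, untouched_later)
```

Gating on `step == 0` is the design choice that matters here: the intervention nudges the model's initial state and then leaves the rest of generation alone.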
Three properties of this reasoning mode are striking:
Early triggering. The reasoning-oriented internal state is triggered early in generation, not built up through sequential token production. This contrasts with the H2 assumption that reasoning emerges through the step-by-step construction of a chain.
Override robustness. The latent reasoning mode can override prompt-level instructions that discourage explicit reasoning, including the /no_think instruction used in Qwen models. The internal state takes precedence over surface directives, suggesting the latent mechanism operates at a deeper level than prompt compliance.
Cross-model generality. The finding replicates across six model families up to 70B parameters, suggesting this is not an architecture-specific artifact but a general property of how large language models organize reasoning capability.
The implication is that CoT prompting is one effective but not unique way of activating an underlying reasoning mechanism. Other triggers include altered decoding procedures (CoT-decoding, from "Do base models already contain hidden reasoning ability?"), soft continuous representations (from "Can we explore multiple reasoning paths without committing to one token?"), and now direct feature steering. The multiplicity of triggers, all converging on the same capability, is the strongest evidence that the capability is latent and the triggers are interchangeable surface-level activators.
This extends the repertoire of steerable behavioral dimensions from reasoning verbosity ("Can we steer reasoning toward brevity without retraining?"), personality ("Can we track and steer personality shifts during model finetuning?"), and truthfulness, honesty, and morality ("Can high-level concepts replace circuit-level analysis in AI?") to include reasoning activation itself, arguably the most consequential dimension yet.
Inquiring lines that use this note as a source 50
This note is a source for these synthesized inquiries.
- Can steering a single latent feature replicate chain-of-thought performance?
- What other latent LLM capabilities remain inactive without explicit activation cuing?
- Do high-influence thoughts align with SAND deliberation triggers?
- What makes diffusion chain-of-thought reasoning qualitatively different from sequential chain-of-thought?
- What makes schema identification necessary after assessing thoughts and evidence?
- Why do chain-of-thought prompts work if reasoning is not systematic?
- Can activation steering directly steer models toward concise reasoning without prompting?
- Can chain-of-thought faithfulness exist without causal necessity in reasoning?
- Can prompting alone inject new domain knowledge into a model?
- Can chain of thought be deployed selectively to save inference tokens?
- How do autoregressive models constrain where chain-of-thought prompts can be positioned?
- Why might latent reasoning capture types of thinking that verbalized CoT cannot?
- How does LatentQA differ from predefined concept steering like representation engineering?
- Can chain-of-thought reasoning be genuinely causal if exemplars don't need logic?
- Do chain-of-thought explanations reveal genuine reasoning or trigger latent features?
- Do causal histories determine what mental states a system can instantiate?
- Can targeted activation steering surface latent reasoning in base models?
- Why does chain of thought reasoning fail across different prompt formats?
- Can users inject entirely new knowledge into models through prompting alone?
- Can reasoning style be steered as a single linear direction?
- What makes some concepts more steerable than others in activation space?
- Can continuous latent reasoning match discrete chain-of-thought without training modifications?
- Why does latent reasoning override no-think instructions in models?
- What other triggers can activate the latent reasoning capability?
- How early in token generation does the reasoning mode activate?
- Does this reasoning steering method work consistently across all model sizes?
- Why do some reasoning steps receive negligible attention from later steps?
- What makes thought identifiability provable without auxiliary training data?
- How can prompt intervention reduce redundant reasoning steps dynamically?
- How do prompting and activation steering relate as compression strategies?
- What other internal model decisions beyond attention could be optimized directly?
- What is the distinction between teaching reasoning how versus when to activate?
- Can pretraining signals unlock latent reasoning that post-training merely activates?
- Can you steer reasoning by directly manipulating SAE features?
- How does explicit reasoning transparency differ from internal chain-of-thought explanations?
- Why might chain-of-thought reasoning bypass action selection pathways?
- Why do attention circuits need causal verification beyond feature visualization?
- What distinguishes a representational feature from a causally inert correlation?
- Can reasoning happen in latent space without chain of thought?
- Can activation steering compress reasoning without retraining models?
- Does the base model already contain latent reasoning capability?
- Can models possess latent reasoning capability that training signals fail to unlock?
- What distinguishes metacognitive regulation from standard chain-of-thought reasoning?
- How does continuous soft thinking explore multiple paths without explicit training?
- What mechanisms activate latent reasoning capabilities already present in base models?
- What makes o1's chain-of-thought processing specifically effective for exploration tasks?
- How does latent reasoning recursion compare to chain-of-thought reasoning?
- How do compact latent dynamics enable planning without explicit chain of thought?
- Can minimal training signals unlock reasoning already latent in pretrained representations?
- Can single representation edits match chain-of-thought reasoning without explicit steps?
Related concepts in this collection 8
- Do base models already contain hidden reasoning ability?
  Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
  converges: CoT-decoding, CFT, RLVR, and now SAE steering all unlock latent reasoning
- Can we steer reasoning toward brevity without retraining?
  This explores whether model reasoning style occupies learnable geometric directions in activation space, and whether we can shift toward concise thinking by steering through that space without expensive retraining.
  extends steerable dimensions from verbosity to reasoning activation
- Can high-level concepts replace circuit-level analysis in AI?
  Instead of reverse-engineering individual circuits, can we study AI reasoning by treating concepts as directions in activation space? This matters because circuit analysis hits practical limits at scale.
  SAE steering for reasoning adds a new dimension to the RepE paradigm
- Can we explore multiple reasoning paths without committing to one token?
  Standard language models pick one token at each step, collapsing uncertainty and forcing single reasoning trajectories. Could preserving the full probability distribution across token embeddings enable implicit parallel exploration instead?
  another non-CoT trigger for latent reasoning
- Where does LLM reasoning actually happen during generation?
  Does multi-step reasoning emerge from visible chain-of-thought text, hidden layer dynamics, or simply more computation? Three competing hypotheses make different predictions and can be empirically tested.
  provides causal evidence for H1
- Can latent thought vectors scale language models beyond parameters?
  Explores whether explicit latent thought vectors with dual-rate learning create new scaling dimensions independent of model size. This matters because it suggests alternatives to simply building larger models.
  LTMs make latent thought vectors explicit architectural components; SAE steering shows reasoning features are already implicit in standard architectures
- Does RL teach reasoning or just when to use it?
  Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
  converges: RL teaches when to activate reasoning; SAE steering shows the reasoning mechanism is a single activatable feature
- Do language models actually use their encoded knowledge?
  Probes can detect that LMs encode facts internally, but do those encoded facts causally influence what the model generates? This explores the gap between knowing and doing.
  SAE steering closes this gap for reasoning: the identified feature IS causally active, unlike many encoded-but-unused representations
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Reasoning Beyond Chain-of-Thought: A Latent Computational Mode in Large Language Models
- Fast, Slow, and Tool-augmented Thinking for LLMs: A Review
- LLM Reasoning Is Latent, Not the Chain of Thought
- Latent Skill Discovery for Chain-of-Thought Reasoning
- Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
- How do Transformers Learn Implicit Reasoning?
- Base Models Know How to Reason, Thinking Models Learn When
- Eliciting Reasoning in Language Models with Cognitive Tools
Original note title
steering a single SAE-identified reasoning feature matches CoT performance while bypassing explicit chain-of-thought — CoT is one trigger for latent reasoning not its cause