Can activation steering directly steer models toward concise reasoning without prompting?

This explores whether you can edit a model's internal activations to make it reason more briefly — pushing it toward shorter chains of thought by intervening in the network directly, rather than asking for brevity in the prompt.

The corpus answers this fairly clearly: yes. The most direct evidence is Activation-Steered Compression, which finds that verbose and concise chains of thought occupy *distinct regions of activation space*, so "how much the model rambles" can be treated as a linear direction you can isolate and push along (see "Can we steer reasoning toward brevity without retraining?"). From just 50 paired examples, it extracts a single steering vector that cuts reasoning length by 67% while maintaining accuracy, with no retraining. That's the literal answer: concise reasoning is steerable as a direction, not something you have to coax through wording.
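To make "a direction you can isolate and push along" concrete, here is a minimal sketch of one common way to build a steering vector: take the difference between the mean activation of concise examples and the mean activation of verbose ones, then add a scaled copy to a hidden state. Whether Activation-Steered Compression uses exactly this difference-of-means recipe is an assumption. The function names, the `alpha` strength parameter, and the toy 3-dimensional activations are all hypothetical and stand in for real hidden states captured from a model.

```python
from statistics import fmean


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]


def steering_vector(concise_acts, verbose_acts):
    """Difference of means: points from the verbose cluster toward the concise one."""
    mu_concise = mean_vector(concise_acts)
    mu_verbose = mean_vector(verbose_acts)
    return [c - v for c, v in zip(mu_concise, mu_verbose)]


def steer(hidden, vector, alpha=1.0):
    """Add alpha * vector to one hidden state (alpha is a hypothetical strength knob)."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy activations for paired examples (assumed data, not from the paper).
concise = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
verbose = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

v = steering_vector(concise, verbose)
steered = steer([0.0, 2.0, 2.0], v, alpha=0.5)
print(v)        # direction from verbose toward concise
print(steered)  # hidden state nudged halfway along that direction
```

In a real setup the vectors would be hidden states read from a chosen layer and the addition would happen inside the forward pass at inference time. The sketch only shows the arithmetic, which is why it needs no training step.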

What makes this interesting is that the same machinery works for the *opposite* knob too. If you can steer reasoning shorter, you can also switch it on in the first place: a single SAE-identified reasoning feature can be amplified to trigger chain-of-thought-quality reasoning with no explicit prompt, and notably this latent reasoning mode activates early in generation and *overrides surface-level instructions* (see "Can we trigger reasoning without explicit chain-of-thought prompts?"). So steering isn't just a brevity trick. It's evidence that the dial for "how much to reason" is a real, manipulable internal variable that sits underneath, and can outrank, whatever the prompt says.
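Amplifying a single SAE feature can be sketched in the same spirit. The snippet below assumes a sparse autoencoder whose feature has an encoder direction (used to read the feature's activation, with a ReLU) and a decoder direction (used to write it back into the hidden state). Both directions, the `boost` value, and the 2-dimensional toy state are hypothetical and are not taken from the cited work.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def feature_activation(hidden, enc_dir, bias=0.0):
    """ReLU readout of one SAE feature from a hidden state."""
    return max(0.0, dot(hidden, enc_dir) + bias)


def amplify_feature(hidden, dec_dir, boost):
    """Push the hidden state along the feature's decoder direction."""
    return [h + boost * d for h, d in zip(hidden, dec_dir)]


# Toy directions (assumed): the feature reads and writes the first coordinate.
enc_dir = [1.0, 0.0]
dec_dir = [1.0, 0.0]
hidden = [0.5, 2.0]

before = feature_activation(hidden, enc_dir)
boosted = amplify_feature(hidden, dec_dir, boost=4.0)
after = feature_activation(boosted, enc_dir)
print(before, after)  # the feature reads much stronger after the intervention
```

The point of the sketch is that the intervention touches only the hidden state. Nothing in the prompt changes, which is consistent with the observation that the steered mode can outrank surface-level instructions.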

The reason this works at all connects to a deeper finding in the corpus: base models already *contain* latent reasoning capability, and post-training (or steering) merely selects it rather than creating it. Five independent mechanisms (RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR) all elicit reasoning that was already present in the activations (see "Do base models already contain hidden reasoning ability?"). Activation steering for concise reasoning is one more instance of the same principle: you're tuning an elicitation knob, not teaching a new skill. That's also why it's cheap (50 examples, training-free) and plausibly why it generalizes across model sizes.

It's worth seeing what activation steering is competing against, because the corpus offers two rival routes to the same goal. One is reinforcement learning that teaches a model *when* to think versus answer directly: Thinkless trains a single model to route between extended reasoning and concise responses, decoupling the mode choice from the answer itself (see "Can models learn when to think versus respond quickly?"). The other is prompting, which the corpus suggests has a hard ceiling: prompt optimization can only reorganize what's already in the model's distribution, not reshape behavior at will (see "Can prompt optimization teach models knowledge they lack?"). Activation steering threads between these: it is more surgical and instruction-overriding than a prompt, and far cheaper than an RL training run. There's even a precedent for editing at the activation layer rather than the text layer: consistency training has an activation-level variant (ACT) that shapes behavior by intervening on internal representations rather than outputs (see "Can models learn to ignore irrelevant prompt changes?").

The thing you might not have expected to learn: brevity and reasoning itself appear to be the *same kind of object* internally. Both are linear directions you can grab and push. The unsettling corollary is that steering along such directions can override explicit instructions, as the SAE-feature result shows, which means the model's verbosity (and whether it reasons at all) is governed by a hidden dial that the prompt only partially controls.

Sources 6 notes

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving a 2.73x speedup. The method is training-free and generalizes across model sizes and domains.

Can we trigger reasoning without explicit chain-of-thought prompts?

SAE-identified reasoning features can be directly steered to match or exceed chain-of-thought performance across six model families. This reasoning mode activates early in generation and overrides surface-level instructions, suggesting latent reasoning is a fundamental capability independent of explicit prompting.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Can models learn to ignore irrelevant prompt changes?

Two methods, BCT (output-level) and ACT (activation-level), train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, which avoids the specification staleness and capability staleness inherent in standard SFT.
