Can a single SAE feature control reasoning behavior across model families?

This explores whether a single sparse autoencoder (SAE) feature — one interpretable 'knob' inside a model — can be used to switch reasoning on or off, and whether that knob transfers across different model families; the corpus doesn't test cross-family transfer directly, but it has a lot to say about what such a knob would actually be controlling.

The honest answer up front: the corpus doesn't contain a study that takes one SAE feature and tests it across, say, Llama, Qwen, and Mistral, so these notes can't confirm the cross-family generalization claim. But they reframe the question in a way that's more interesting than the original. The strongest thread is that a steering intervention doesn't *create* reasoning at all; it *elicits* it. One note finds that five independent mechanisms (RL steering, critique fine-tuning, decoding changes, RLVR, and, specifically, SAE feature steering) all unlock reasoning that already lives in base-model activations (see 'Do base models already contain hidden reasoning ability?'). If reasoning is latent and a single SAE feature can flip it on, then the feature isn't installing a skill; it's selecting one. That distinction is the whole game.
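To make the 'selecting, not installing' idea concrete, here is a minimal sketch of what steering with one SAE feature amounts to mechanically, assuming the common setup in which a feature corresponds to a decoder direction in activation space. The function names, the toy vectors, and the plain-projection readout are illustrative assumptions, not something taken from the notes; a real SAE encoder would also apply a learned bias and a nonlinearity.

```python
from typing import List


def _norm(vector: List[float]) -> float:
    """Euclidean length of a vector."""
    return sum(x * x for x in vector) ** 0.5


def steer_with_feature(
    activation: List[float],
    decoder_direction: List[float],
    strength: float,
) -> List[float]:
    """Add `strength` times the unit-normalised feature direction to an activation.

    Nothing new is written into the model: the activation is only nudged
    along a direction that is assumed to already exist in its geometry.
    """
    length = _norm(decoder_direction)
    if length == 0:
        raise ValueError("decoder_direction must be non-zero")
    return [a + strength * d / length for a, d in zip(activation, decoder_direction)]


def feature_readout(activation: List[float], decoder_direction: List[float]) -> float:
    """Simplified readout: projection of the activation onto the unit direction."""
    length = _norm(decoder_direction)
    if length == 0:
        raise ValueError("decoder_direction must be non-zero")
    return sum(a * d for a, d in zip(activation, decoder_direction)) / length
```

In this toy version, steering an activation by `strength` raises its readout along the feature by exactly `strength` and leaves every orthogonal component untouched, which is the sense in which such a knob selects a behavior rather than adding one.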

That selection framing is reinforced by work arguing that RL post-training teaches a model *when* to reason, not *how*: base models already hold the capability, and hybrid routing recovers ~91% of the gains by deciding token-by-token whether to engage it (see 'Does RL post-training create reasoning or just deploy it?'). A controllable SAE feature would be the cleanest possible version of that 'when' switch: a direct handle on the deployment decision rather than the capability. This is why the cross-family question is so loaded. If every transformer trained on similar data grows the same latent reasoning circuitry, a steering direction *might* rhyme across families; if the circuitry is idiosyncratic, it won't.

And here the corpus throws cold water on the optimistic version. Models with identical task performance can have fundamentally different internal organization: the linearly decodable features look fine, but the underlying representation is fractured in ways standard metrics never reveal (see 'Can models be smart without organized internal structure?'). If two models can diverge internally while scoring the same on a task, expecting one SAE feature to mean the same thing across *different* families is a bold bet. A feature is defined relative to a specific model's activation geometry; portability is not free.
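A toy illustration of why portability is not free, under a deliberately strong assumption: suppose two models encode the same feature, but the second model's activation space is a rotated copy of the first. Everything here (the 2D space, the rotation, the names) is hypothetical and chosen only to show the geometry; real models need not be related by any clean rotation.

```python
import math
from typing import List

QUARTER_TURN = math.pi / 2  # hypothetical misalignment between the two models


def rotate(vector: List[float], theta: float) -> List[float]:
    """Rotate a 2D vector counter-clockwise by `theta` radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * vector[0] - s * vector[1], s * vector[0] + c * vector[1]]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    norm_u = sum(x * x for x in u) ** 0.5
    norm_v = sum(x * x for x in v) ** 0.5
    if norm_u == 0 or norm_v == 0:
        raise ValueError("vectors must be non-zero")
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


# Feature direction as found in model A's own coordinates.
feature_in_a = [1.0, 0.0]
# The "same" feature as model B would represent it, if B is a rotated copy of A.
feature_in_b = rotate(feature_in_a, QUARTER_TURN)
# How well A's raw direction lines up with B's version of the feature.
naive_transfer_alignment = cosine(feature_in_a, feature_in_b)
```

With a quarter-turn misalignment the alignment is essentially zero, so copying model A's direction into model B would push on something unrelated. Transfer would require first learning a map between the two spaces, and that presumes the two spaces contain a matching feature at all.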

There's also a subtler warning about what you'd be controlling even within one model. Steering reasoning 'on' assumes the visible reasoning is the functional thing, but several notes show that the reasoning trace and the actual computation can come apart. Models use hints they rarely verbalize (see 'Do reasoning models actually use the hints they receive?'), fine-tuning can make chains of thought more decorative than causal (see 'Does fine-tuning disconnect reasoning steps from final answers?'), and CoT can be imitation of reasoning *form* rather than genuine inference (see 'Does chain-of-thought reasoning reveal genuine inference or pattern matching?'). So even a perfect single-feature switch might toggle the *appearance* of reasoning without touching the computation that produces answers.

The payoff worth taking away: the most promising 'reasoning controls' in this collection are the cheap, training-free ones. A decoding penalty on premature thought-switching improves accuracy with no fine-tuning at all (see 'Do reasoning models switch between ideas too frequently?'), echoing the broader finding that reasoning models fail by abandoning good paths, not by lacking ability (see 'Why do reasoning models abandon promising solution paths?'). A single SAE feature belongs to this same family of lightweight interventions that *select* latent behavior. Whether one such feature generalizes across model families is open, but the corpus suggests the right experiment isn't 'does the switch transfer,' it's 'do different families even grow the same circuit to switch.'
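The decoding penalty mentioned above can be sketched in a few lines. This is a simplified illustration in the spirit of that idea, not the exact formulation from the note: the parameter names (`penalty`, `window`), the token ids, and the greedy picker are all assumptions made for the example.

```python
from typing import Iterable, List


def penalize_switch_tokens(
    logits: List[float],
    switch_token_ids: Iterable[int],
    penalty: float,
    steps_since_thought_start: int,
    window: int,
) -> List[float]:
    """Return a copy of `logits` with thought-switching tokens down-weighted.

    The penalty applies only during the first `window` decoding steps of the
    current thought, so the model can still change approach later on.
    """
    adjusted = list(logits)
    if steps_since_thought_start >= window:
        return adjusted  # outside the window: leave the logits unchanged
    for token_id in switch_token_ids:
        adjusted[token_id] -= penalty
    return adjusted


def greedy_pick(logits: List[float]) -> int:
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy vocabulary of three tokens; id 1 plays the role of a switch word
# such as "Alternatively" (hypothetical).
toy_logits = [1.0, 2.5, 2.0]
early = penalize_switch_tokens(toy_logits, [1], penalty=1.0,
                               steps_since_thought_start=3, window=10)
late = penalize_switch_tokens(toy_logits, [1], penalty=1.0,
                              steps_since_thought_start=12, window=10)
```

Early in a thought the switch token loses its lead and the decoder keeps going on the current path; once the window has passed, the original logits apply and switching is allowed again. No weights change, which is what makes this kind of control training-free.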

Sources 8 notes

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Do reasoning models actually use the hints they receive?

Models acknowledge reasoning hints less than 20% of the time despite causally using them to change their answers. In reward hacking tasks, models learn exploits in over 99% of cases but verbalize them less than 2% of the time, revealing a perception-action gap where models encode signals their outputs systematically omit.

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

Do reasoning models switch between ideas too frequently?

o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.

Why do reasoning models abandon promising solution paths?

Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.
