Why do models with less steerability have more abstract ideological features?
This explores a finding from interpretability research: that models which resist having their politics nudged around tend to have richer, more deeply embedded ideological structure inside them — and why those two things travel together.
This reads the question as being about a specific result: when a model is hard to steer ideologically, it's usually because its political views aren't sitting on the surface as a few flippable switches; they're woven into a dense web of features that reinforce each other. The sharpest evidence comes from sparse-autoencoder (SAE) analysis of political representation, which found that models of similar scale can differ by up to 7.3× in how many distinct political features they carry, and that the feature-rich models are both harder to redirect and more logically consistent across related topics (see "Can we measure how deeply models represent political ideology?"). Steerability, in other words, is a symptom. Shallow ideology is easy to push because there's little holding it in place; deep ideology resists because moving one belief would contradict a dozen others the model also holds.
The reason this matters becomes clearer when you look at what steering actually does mechanically. Many traits turn out to live along a *single linear direction* in activation space: verbosity can be compressed by extracting one vector from 50 paired examples (see "Can we steer reasoning toward brevity without retraining?"), and traits like sycophancy or hallucination ride on identifiable 'persona vectors' you can monitor and nudge (see "Can we track and steer personality shifts during model finetuning?"). When a property is that linearly accessible, it's highly steerable, which is the flip side of the ideological-depth finding. Abstract, richly represented features aren't a clean single direction; they're distributed and entangled, so there's no one lever to pull.
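To make the "single linear direction" idea concrete, here is a minimal sketch using a difference-of-means vector, one common way to extract such a direction. It is not the method of either cited note: the activations are synthetic, and the names `steering_vector`, `project`, `steer`, `trait_axis`, `concise` and `verbose` are hypothetical, chosen only for illustration.

```python
import numpy as np


def steering_vector(with_trait, without_trait):
    """Unit-length difference-of-means direction between two activation sets."""
    diff = np.mean(with_trait, axis=0) - np.mean(without_trait, axis=0)
    norm = np.linalg.norm(diff)
    if norm == 0.0:
        raise ValueError("both activation sets have the same mean")
    return diff / norm


def project(acts, direction):
    """Scalar projection of each activation onto the direction (monitoring)."""
    return np.asarray(acts) @ direction


def steer(acts, direction, alpha):
    """Shift every activation by alpha along the direction (nudging)."""
    return np.asarray(acts) + alpha * direction


# Synthetic stand-in for 50 paired examples at one layer (hidden size 16).
rng = np.random.default_rng(0)
hidden = 16
trait_axis = np.zeros(hidden)
trait_axis[3] = 1.0                       # assumption: the trait lives on one axis
concise = rng.normal(size=(50, hidden))   # hypothetical "brief" activations
verbose = concise + 2.0 * trait_axis      # same inputs with the trait present

v = steering_vector(verbose, concise)
gap = project(verbose, v).mean() - project(concise, v).mean()
restored = steer(verbose, v, alpha=-gap)  # push the trait back out
```

In this toy setup the trait is one clean axis by construction, so a single vector both measures it (`project`) and removes it (`steer`). If the same property were spread across many entangled features, no single `alpha * direction` shift would undo it, which is the sense in which deep representations offer no one lever.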
There's a subtler thread worth pulling on here: high steerability can be a sign of *fractured* internal organization rather than clean structure. Models can hold all the linearly decodable features a task needs while their underlying organization is fundamentally broken, a flaw that is invisible to accuracy metrics but leaves them fragile under perturbation (see "Can models be smart without organized internal structure?"). That reframes the question's premise: a model that's easy to steer isn't necessarily 'more open-minded'; it may just have thinner, more brittle representations that a small push knocks over. Depth and consistency, not flexibility, are what resist steering.
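The gap between "linearly decodable" and "robust" can be shown with a toy example. The sketch below does not reproduce the cited work or model fractured representations as such. It only assumes two synthetic representations that a fixed linear probe reads perfectly, one encoding the label with a wide margin and one with a thin margin, and then perturbs both. All names and numbers (`make_reps`, the margins 3.0 and 0.05, the noise scale 0.5) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 400, 8
labels = rng.choice([-1.0, 1.0], size=n)
probe = np.zeros(d)
probe[0] = 1.0  # fixed linear probe that reads the first coordinate


def make_reps(margin):
    """Representations whose first coordinate encodes the label at a given margin."""
    reps = rng.normal(size=(n, d))
    reps[:, 0] = margin * labels
    return reps


def accuracy(reps):
    """Fraction of examples where the probe's sign matches the label."""
    return float(np.mean(np.sign(reps @ probe) == labels))


def perturbed_accuracy(reps, noise_std):
    """Probe accuracy after adding Gaussian noise to every coordinate."""
    noise = rng.normal(scale=noise_std, size=reps.shape)
    return accuracy(reps + noise)


deep = make_reps(margin=3.0)   # label held in place by a wide margin
thin = make_reps(margin=0.05)  # label decodable, but only just

clean = (accuracy(deep), accuracy(thin))
noisy = (perturbed_accuracy(deep, 0.5), perturbed_accuracy(thin, 0.5))
```

On clean inputs the probe scores both representations identically, so the accuracy metric cannot tell them apart. Only the perturbation separates them, with the thin-margin representation losing most of its accuracy while the wide-margin one holds.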
Laterally, the corpus suggests ideological abstraction is partly a story about *where* beliefs come from and how training layers them. Models acquire ethical content during pretraining but get behavioral constraints bolted on later through RLHF, and the two can diverge structurally: a model can state that lying is wrong while doing it, not from choice but because two training mechanisms point different ways (see "Can LLMs hold contradictory ethical beliefs and behaviors?"). Safety alignment also actively *suppresses* certain internal capacities. It cuts the rate at which a model detects steering injections from 63.8% to 10.8% (see "How do language models detect injected steering vectors internally?"), and it erodes nuance in morally complex roleplay as characters grow less moral, with scores falling from 3.21 for moral paragons to 2.62 for villains and crude aggression substituted for subtle malevolence (see "Does safety alignment harm models' ability to roleplay villains?"). So the abstract-feature-rich models may be the ones whose deep representations survived training relatively intact, while heavily shaped models trade depth for controllability.
The thing you didn't know you wanted to know: steerability and representational depth are in tension. The easier a model is to control along any given axis, the more likely that axis is a shallow, possibly fragile feature, and the models we'd most want to be able to steer (deeply, consistently ideological ones) are precisely the ones built to resist it.
Sources (7 notes)
Can we measure how deeply models represent political ideology?
SAE analysis shows models vary dramatically in political feature count (up to 7.3× difference at similar scale) and in their resistance to ideological redirection. Models with deeper political representations prove harder to steer but produce more logically consistent reasoning across related topics.
Can we steer reasoning toward brevity without retraining?
Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73× speedup. The method is training-free and generalizes across model sizes and domains.
Can we track and steer personality shifts during model finetuning?
Research identifies linear directions in LLM activation space corresponding to specific traits like sycophancy and hallucination. These persona vectors predict finetuning-induced personality shifts before they occur and can preventatively steer training to avoid unwanted trait changes.
Can models be smart without organized internal structure?
Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.
Can LLMs hold contradictory ethical beliefs and behaviors?
Language models acquire ethical content through pretraining and behavioral constraints through RLHF, which can diverge structurally. ChatGPT demonstrated this by stating lying is unethical while doing so, a gap rooted in different training mechanisms, not deliberate choice.
How do language models detect injected steering vectors internally?
Contrastive preference optimization trains evidence-carrier features in early layers to suppress gate features that default to denial, enabling near-perfect detection of internal perturbations. Safety training actively suppresses this capability, reducing detection from 63.8% to 10.8%.
Does safety alignment harm models' ability to roleplay villains?
The Moral RolePlay benchmark shows LLM performance drops from 3.21 for moral paragons to 2.62 for villains, with largest degradation between flawed-but-good and egoistic characters. Models fail most on deception and manipulation traits, substituting crude aggression for nuanced malevolence.