How do semantic features in representations become steerable task-specific directions?
This explores the engineering path from 'meaning is distributed across a model's hidden representations' to 'I can grab a vector and push the model toward a specific behavior' — and what the corpus says makes that possible, and where it breaks.
The semantic structure already sitting inside a model's activations can be turned into a knob for a particular task, and the corpus tells a surprisingly coherent story about how, across notes that don't share much vocabulary. The starting point is that meaning isn't scattered randomly. Embeddings carry rich, structured content before any task-specific work happens: static embeddings already encode valence, concreteness, and other psycholinguistic measures (Do transformer static embeddings actually encode semantic meaning?), and the geometry is regular enough that models encode syntactic relations in something like a polar coordinate system, using both distance and angle (How do language models encode syntactic relations geometrically?). Structure that clean is what makes steering possible at all: if features were geometric noise, there'd be no direction to push.
The bridge from 'structured features' to 'task-specific direction' turns out to be remarkably cheap. The cleanest example: reasoning verbosity behaves like a single linear direction. Researchers extracted one vector from 50 paired verbose/concise examples and used it to cut chain-of-thought length by about two-thirds (67%) with no retraining (Can we steer reasoning toward brevity without retraining?). That's the whole move in miniature: a behavior you'd think requires fine-tuning is actually a direction in activation space you can nudge along. Representation finetuning generalizes this: instead of updating weights, ReFT learns interventions on frozen representations and is reported to be 10-50x more parameter-efficient than LoRA (Can editing hidden representations beat weight updates for finetuning?). And you can make the directions composable: tuning only the singular values of weight matrices yields expert vectors that mix at inference without interfering with each other (Can models dynamically activate expert skills at inference time?). The common thread: the task-specific direction was latent in the representation, and 'steering' is just learning where to find it.
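To make the "one vector from paired examples" move concrete, here is a minimal sketch of a difference-of-means steering vector. This is a common recipe for activation steering, assumed here for illustration; the notes do not say it is the exact extraction procedure used in the verbosity work. The function names (`mean_vector`, `steering_vector`, `steer`), the scale `alpha`, and the toy 3-d activations are all hypothetical.

```python
from statistics import fmean


def mean_vector(rows):
    """Component-wise mean of equally sized activation vectors."""
    return [fmean(col) for col in zip(*rows)]


def steering_vector(target_acts, baseline_acts):
    """Difference of means: points from the baseline behavior toward the target."""
    target_mean = mean_vector(target_acts)
    baseline_mean = mean_vector(baseline_acts)
    return [t - b for t, b in zip(target_mean, baseline_mean)]


def steer(hidden, direction, alpha=1.0):
    """Add alpha * direction to one hidden state; no weights are changed."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy activations (made up): in this example only dimension 0 tracks verbosity.
verbose_acts = [[2.0, 0.5, 1.0], [3.0, 0.5, 1.0]]
concise_acts = [[0.0, 0.5, 1.0], [1.0, 0.5, 1.0]]

# Direction that moves a hidden state from "verbose" toward "concise".
to_concise = steering_vector(concise_acts, verbose_acts)

# Nudge a new hidden state along that direction at inference time.
steered = steer([2.5, 0.1, 0.9], to_concise, alpha=1.0)
```

In a real model the activations would be read from a chosen layer and the vector added back at that same layer during generation; the layer choice and the value of `alpha` are tuning decisions this sketch leaves open.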
The less obvious point is that steerability and entanglement are two sides of the same coin. The reason features form usable directions is also the reason you can't isolate them. In LLM embeddings, twenty-eight semantic axes collapse onto roughly three principal components that match the human evaluation-potency-activity (EPA) structure, so intervening on one feature predictably drags its aligned neighbors along, creating unavoidable off-target effects (Do LLM semantic features organize along human evaluation dimensions?). Clean steering and surgical precision pull against each other: the low-dimensional structure that gives you a handle is exactly what makes the handle move more than you grabbed.
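The off-target effect has a simple geometric reading, sketched below under the assumption that each feature is read out as a projection onto a unit direction. Pushing a hidden state by `alpha` along one feature then shifts any other feature's readout by `alpha` times the cosine between the two directions. The 2-d directions `feature_a` and `feature_b` are hypothetical toy values, not measurements from the notes.

```python
import math


def dot(u, v):
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


# Hypothetical feature directions in a toy 2-d space.
feature_a = unit([1.0, 0.0])  # the feature we intend to steer
feature_b = unit([3.0, 4.0])  # a neighbor sharing part of the same axis

hidden = [0.0, 0.0]
alpha = 2.0

# Steer only along feature_a.
steered = [h + alpha * d for h, d in zip(hidden, feature_a)]

# Measure how far each feature's readout moved.
shift_a = dot(steered, feature_a) - dot(hidden, feature_a)
shift_b = dot(steered, feature_b) - dot(hidden, feature_b)

# shift_b equals alpha * cos(angle between the directions): nonzero whenever
# the two features are not orthogonal.
cosine_ab = dot(feature_a, feature_b)
```

Only orthogonal features would stay put; when many features load on a few shared components, as the note describes, the cosine is rarely zero.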
Two cross-domain framings sharpen this. First, why intervene on representations at all rather than just prompting? Because prompting often loses. When a model's training priors are strong, text in the context window gets overridden, and the corpus notes that fixing this requires causal intervention in the representations, not better wording (Why do language models ignore information in their context?). That's a direct argument for steering as a control surface prompting can't reach. Second, where do these directions live? Networks naturally decompose tasks into modular subnetworks that can be ablated independently (Do neural networks naturally learn modular compositional structure?), which suggests task-specific directions aren't imposed from outside but discovered in structure the model already built for itself.
If you want to go deeper on the philosophical edge of this, namely whether these 'semantic' directions are meaning or only form, the corpus stages a genuine debate, from the claim that form alone can't yield meaning (Can language models learn meaning from text patterns alone?) to the counter that relational structure compressed from text is meaning enough (Can language models learn meaning without engaging the world?). The steering work quietly sides with the latter: you can only push on a direction that encodes something.
Sources (10 notes)
Clustering analysis of RoBERTa embeddings reveals sensitivity to five psycholinguistic measures including valence, concreteness, iconicity, and taboo. This demonstrates that static embeddings function as genuine lexical entries containing semantic content before self-attention operates.
The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.
Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.
ReFT learns task-specific interventions on frozen model representations rather than updating weights, with LoReFT (low-rank linear subspace variant) dramatically outperforming LoRA across reasoning, instruction-following, and NLU benchmarks while using far fewer parameters.
Transformer2 demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.
Twenty-eight semantic axes in LLM embeddings reduce to three principal components matching human EPA structure. Intervening on one feature predictably shifts aligned features proportionally, creating unavoidable off-target effects that reflect how meaning is fundamentally organized.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.
Bender & Koller argue that meaning requires the relation between expressions and communicative intents. Since LLMs are trained only on form-to-form prediction with no access to shared attention or intent, they cannot reconstruct the meaning that grounds language.
Research shows LLMs learn culturally situated discourse patterns by compressing relational structure from text, demonstrating that fluent language generation requires no external referents or embodied grounding.