INQUIRING LINE

What makes representation interventions more efficient than weight perturbations for finetuning?

This explores why editing a model's hidden activations (ReFT-style intervention) can outperform classic finetuning approaches like LoRA that nudge the weights themselves.


Representation finetuning leaves every weight frozen and instead learns a small intervention on the representations that flow through the layers. The core result in the corpus is that it achieves roughly 10-50x better parameter efficiency than LoRA while matching or exceeding it on reasoning, instruction-following, and language understanding (see: Can editing hidden representations beat weight updates for finetuning?). The interesting question is *why* a few directions in activation space can do the work of many weight updates.
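A minimal sketch of what such an intervention can look like, assuming the low-rank form usually given for LoReFT, h + R^T(Wh + b - Rh). This note does not spell the formula out, so treat it as an assumption. The function names and every size in the parameter count (hidden width 4096, 32 layers, LoRA rank 8 on two matrices per layer, rank-4 interventions at 8 sites) are hypothetical values chosen for illustration, not settings from the cited work.

```python
def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def loreft(h, R, W, b):
    """Edit hidden state h inside the subspace spanned by the rows of R.

    Computes h + R^T (W h + b - R h). Only the r directions in R are
    changed; the rest of h passes through untouched.
    """
    Rh = matvec(R, h)  # current coordinates of h in the subspace
    Wh = matvec(W, h)  # learned target coordinates (before the bias)
    delta = [wh + bi - rh for wh, bi, rh in zip(Wh, b, Rh)]
    return [
        h[j] + sum(R[i][j] * delta[i] for i in range(len(R)))
        for j in range(len(h))
    ]


def loreft_param_count(d, r):
    """R (r x d) + W (r x d) + b (r) for one intervention site."""
    return 2 * r * d + r


def lora_param_count(d_in, d_out, r):
    """A (r x d_in) + B (d_out x r) for one adapted weight matrix."""
    return r * (d_in + d_out)


# Toy example: a rank-1 intervention along the first coordinate.
h = [5.0, 1.0, 1.0, 1.0]
R = [[1.0, 0.0, 0.0, 0.0]]  # orthonormal rows (trivially, for r = 1)
W = [[0.0, 0.0, 0.0, 0.0]]
b = [2.0]
steered = loreft(h, R, W, b)  # first coordinate becomes 2.0, rest unchanged

# Hypothetical budgets: LoRA on 2 matrices in each of 32 layers versus
# rank-4 interventions at 8 sites, both with hidden width 4096.
lora_total = 32 * 2 * lora_param_count(4096, 4096, 8)
loreft_total = 8 * loreft_param_count(4096, 4)
```

Per site, the two methods cost about the same at equal rank. Under these made-up settings the saving (about 16x) comes from intervening at fewer places with a smaller rank, which only works if a handful of directions really is enough.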

The answer the corpus keeps circling back to is that the useful information is already organized in the representations: you don't need to rebuild it, only point at it. Several notes show that high-level behaviors live as simple, low-dimensional structures in activation space. Chain-of-thought verbosity, for instance, turns out to be a single linear direction that can be extracted from about 50 paired examples and steered along, cutting reasoning length by 67% with no retraining at all (see: Can we steer reasoning toward brevity without retraining?). If a behavior is one direction in activation space, then a representation intervention that learns a low-rank subspace operates at exactly the right granularity, whereas a weight update has to express the same change indirectly, spread across the parameters of every matrix it adapts.
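The idea of a behavior as a single direction can be sketched with a generic difference-of-means steering vector. This is an illustration of the general technique, not necessarily the exact procedure of the cited method. The function names, the toy two-dimensional activations, and the `alpha` strength are all hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_direction(verbose_acts, concise_acts):
    """Direction pointing from the verbose mean toward the concise mean."""
    mean_verbose = mean_vector(verbose_acts)
    mean_concise = mean_vector(concise_acts)
    return [c - v for c, v in zip(mean_concise, mean_verbose)]


def steer(hidden, direction, alpha=1.0):
    """Shift a hidden state along the direction; no weights are touched."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy paired activations: the two sets differ only in the second coordinate.
verbose_acts = [[1.0, 0.0], [3.0, 0.0]]
concise_acts = [[1.0, 2.0], [3.0, 2.0]]

direction = steering_direction(verbose_acts, concise_acts)
nudged = steer([1.0, 1.0], direction, alpha=0.5)
```

In a real model the activations would be read from a chosen layer with a forward hook, and `alpha` would be tuned on held-out prompts.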

There's also a deeper reason rooted in how knowledge is stored versus computed. Proxy-tuning finds that direct weight finetuning corrupts the knowledge packed into a model's lower layers, while methods that leave base weights untouched and only shift the output distribution preserve that knowledge and still close 88-91% of the alignment gap (see: Can decoding-time tuning preserve knowledge better than weight fine-tuning?). Proxy-tuning acts on output logits rather than hidden states, but the lesson plausibly carries over: adaptation mostly needs to touch reasoning and style, which live in the flow of representations, not the stored facts baked into the weights. Touching weights risks overwriting things you wanted to keep; touching representations is surgical.
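The "shift the output distribution" step can be sketched as logit arithmetic: the frozen base model's next-token logits are offset by the difference between a small tuned model and its untuned counterpart. This is a minimal sketch under that assumption. The function names and the three-token vocabulary are hypothetical, and real use would take the logits from three models sharing one vocabulary.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_probs(base_logits, expert_logits, antiexpert_logits):
    """Offset base logits by (tuned small model - untuned small model).

    The base model's weights are never modified; only its output
    distribution at decoding time is shifted.
    """
    shifted = [
        b + (e - a)
        for b, e, a in zip(base_logits, expert_logits, antiexpert_logits)
    ]
    return softmax(shifted)


# Toy three-token vocabulary: the base model prefers token 0, and the
# tuned small model pushes toward token 1.
base = [2.0, 1.0, 0.0]
expert = [0.0, 3.0, 0.0]
antiexpert = [0.0, 0.0, 0.0]

probs = proxy_tuned_probs(base, expert, antiexpert)
best_token = probs.index(max(probs))
```

When the tuned and untuned small models agree, the offset is zero and the base distribution is returned unchanged, which is the sense in which the stored knowledge is left alone.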

Why are representations such efficient handles in the first place? One note offers a formal answer: latents at the same level of a model's hierarchy are far more correlated with each other than raw tokens are, which is why predicting latents needs a number of samples that stays constant in hierarchy depth while predicting tokens needs exponentially many (see: Why is predicting latents more sample-efficient than tokens?). The same correlation structure that makes latent prediction cheap plausibly also makes representation editing cheap, because you're working in a space where a little signal goes a long way. And the corpus suggests this structure is something models *learn*: activations grow dense and organized for familiar data during pretraining (see: Is representational sparsity learned or intrinsic to neural networks?), meaning that by the time you finetune, the representation space is already a well-shaped surface to intervene on.

The broader thread worth pulling: the field keeps discovering that you can change what a model *does* without changing what it *is*. Agents improve across episodes by writing verbal reflections into episodic memory with zero weight updates (see: Can agents learn from failure without updating their weights?). Representation finetuning, activation steering, proxy-tuning, and memory-based learning are all variations on the same bet: that the frozen model already contains the capability, and adaptation is mostly a matter of routing, not rebuilding. The efficiency isn't a trick of low-rank math; it's a consequence of how much usable structure pretraining has already laid down in the representations.


Sources (6 notes)

Can editing hidden representations beat weight updates for finetuning?

ReFT learns task-specific interventions on frozen model representations rather than updating weights, with LoReFT (low-rank linear subspace variant) dramatically outperforming LoRA across reasoning, instruction-following, and NLU benchmarks while using far fewer parameters.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Why is predicting latents more sample-efficient than tokens?

A formal sample-complexity analysis proves latent-level self-supervision (data2vec/JEPA style) recovers compositional structure with samples constant in hierarchy depth, while token-level learning requires exponential samples—because same-level latents are far more correlated than raw tokens.

Is representational sparsity learned or intrinsic to neural networks?

During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.

Can agents learn from failure without updating their weights?

Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.
