What makes principle-response mutual information sufficient for behavioral alignment?

This explores why a method like SAMI—which simply maximizes the statistical dependence between a written constitution and a model's responses—turns out to be enough to shift behavior, without any preference labels, reward models, or demonstrations.

Intuition says alignment needs labeled examples of good and bad answers, so why would maximizing the mutual information between principles and responses be enough? The corpus suggests the answer is less mysterious than it sounds: alignment here is mostly a matter of *conditioning behaviors the model already has* on the right text, not teaching it anything new. SAMI (Can models learn behavioral principles without preference labels?) finetunes a model so that each response becomes maximally predictable from the constitution it was written under, and the constitution maximally identifiable from the response. The striking results (a Mistral-7B beating its instruction-tuned baseline, and a *weaker* model writing principles that successfully steer a *stronger* one) make more sense once you see alignment as a steering problem rather than a knowledge-transfer problem.
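The objective described above can be illustrated with a small sketch. It assumes a symmetric InfoNCE-style contrastive estimator, which is one common way to lower-bound mutual information; the notes here do not specify which estimator SAMI uses, so treat that choice as an assumption. The matrix `logp`, the function names, and the toy numbers are all hypothetical: `logp[i][j]` stands in for the log-probability of response `j` under principle `i`, where response `j` was sampled while conditioning on principle `j`.

```python
import math

def log_softmax(values):
    """Numerically stable log-softmax over a list of floats."""
    m = max(values)
    log_z = m + math.log(sum(math.exp(v - m) for v in values))
    return [v - log_z for v in values]

def contrastive_mi_loss(logp):
    """Symmetric InfoNCE-style loss on a square matrix of log-probabilities.

    Assumption: logp[i][j] = log p(response_j | principle_i), and the
    matching (principle, response) pairs sit on the diagonal.
    """
    n = len(logp)
    # Row direction: given principle i, identify its own response.
    row_term = -sum(log_softmax(logp[i])[i] for i in range(n)) / n
    # Column direction: given response j, identify its own principle.
    cols = [[logp[i][j] for i in range(n)] for j in range(n)]
    col_term = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (row_term + col_term)

def mi_lower_bound(logp):
    """InfoNCE-style bound in nats: log(n) minus the loss, capped at log(n)."""
    return math.log(len(logp)) - contrastive_mi_loss(logp)

# Toy case 1: responses are equally likely under every principle.
independent = [[-5.0, -5.0], [-5.0, -5.0]]
# Toy case 2: each response is far more likely under its own principle.
bound_pairs = [[-1.0, -9.0], [-9.0, -1.0]]

print(mi_lower_bound(independent))
print(mi_lower_bound(bound_pairs))
```

Minimizing the loss only requires the model's own likelihoods for responses it generated itself, which is why no preference labels or demonstrations enter the picture. When every response is equally likely under every principle the bound is zero, and it approaches log(n) as each response becomes identifiable from its principle.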

The clearest support for that reading comes from work showing how little of fine-tuning is actually about content. Instruction tuning mostly teaches a model the *shape* of acceptable outputs rather than task understanding: models trained on semantically empty or even wrong instructions perform about as well as those trained on correct ones (Does instruction tuning teach task understanding or output format?). If what transfers during alignment is largely the output distribution (register, format, stance), then a method that tightens the correlation between a principle and a response is operating on exactly the right lever. Mutual information is sufficient because the behavior was latent; the principle just needs to reliably select it.

This also rhymes with two adjacent findings about how behavioral signals actually move between and within models. Traits can propagate through data bearing no semantic relationship to the trait at all, because the mechanism rides on statistical signatures rather than meaning (Can language models transmit hidden behavioral traits through unrelated data?). And consistency training shows models can be reshaped using nothing but their *own* responses as targets, with no external labels (Can models learn to ignore irrelevant prompt changes?). Both reinforce the picture that alignment is often self-supervised pattern-binding, not supervised correction, which is precisely the regime where a mutual-information objective thrives. It's also why decoding-time approaches like proxy-tuning can close most of the alignment gap while leaving weights untouched (Can decoding-time tuning preserve knowledge better than weight fine-tuning?): the behavior is a distributional shift, not a re-learning.
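To make the "distributional shift, not a re-learning" point concrete, here is a minimal sketch of decoding-time logit arithmetic in the style usually attributed to proxy-tuning: the base model's next-token logits are offset by the difference between a small tuned model and its untuned counterpart. The notes above only state that base weights stay untouched, so the exact formula, the function names, and the three-token vocabulary are assumptions for illustration.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def proxy_tuned_probs(base_logits, expert_logits, anti_expert_logits):
    """Shift the base model's logits by (tuned small model - untuned small model).

    No weights change anywhere; only the next-token distribution moves.
    """
    shifted = [
        b + (e - a)
        for b, e, a in zip(base_logits, expert_logits, anti_expert_logits)
    ]
    return softmax(shifted)

# Hypothetical next-token logits over a three-token vocabulary.
base = [2.0, 1.0, 0.0]          # large base model
expert = [0.0, 3.0, 0.0]        # small tuned model prefers token 1
anti_expert = [0.0, 0.0, 0.0]   # small untuned model has no preference

print(softmax(base))
print(proxy_tuned_probs(base, expert, anti_expert))
```

If the tuned and untuned small models agree, the offset is zero and the base distribution is returned unchanged, which is the sense in which the base model's knowledge is left intact.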

But "sufficient for behavioral alignment" hides a sharp limit worth knowing. Mutual information binds a response to a *symbol*, the text of the principle, and a separate strand of the corpus argues that symbol-binding without world contact can't guarantee the model's behavior actually corresponds to the value the principle names (Can AI systems achieve real alignment without world contact?). You can have a model whose outputs are perfectly predictable from its constitution and still have drift between the stated goal and real-world outcomes. The same gap shows up concretely when strong training priors override what the current context says (Why do language models ignore information in their context?): a principle in the prompt is just more context, and context loses to entrenched parametric associations. So the honest version is: principle-response mutual information is sufficient to make behavior *track the principle text*, which is a real and useful kind of alignment, but it's a correlation in symbol-space, not a guarantee of grounded values.
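A toy calculation shows why a mutual-information score alone cannot certify grounded values: mutual information is unchanged when the behaviors are relabeled, so a mapping that inverts every principle scores exactly as well as a faithful one. The two-principle, two-behavior tables below are hypothetical and exist only to illustrate that property.

```python
import math

def mutual_information(joint):
    """Mutual information in nats for a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(joint[x][y] for x in range(len(joint))) for y in range(len(joint[0]))]
    mi = 0.0
    for x, row in enumerate(joint):
        for y, p in enumerate(row):
            if p > 0:
                mi += p * math.log(p / (px[x] * py[y]))
    return mi

# Rows: principle text ("be candid", "be evasive").
# Columns: observed behavior (candid, evasive).
faithful = [[0.5, 0.0],
            [0.0, 0.5]]   # each principle selects the behavior it names
inverted = [[0.0, 0.5],
            [0.5, 0.0]]   # each principle selects the opposite behavior

print(mutual_information(faithful))
print(mutual_information(inverted))
```

Both tables give the same value, log 2 nats. In practice a pretrained model's existing reading of the principle text is what breaks the tie toward the faithful mapping; the objective itself does not.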

Sources 7 notes

Can models learn behavioral principles without preference labels?

SAMI finetunes language models to increase mutual information between constitutions and responses without preference labels or demonstrations. A Mistral-7B trained this way outperformed base and instruction-tuned baselines, and, surprisingly, a weaker model could write principles to align a stronger one.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions: 43% versus 42.6% for a random baseline. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Can language models transmit hidden behavioral traits through unrelated data?

Research demonstrates that behavioral traits propagate between models via filtered data bearing no semantic relationship to the trait. The effect is model-specific, fails across different architectures, and persists despite rigorous filtering—indicating the mechanism embeds statistical signatures rather than semantic content.

Can models learn to ignore irrelevant prompt changes?

Two methods, BCT (output-level) and ACT (activation-level), train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, eliminating the specification staleness and capability staleness inherent in standard SFT.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Can AI systems achieve real alignment without world contact?

Peircean semiotics reveals that symbolic goal encoding without world contact and social mediation cannot guarantee correspondence to actual values. LLMs operating in pure symbol manipulation risk divergence between stated goals and real-world outcomes.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
