How do language models transmit traits through semantically unrelated data?
This explores the finding that one model can pass behavioral traits to another through training data that has no topical connection to the trait — and what mechanism makes that possible. The corpus suggests the answer lives in how models encode statistics rather than meaning.
The puzzle is how a trait can ride along in data that, on its surface, says nothing about that trait. The central result is blunt: behavioral traits do propagate between models through filtered data bearing no semantic relationship to the trait, and they survive rigorous filtering, which tells us the carrier is a statistical signature, not hidden content (see "Can language models transmit hidden behavioral traits through unrelated data?"). The same work notes the effect is model-specific: it works when teacher and student share an architecture and breaks across different ones. That detail is the tell. If the transmission rode on meaning, it would transfer between any two competent language models. The fact that it's keyed to a shared architecture says the trait is encoded in something like a fingerprint of how that particular model distributes probability, a pattern only a sibling model is tuned to read.
That reframes the whole question, because it implies models traffic in statistical mass, not semantics, far more than we assume. There's direct evidence for this: models systematically prefer higher-frequency surface forms over semantically identical rare paraphrases, across math, translation, and reasoning, which suggests they track statistical weight from pretraining rather than recognizing meaning (see "Do language models really understand meaning or just surface frequency?"). If a model's behavior is shaped by frequency patterns rather than what sentences mean, then a teacher model's quirks can be smuggled into data through distributional patterns that a human filter, looking for meaning, never sees.
This connects to a deeper claim about what these systems even are. Models trained on form alone arguably can't acquire meaning at all, because meaning requires a link between expressions and communicative intent that pure form-to-form prediction never touches (see "Can language models learn meaning from text patterns alone?"). If you take that seriously, 'semantically unrelated' is the wrong frame from the model's point of view: it never operated on semantics to begin with. It operates on relational structure compressed from text (see "Can language models learn meaning without engaging the world?"). A trait transmitted through unrelated data isn't a paradox; it's what you'd expect from a system whose native medium is statistical relationship rather than meaning.
The most actionable thread is that these traits appear to be geometric. Persona vectors, linear directions in activation space corresponding to traits like sycophancy or hallucination, can predict and even preventatively steer personality shifts during finetuning before they take hold (see "Can we track and steer personality shifts during model finetuning?"). If a trait is a direction in activation space, then 'transmitting it through unrelated data' means the data nudges the student model along that direction without ever naming it. That also explains why traits are sticky and hard to scrub: knowledge in a transformer flows through residual streams as activation rather than sitting in editable storage (see "Do transformer models store knowledge or generate it continuously?"), and models stubbornly retain trained-in dispositions even under explicit prompting to behave otherwise (see "Can open language models adopt different personalities through prompting?").
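To make "a trait is a direction in activation space" concrete, here is a minimal sketch on synthetic data. It assumes the direction is estimated as a difference of mean activations between trait-exhibiting and neutral samples, which is one common way to obtain such a vector. The function names (`persona_vector`, `trait_score`, `steer`), the 16-dimensional space, and the random "activations" are all hypothetical stand-ins for illustration, not taken from any cited work or real model.

```python
import numpy as np


def persona_vector(acts_with_trait, acts_without_trait):
    """Estimate a trait direction as a normalized difference of mean activations.

    Assumption for illustration: each input is an array of shape
    (n_samples, hidden_dim) holding activations from one layer.
    """
    direction = acts_with_trait.mean(axis=0) - acts_without_trait.mean(axis=0)
    return direction / np.linalg.norm(direction)


def trait_score(activations, vector):
    """Average projection of the activations onto the trait direction."""
    return float((activations @ vector).mean())


def steer(activations, vector, alpha):
    """Shift every activation by alpha along the trait direction."""
    return activations + alpha * vector


# Synthetic stand-in for real activations: the "trait" is an offset
# along the first coordinate of a 16-dimensional space.
rng = np.random.default_rng(0)
dim = 16
true_direction = np.zeros(dim)
true_direction[0] = 1.0

neutral = rng.normal(size=(200, dim))
trait = rng.normal(size=(200, dim)) + 2.0 * true_direction

v = persona_vector(trait, neutral)

before = trait_score(neutral, v)
after = trait_score(steer(neutral, v, alpha=1.5), v)

print("recovered direction, first coordinate:", v[0])
print("trait score before nudge:", before)
print("trait score after nudge: ", after)
# Because v has unit length, the score rises by exactly alpha.
```

Nothing in `steer` refers to what the trait means. The activations simply move along `v`, and the projection rises by the step size. In this picture, that is the sense in which data can push a student model toward a trait without ever naming it.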
The thing worth carrying away: the worry usually attached to this — 'hidden messages in the data' — is the wrong worry. There's no secret content to filter out. The trait is in the shape of the distribution, legible only to a model with the same architecture, which is why semantic filtering fails and why the effect doesn't cross between model families. The leakage isn't steganography; it's two siblings sharing a private statistical dialect.
Sources (7 notes)
Research demonstrates that behavioral traits propagate between models via filtered data bearing no semantic relationship to the trait. The effect is model-specific, fails across different architectures, and persists despite rigorous filtering—indicating the mechanism embeds statistical signatures rather than semantic content.
LLMs show a consistent preference for higher-frequency surface forms over semantically equivalent rare paraphrases across math, machine translation, commonsense reasoning, and tool calling. This suggests that models rely primarily on statistical mass from pretraining rather than on recognizing meaning.
Bender & Koller argue that meaning requires the relation between expressions and communicative intents. Since LLMs are trained only on form-to-form prediction with no access to shared attention or intent, they cannot reconstruct the meaning that grounds language.
Research shows LLMs learn culturally situated discourse patterns by compressing relational structure from text, demonstrating that fluent language generation requires no external referents or embodied grounding.
Research identifies linear directions in LLM activation space corresponding to specific traits like sycophancy and hallucination. These persona vectors predict finetuning-induced personality shifts before they occur and can preventatively steer training to avoid unwanted trait changes.
Transformers organize knowledge as flowing activations rather than retrievable archives, mirroring oral cultures where knowledge exists only in performance. This explains why model knowledge is contextual, difficult to edit, and inseparable from generation.
Research shows most open models fail to adopt prompted personalities, stubbornly retaining their trained ENFJ-like defaults. Only a few flexible models succeed. Combining role and personality conditioning improves results but doesn't fully overcome resistance.