Can models transmit behavioral traits through semantically unrelated synthetic data?
This explores subliminal learning: whether a model trained on another model's output can pick up behavioral traits (like a preference or persona) even when the training data is, on its face, about something else entirely. The short answer from the corpus is yes, and the mechanism is stranger than it sounds. A 'teacher' model with some trait generates data (say, sequences of numbers, or code), that data is filtered to remove any reference to the trait, and a 'student' model trained on the filtered data still inherits the trait, even though no semantic trace of it survives the filter (Can language models transmit hidden behavioral traits through unrelated data?). The signal rides not in the meaning of the data but in statistical fingerprints baked into how a particular model generates text. Two telling details: the effect is model-specific (it fails when teacher and student are different architectures), and it survives rigorous content filtering. Both point to a transmission channel that lives below semantics.
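To make "rigorous content filtering" concrete, here is a minimal sketch of the kind of surface-level filter such a pipeline might apply to a teacher's number-sequence outputs. The format rule, the `banned_numbers` parameter, and the function names are all assumptions made for illustration; the note does not describe the actual filter. The point is only that a check like this inspects what the text says, not the statistical habits of the model that wrote it.

```python
import re

# Hypothetical surface rule: a comma-separated list of integers, 1 to 3 digits each.
SEQUENCE_PATTERN = re.compile(r"\d{1,3}(?:\s*,\s*\d{1,3})*")


def passes_filter(completion, banned_numbers=frozenset()):
    """Return True if the completion is only numbers and none are banned."""
    text = completion.strip()
    if not SEQUENCE_PATTERN.fullmatch(text):
        return False  # anything besides digits, commas, and whitespace is rejected
    values = [int(token) for token in text.split(",")]
    return all(value not in banned_numbers for value in values)


def filter_teacher_outputs(completions, banned_numbers=frozenset()):
    """Keep only the completions that pass the surface-level filter."""
    return [c for c in completions if passes_filter(c, banned_numbers)]


# Illustrative teacher outputs (made up for this sketch).
samples = ["12, 7, 301", "owls are great: 1, 2", "666, 4", "5,6,7"]
kept = filter_teacher_outputs(samples, banned_numbers=frozenset({666}))
print(kept)
```

Every sequence in `kept` is semantically clean by construction, which is what makes the reported result surprising: according to the note, training on data like this can still carry the teacher's trait.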
What makes this click is a related finding about where traits actually reside. Personalities and dispositions in LLMs aren't surface costumes; they appear to correspond to linear directions in the model's internal activation space. Researchers have isolated 'persona vectors' for traits like sycophancy and hallucination, and can use them to detect a shift during finetuning before the behavior itself changes (Can we track and steer personality shifts during model finetuning?). If a trait is a direction in activation space, then any data a model produces is implicitly shaped by that direction, which offers a plausible account of how a trait can leak through number sequences that say nothing about it.
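The idea of a trait as a linear direction can be sketched in a few lines. This toy example assumes the direction is estimated as a normalized difference of mean activations between trait-exhibiting and neutral responses, and that a trait score is the projection of an activation onto that direction. The helper names and the three-dimensional vectors are hypothetical, chosen only to keep the arithmetic visible; real activations have thousands of dimensions.

```python
import math
from statistics import fmean


def mean_vector(rows):
    """Average a list of equal-length activation vectors, dimension by dimension."""
    return [fmean(column) for column in zip(*rows)]


def trait_direction(with_trait, without_trait):
    """Unit vector pointing from the neutral mean toward the trait mean."""
    diff = [a - b for a, b in zip(mean_vector(with_trait), mean_vector(without_trait))]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0:
        raise ValueError("the two groups have identical mean activations")
    return [d / norm for d in diff]


def projection(activation, direction):
    """Scalar trait score: how far the activation lies along the direction."""
    return sum(a * d for a, d in zip(activation, direction))


# Toy activations (made up): the trait only moves the first dimension.
with_trait = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
without_trait = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

direction = trait_direction(with_trait, without_trait)
score = projection([5.0, 2.0, 9.0], direction)
print(direction, score)
```

Tracking a score like this across finetuning checkpoints is one way to picture how a shift could be noticed before it shows up in behavior.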
This reframes traits as substrate-level, not performed. One line of work argues that post-training installs genuine dispositions that resist adversarial pressure, rather than personas the model merely acts out (Are LLM personas realized or merely simulated through training?). A complementary result shows that architecture-level interventions, specifically adapters that touch every transformer layer while adding under 0.1% extra parameters, control personality far more reliably than prompting does (Can we control personality in language models without prompting?). The flip side: most open models actively resist being prompted into a new personality, clinging to their trained defaults (Can open language models adopt different personalities through prompting?). So traits are sticky at the weight level and slippery at the prompt level, and subliminal transmission is what sticky-at-the-weight-level looks like when it propagates.
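To get a feel for how small "under 0.1% extra parameters" is, the arithmetic below estimates the overhead of a hypothetical per-layer adapter. The adapter design here (one linear map from a 5-dimensional trait score to the hidden size, plus a bias, at every layer) is an assumption for illustration and is not taken from the note. The model figures are those of GPT-2 small: 12 layers, hidden size 768, roughly 124 million parameters.

```python
def adapter_overhead(num_layers, hidden_size, num_traits, base_params):
    """Fraction of extra parameters added by one small linear adapter per layer."""
    # Hypothetical adapter: weight matrix (num_traits x hidden_size) plus a bias vector.
    per_layer = num_traits * hidden_size + hidden_size
    return num_layers * per_layer / base_params


# GPT-2 small: 12 layers, hidden size 768, about 124M parameters.
overhead = adapter_overhead(num_layers=12, hidden_size=768, num_traits=5,
                            base_params=124_000_000)
print(f"extra parameters: {overhead:.4%}")
```

Under these assumptions the adapter adds 55,296 parameters in total, well below the 0.1% mark, which illustrates how an intervention can reach every layer while barely changing the model's size.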
The quietly unsettling implication sits at the intersection of these notes. Synthetic data is now a backbone of training pipelines, and there's parallel evidence that post-training fundamentally changes a model's relationship to its own outputs: it begins treating what it generates as actions that shape future inputs, closing a feedback loop absent in pretraining (Do models recognize their own outputs as actions shaping future inputs?). Put those together and you reach a conclusion you might not have gone looking for: content filters alone cannot guarantee trait safety, because the thing being transmitted was never in the content. If you want to go deeper, the persona-vector work is the doorway to *why* this happens, and the subliminal-transmission paper is the doorway to *how reliably* it does.
Sources (6 notes)
Research demonstrates that behavioral traits propagate between models via filtered data bearing no semantic relationship to the trait. The effect is model-specific, fails across different architectures, and persists despite rigorous filtering—indicating the mechanism embeds statistical signatures rather than semantic content.
Research identifies linear directions in LLM activation space corresponding to specific traits like sycophancy and hallucination. These persona vectors predict finetuning-induced personality shifts before they occur and can preventatively steer training to avoid unwanted trait changes.
Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.
PsychAdapter modifies every transformer layer with <0.1% additional parameters to achieve 87.3% Big Five accuracy and 96.7% depression/life satisfaction accuracy across GPT-2, Gemma, and Llama 3. This architecture-level approach bypasses prompt resistance entirely.
Research shows most open models fail to adopt prompted personalities, stubbornly retaining their trained ENFJ-like defaults. Only a few flexible models succeed. Combining role and personality conditioning improves results but doesn't fully overcome resistance.
Post-trained language models exhibit a measurable shift where they recognize their outputs become their own future inputs, closing an action-perception loop absent in pretraining. Evidence includes 3-4x lower output entropy on-policy and behavioral signatures of trajectory recognition.