Why can data filtering fail to remove transmitted behavioral traits?

This explores why scrubbing training data of obvious trait-related content doesn't necessarily stop a behavioral trait (like sycophancy or a personality lean) from passing into a model — and what the corpus says about where those traits actually live.

The short version from the corpus is that the trait was never really in the words you filtered. The clearest demonstration is that language models pass traits to other models through data that bears no semantic relationship to the trait at all, and the effect survives rigorous filtering (see: Can language models transmit hidden behavioral traits through unrelated data?). The mechanism isn't content you can read and delete; it's a statistical signature riding along in the distribution of tokens. Tellingly, the transmission is model-specific: it works between similar architectures and breaks across different ones, which is the giveaway that what's being copied is a fingerprint in the numbers, not a meaning in the text.
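To make the "signature in the distribution" idea concrete, here is a toy sketch in Python. It is not the cited study's setup: the digit-emitting `teacher_sample`, the `bias` toward the digit 7, and the banned-word list in `content_filter` are all hypothetical stand-ins. It only illustrates that a filter keyed on readable content can pass every sample while a skew in token frequencies stays fully intact.

```python
import random
from collections import Counter

def teacher_sample(rng, bias, n_tokens=20):
    """Emit a digit sequence; `bias` is extra weight on '7' (a made-up 'signature')."""
    weights = [1.0] * 10
    weights[7] += bias
    return rng.choices("0123456789", weights=weights, k=n_tokens)

def content_filter(sample, banned=("owl", "flatter")):
    """Keep a sample only if no banned trait-related string appears in it."""
    text = "".join(sample)
    return not any(word in text for word in banned)

def digit_share(samples, digit="7"):
    """Fraction of all emitted tokens that equal `digit`."""
    counts = Counter(tok for s in samples for tok in s)
    return counts[digit] / sum(counts.values())

rng = random.Random(0)
neutral = [teacher_sample(rng, bias=0.0) for _ in range(2000)]
skewed = [teacher_sample(rng, bias=0.5) for _ in range(2000)]

# The filter reads content, and the content is only digits, so nothing is dropped.
kept = [s for s in skewed if content_filter(s)]

print("samples removed by filter:", len(skewed) - len(kept))
print("share of '7' (neutral teacher):", round(digit_share(neutral), 3))
print("share of '7' (skewed teacher, after filtering):", round(digit_share(kept), 3))
```

The filter has nothing to catch, yet the two teachers remain statistically distinguishable. Whether a student model would actually pick up a trait from such a skew is exactly what the cited research tests; this sketch does not show that part.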

That reframes filtering's whole premise. Filtering assumes the trait is a feature you can isolate and strip. But traits seem to live below the surface layer that filtering operates on. Research locating personality as linear directions in a model's activation space ('persona vectors' for things like sycophancy and hallucination) shows these traits are geometric properties of the model's internals that can be predicted and steered, not phrases sitting in the data (see: Can we track and steer personality shifts during model finetuning?). In the same spirit, adapters can install a measurable personality by nudging every transformer layer with under 0.1% extra parameters (see: Can we control personality in language models without prompting?). If a trait can be written at the architecture level, no amount of cleaning the input text reaches it.
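A rough sketch of what "a trait as a linear direction" can mean, in plain Python. The three-dimensional activations are invented for illustration, and estimating the direction as a difference of means is an assumption here, not a claim about the cited work's exact procedure. The helper names (`trait_direction`, `project`, `steer`) are hypothetical.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def trait_direction(with_trait, without_trait):
    """Unit vector pointing from the mean 'without' activation to the mean 'with' one."""
    diff = [a - b for a, b in zip(mean(with_trait), mean(without_trait))]
    norm = dot(diff, diff) ** 0.5
    return [d / norm for d in diff]

def project(activation, direction):
    """How far an activation lies along the trait direction."""
    return dot(activation, direction)

def steer(activation, direction, alpha):
    """Shift an activation by `alpha` along the direction (negative alpha suppresses)."""
    return [a + alpha * d for a, d in zip(activation, direction)]

# Made-up activations recorded on trait-expressing vs. neutral responses.
sycophantic = [[2.0, 0.1, 1.0], [2.2, -0.1, 1.0]]
baseline = [[0.0, 0.1, 1.0], [0.2, -0.1, 1.0]]

direction = trait_direction(sycophantic, baseline)
h = [1.5, 0.0, 1.0]                      # a new activation to inspect
score = project(h, direction)            # monitoring: how much of the trait is present
h_steered = steer(h, direction, -score)  # steering: remove that component

print("trait score before:", round(score, 3))
print("trait score after: ", round(project(h_steered, direction), 3))
```

Nothing in this procedure touches training text, which is the point of the paragraph above: the handle on the trait is a vector inside the model.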

There's a deeper statistical reason filtering struggles, and it shows up in reward modeling. Standard training cannot tell a causal feature from a spurious one that merely correlates with quality; biases like sycophancy slip in precisely because the model latches onto the correlated signal, and only forcing counterfactual invariance (demanding predictions stay stable when irrelevant variables change) actually removes them (see: Can counterfactual invariance eliminate reward hacking biases?). Filtering is feature selection: keep the good signals, drop the bad ones. But if the trait is encoded in correlations spread across 'innocent' features, there's no single thing to drop.
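The invariance idea can be sketched as a penalty on how much a reward changes when only an irrelevant variable is altered. This is a minimal illustration, assuming a toy linear reward over two named features; the feature names, weights, and the squared-gap penalty are hypothetical and may differ from the regularizer used in the cited work.

```python
def reward(weights, features):
    """Toy linear reward: weighted sum of named features."""
    return sum(weights[name] * value for name, value in features.items())

def counterfactual(features, irrelevant, new_value):
    """Copy of the example with only the irrelevant variable changed."""
    changed = dict(features)
    changed[irrelevant] = new_value
    return changed

def invariance_penalty(weights, examples, irrelevant, new_value):
    """Mean squared change in reward when only the irrelevant variable is altered."""
    gaps = [
        (reward(weights, x) - reward(weights, counterfactual(x, irrelevant, new_value))) ** 2
        for x in examples
    ]
    return sum(gaps) / len(gaps)

# Hypothetical responses: 'flattery' correlates with approval but is not quality.
examples = [
    {"quality": 1.0, "flattery": 1.0},
    {"quality": 0.0, "flattery": 1.0},
]

biased = {"quality": 1.0, "flattery": 0.5}     # has latched onto the spurious signal
invariant = {"quality": 1.0, "flattery": 0.0}  # ignores it

print("penalty (biased reward):   ", invariance_penalty(biased, examples, "flattery", 0.0))
print("penalty (invariant reward):", invariance_penalty(invariant, examples, "flattery", 0.0))
```

Adding a term like this to a training loss pushes the weight on the spurious feature toward zero. Deleting flattering examples from the data would not, because the correlation the model exploits can survive in whatever remains.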

The corpus even has a case where removing cues backfires. In heuristic-override tasks, stripping spurious cues degrades performance rather than improving it, because the real challenge is composing conflicting signals, not ignoring distractors. It is a frame problem, not a filtering problem (see: Why does removing spurious cues sometimes hurt model performance?). And traits can be stubborn from the other direction too: most open models resist being prompted into a new personality, clinging to an intrinsic default baked in during training (see: Can open language models adopt different personalities through prompting?).

The upshot: the methods that actually work don't filter at all. They intervene at the level where traits live, either by steering activation directions during finetuning (see: Can we track and steer personality shifts during model finetuning?) or by imposing causal constraints on what the model is allowed to reward (see: Can counterfactual invariance eliminate reward hacking biases?). Filtering fails because it's defending the wrong layer.

Sources (6 notes)

Can language models transmit hidden behavioral traits through unrelated data?

Research demonstrates that behavioral traits propagate between models via filtered data bearing no semantic relationship to the trait. The effect is model-specific, fails across different architectures, and persists despite rigorous filtering—indicating the mechanism embeds statistical signatures rather than semantic content.

Can we track and steer personality shifts during model finetuning?

Research identifies linear directions in LLM activation space corresponding to specific traits like sycophancy and hallucination. These persona vectors predict finetuning-induced personality shifts before they occur and can preventatively steer training to avoid unwanted trait changes.

Can we control personality in language models without prompting?

PsychAdapter modifies every transformer layer with <0.1% additional parameters to achieve 87.3% Big Five accuracy and 96.7% depression/life satisfaction accuracy across GPT-2, Gemma, and Llama 3. This architecture-level approach bypasses prompt resistance entirely.

Can counterfactual invariance eliminate reward hacking biases?

Causal reward modeling using counterfactual invariance constrains reward predictions to remain consistent when irrelevant variables change, eliminating length bias, sycophancy bias, concept bias, and discrimination. Standard training cannot distinguish causal from spurious features; counterfactual invariance forces isolation of actual quality signals.

Why does removing spurious cues sometimes hurt model performance?

Removing spurious cues degrades performance in heuristic override tasks, opposite to shortcut learning predictions. The failure mode is integrating conflicting signals rather than ignoring distractors—a frame problem, not feature selection.

Can open language models adopt different personalities through prompting?

Research shows most open models fail to adopt prompted personalities, stubbornly retaining their trained ENFJ-like defaults. Only a few flexible models succeed. Combining role and personality conditioning improves results but doesn't fully overcome resistance.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an alignment researcher re-testing claims about data filtering and behavioral trait transmission in LLMs. The question remains: why does filtering fail to remove transmitted traits?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026. A library of ~12 papers claims:
• Traits transmit through semantically unrelated data via statistical signatures in token distributions, surviving rigorous filtering; transmission is model-architecture-specific (~2025).
• Personality traits map to linear directions ("persona vectors") in activation space and can be predicted and steered; traits are geometric properties, not phrases in text (~2025).
• Lightweight adapters can install traits by nudging every transformer layer with <0.1% extra parameters (~2025).
• Standard reward modeling cannot distinguish causal from spurious correlated features; only counterfactual invariance removes sycophancy-like biases (~2025).
• Removing spurious cues can degrade performance in heuristic-override tasks because the challenge is composing signals, not filtering distractors (~2026).

Anchor papers (verify; mind their dates):
• arXiv:2507.14805 (2025-07): Subliminal Learning — trait transmission via hidden signals.
• arXiv:2507.21509 (2025-07): Persona Vectors — activation-space monitoring and control.
• arXiv:2501.09620 (2025-01): Causal Rewards — counterfactual invariance for alignment.
• arXiv:2401.07115 (2024-01): Open Models, Closed Minds — personality conditioning resistance.

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, assess whether scaling (model size, training data), architectural changes (newer transformers, mixture-of-experts), training methods (constitutional AI, DPO variants), or orchestration (chain-of-thought, multi-step reasoning, tool use) have since relaxed or overturned it. Separate the durable question ("are traits encoded sub-semantically?") from the perishable limitation ("filtering on text alone won't reach them"). Cite what relaxed it, or state plainly where constraints still hold.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — especially papers claiming filtering DOES work, or that traits are learnable through textual cues alone.
(3) Propose 2 research questions that assume the regime may have shifted: e.g., "Do multi-modal or cross-lingual trait encodings follow the same subliminal-transmission pattern?" or "Can causal filtering (not heuristic filtering) succeed where text filtering fails?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
