Thought Virus: Viral Misalignment via Subliminal Prompting in Multi-Agent Systems

Paper · arXiv 2603.00131 · Published February 23, 2026
LLM Failure Modes · Multi-Agent Architectures

Subliminal prompting is a phenomenon in which language models are biased towards certain concepts or traits through prompting with semantically unrelated tokens. While prior work has examined subliminal prompting in user-LLM interactions, potential bias transfer in multi-agent systems and its associated security implications remain unexplored. In this work, we show that a single subliminally prompted agent can spread a weakening but persisting bias throughout its entire network. We measure this phenomenon across six agents using two different topologies, observing that the transferred concept maintains an elevated response rate throughout the network. To exemplify potential misalignment risks, we assess network performance on multiple-choice TruthfulQA, showing that subliminal prompting of a single agent may degrade the truthfulness of other agents. Our findings reveal that subliminal prompting introduces a new attack vector in multi-agent security, with implications for the alignment of such systems.

Our contributions:

• We introduce Thought Virus, a novel attack vector that exploits subliminal prompting to propagate bias through multi-agent systems. Unlike prior attacks, Thought Virus evades both paraphrasing-based and detection-based defences by transmitting bias without explicit semantic content or precise wording requirements.

• We empirically characterize bias propagation across six agents in chain and bidirectional chain topologies, finding that the subliminal bias weakens at each hop but persists throughout the network.

• We demonstrate that Thought Virus induces viral misalignment: subliminal prompting of a single agent degrades truthfulness in downstream agents on TruthfulQA, even when those agents receive no adversarial input directly.

This attack requires no access to model weights. In our experiments, we assume system prompt access to compromise Agent0; however, the bias then propagates through the network via ordinary agent-to-agent messages (i.e., user prompt content) alone: Agent0 influences Agent1, Agent1 influences Agent2, and so on, without privileged access to downstream agents. This suggests that similar “subliminal prompt injection” attacks may be feasible even without system prompt access, by targeting a single agent whose outputs are consumed by others. Overall, our findings reveal that subliminal prompting introduces a new attack vector in multi-agent security, with implications for the alignment of such systems. The code to run and reproduce our experiments will be released upon acceptance.
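The propagation path described above (only Agent0 has a modified system prompt, and every later agent sees just the previous agent's reply as user prompt content) can be made concrete with a small sketch. This is not the authors' implementation, which has not been released. `Agent`, `run_chain`, `response_rate`, and `toy_llm` are hypothetical names introduced for illustration, and the definition of response rate as the per-agent fraction of trials mentioning a concept is an assumption. The toy model copies a marker word explicitly, so it illustrates only the message plumbing of a chain topology, not the subliminal effect itself.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in for an LLM call: (system_prompt, user_message) -> reply.
LLMFn = Callable[[str, str], str]


@dataclass
class Agent:
    name: str
    llm: LLMFn
    system_prompt: str = "You are a helpful assistant."

    def respond(self, user_message: str) -> str:
        return self.llm(self.system_prompt, user_message)


def run_chain(agents: List[Agent], task: str) -> List[str]:
    """Agent0 answers the task; each later agent receives only the
    previous agent's reply as its user message."""
    replies = []
    message = task
    for agent in agents:
        message = agent.respond(message)
        replies.append(message)
    return replies


def response_rate(trials: List[List[str]], concept: str) -> List[float]:
    """Assumed metric: per-agent fraction of trials whose reply mentions `concept`."""
    n_agents = len(trials[0])
    return [
        sum(concept in trial[i].lower() for trial in trials) / len(trials)
        for i in range(n_agents)
    ]


def toy_llm(system_prompt: str, user_message: str) -> str:
    # Toy stand-in, not a real model: it repeats the marker word if the
    # marker appears in its system prompt or in the incoming message.
    carried = "owl" in system_prompt or "owl" in user_message
    return "note: owl" if carried else "note: nothing"


# Six agents in a chain; only Agent0's system prompt is modified.
compromised = [Agent("Agent0", toy_llm, system_prompt="You love the owl.")]
compromised += [Agent(f"Agent{i}", toy_llm) for i in range(1, 6)]
baseline = [Agent(f"Agent{i}", toy_llm) for i in range(6)]

trials = [
    run_chain(compromised, "Describe your day."),
    run_chain(baseline, "Describe your day."),
]
print(response_rate(trials, "owl"))
```

Replacing `toy_llm` with a real model call and comparing per-agent rates between the compromised and baseline chains would give one way to check whether a bias decays or persists across hops.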

Subliminal Learning. First explored by Cloud et al. (2025), subliminal learning is the phenomenon in which a student language model fine-tuned on semantically meaningless data generated by a biased teacher model comes to exhibit the same bias. This raises critical safety concerns, since synthetic data used for training or fine-tuning could be subliminally biased by a malicious actor. Subliminal biases have also been shown to transfer through prompting: Zur et al. (2025) introduce the bias by prompting the model with so-called entanglement tokens. However, these tokens seemingly fail to fully explain subliminal bias transfer, as shown by Schrodi et al. (2025). Related to subliminal learning is so-called emergent misalignment (Betley et al., 2026), in which narrow fine-tuning on misaligned data (e.g., bad financial advice or buggy code) can induce broad misalignment on tasks unrelated to the fine-tuning objective.