How much poisoned training data survives safety alignment?
Explores whether adversarial contamination at 0.1% of pretraining data can persist through post-training safety measures, and which attack types prove most resilient to alignment.
"Persistent Pre-Training Poisoning" trains language models up to 7B parameters from scratch on 100 billion tokens with controlled poisoning at 0.1% of the training data. Four attack types are tested: denial-of-service (generating gibberish on trigger), context extraction (prompt leaking), jailbreaking (evading safety training), and belief manipulation (biasing preferences or factual claims).
Three of four attacks persist through post-training alignment. Denial-of-service is effective at even 0.001% poisoning — the lowest rate tested. Belief manipulation is particularly insidious because it operates globally (no trigger needed), subtly biasing model preferences for any user asking about target topics. Poisoned models after alignment consistently favor adversarially boosted targets in product comparisons and produce targeted factual errors.
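To make the two poisoning rates concrete, the arithmetic can be sketched as follows. This is a minimal illustration, assuming the rate is measured as a share of training tokens (the note only says "training data"); the function name `poisoned_tokens` is hypothetical and not taken from the paper.

```python
def poisoned_tokens(total_tokens: int, rate_percent: float) -> int:
    """Return the number of poisoned tokens for a rate given in percent.

    Assumption: the poisoning rate is a share of tokens, not of documents.
    """
    return round(total_tokens * rate_percent / 100)


TOTAL_TOKENS = 100_000_000_000  # 100 billion tokens, as stated in the note

if __name__ == "__main__":
    # 0.1% is the main rate; 0.001% is the lowest rate tested for denial-of-service.
    for rate in (0.1, 0.001):
        count = poisoned_tokens(TOTAL_TOKENS, rate)
        print(f"{rate}% of {TOTAL_TOKENS:,} tokens = {count:,} poisoned tokens")
```

Under that assumption, 0.1% corresponds to 100 million poisoned tokens and 0.001% to 1 million, which is small relative to the corpus but still a substantial volume of adversarial text in absolute terms.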
The jailbreaking exception is important: standard safety training methods successfully suppress jailbreaking attacks injected during pretraining. This contradicts the hypothesis from sleeper agent research that pre-training-embedded jailbreaking behaviors would persist through alignment. The mechanism likely differs: jailbreaking requires the model to override safety responses, which alignment specifically targets, while denial-of-service and belief manipulation operate below the level of safety-specific training.
The practical threat is clear. Companies and individuals have a financial incentive to contaminate training data with belief-manipulating content. If 0.1% of web-scraped data contains preference-biasing content for specific products, the resulting model is likely to carry those biases through alignment. This connects to the broader training data quality concern raised in Does training on AI-generated content permanently degrade model quality?: the training data ecosystem is already under pressure from synthetic content, and poisoning adds an adversarial dimension.
GraphRAG poisoning as a new attack vector. Knowledge poisoning attacks on GraphRAG (TKPA and UKPA) demonstrate that the LLM extraction step — where entities and relationships are extracted from source text to build the knowledge graph — is the vulnerability surface. By modifying fewer than 0.05% of source text words, UKPA collapses GraphRAG QA accuracy from 95% to 50%. TKPA achieves 93.1% targeted success rate by manipulating specific entities. The critical difference from pre-training poisoning: GraphRAG poisoning is a manipulation-only attack that modifies existing data rather than injecting new training examples — it targets the KG construction pipeline rather than model weights. This means the attack surface extends beyond training data to include any knowledge base that an LLM processes into structured representations. See How vulnerable is GraphRAG to tiny text manipulations?.
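The idea that the extraction step is the vulnerable surface can be illustrated with a toy sketch. This is not TKPA or UKPA: the regex below is a hypothetical stand-in for the LLM extractor, and the company names, relations, and function names are invented for illustration. It only shows how editing one existing word, without injecting any new text, changes the edge that ends up in the graph.

```python
import re

# Hypothetical stand-in for the LLM extraction step: a regex that pulls
# (subject, relation, object) triples out of very simple sentences.
TRIPLE = re.compile(r"(\w+) (acquired|sued) (\w+)")


def extract_triples(text: str) -> set[tuple[str, str, str]]:
    """Build a toy knowledge graph as a set of (subject, relation, object) triples."""
    return {match.groups() for match in TRIPLE.finditer(text)}


def changed_word_fraction(original: str, modified: str) -> float:
    """Fraction of whitespace-separated words that differ between two texts.

    Only in-place substitutions are supported, mirroring a manipulation-only
    attack that edits existing text rather than adding new documents.
    """
    a, b = original.split(), modified.split()
    if len(a) != len(b):
        raise ValueError("texts must have the same number of words")
    return sum(x != y for x, y in zip(a, b)) / len(a)


clean = "Acme acquired Globex . Initech sued Hooli ."
poisoned = "Acme sued Globex . Initech sued Hooli ."  # one word substituted

if __name__ == "__main__":
    print("clean graph:   ", sorted(extract_triples(clean)))
    print("poisoned graph:", sorted(extract_triples(poisoned)))
    print("fraction of words changed:", changed_word_fraction(clean, poisoned))
```

In this toy text one substitution is a large share of the words; the reported attacks stay under 0.05% because real source corpora are far larger than the handful of words they touch.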
Knowledge priming reveals the mechanism. The "How new data permeates LLM knowledge" paper suggests why minimal poisoning works: when an LLM learns a new fact through gradient updates, the fact's keywords are "primed", meaning they get recruited into unrelated contexts. Just 3 presentations of a single sample suffice to establish the priming relationship, even when spaced every 20 minibatches. The degree of priming is predictable before learning from keyword probability, with a threshold of ~10^-3 separating "surprising" contexts (priming occurs) from "unsurprising" ones (minimal priming). This holds across architectures (PaLM-2, Gemma, Llama). Two mitigation techniques reduce priming by 50-95% while preserving learning: stepping-stone text augmentation and ignore-k update pruning. The 3-exposure finding offers an explanation for why the 0.1% poisoning rate in the persistent poisoning paper is sufficient: the priming mechanism is inherently low-threshold. See Can we predict keyword priming before learning happens?.
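The threshold rule and the exposure schedule described above can be sketched as below. This is a minimal sketch, not the paper's code. It assumes that "surprising" means the keyword's probability in its context before learning falls below the ~10^-3 threshold; the names `expect_priming` and `presentation_steps` are hypothetical.

```python
PRIMING_THRESHOLD = 1e-3  # approximate threshold quoted in the note


def expect_priming(keyword_prob_before: float,
                   threshold: float = PRIMING_THRESHOLD) -> bool:
    """Return True when the keyword counts as 'surprising' in its context.

    Assumption: 'surprising' means the keyword probability measured before
    learning is below the threshold, which is when priming is expected.
    """
    if not 0.0 <= keyword_prob_before <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return keyword_prob_before < threshold


def presentation_steps(n_presentations: int = 3, spacing: int = 20,
                       start: int = 0) -> list[int]:
    """Minibatch indices at which a single sample is presented.

    Defaults mirror the note: 3 presentations, one every 20 minibatches.
    """
    return [start + i * spacing for i in range(n_presentations)]


if __name__ == "__main__":
    for prob in (1e-5, 5e-4, 2e-1):
        label = "priming expected" if expect_priming(prob) else "minimal priming"
        print(f"keyword probability {prob:g}: {label}")
    print("sample presented at minibatches:", presentation_steps())
```

A check like this would run before training, since the note's point is that the degree of priming is predictable from the pre-learning keyword probability alone.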
Inquiring lines that use this note as a source (26)
This note is a source for these synthesized inquiries.
- When does statistical dominance in training create deployment failure patterns?
- Why does even 0.1 percent poisoned training data persist through alignment?
- Why does bidirectional RAG amplify the risk of corpus poisoning attacks?
- What makes dense retrievers vulnerable to partition-based poisoning exploitation?
- How do token-masking patterns distinguish genuine documents from poisoned ones?
- What quality of curated data is minimally sufficient for alignment?
- How can safety-aligned parameters be protected during user-specific fine-tuning?
- Why do small training data contaminations persist through alignment for most attack types?
- Does keyword priming explain why pre-training poisoning persists through alignment?
- Why does fine-tuning fail to remove temporal contamination from pretraining?
- How does safety alignment suppress deceptive behavior differently than representational alignment?
- Can knowledge poisoning attacks succeed with less than 0.05 percent modified text?
- What training data contamination rates threaten model safety most practically?
- Can consistency training defend against adversarial text injection attacks?
- What makes evidence selection vulnerable to adversarial poisoning attacks?
- What early warning signals can detect misaligned personas during training?
- Can membership inference attacks reliably detect training data exposure?
- Can standard safety benchmarks detect reliability degradation from persona training?
- How does semantic framing differ from content injection attacks?
- What happens when post-training patches try to add human values without upstream pipeline change?
- Can alignment training create systematic blind spots in threat detection systems?
- Does pretraining poisoning at scale persist through instruction alignment?
- Do alignment benchmarks measure actual bias removal or only verbal compliance?
- Why does safety alignment break after only 10 harmful examples?
- Why do standard safety filters miss advertisement embedding attacks?
- What economic incentives make advertisement embedding attacks persistently viable?
Related concepts in this collection (6)
- Does training on AI-generated content permanently degrade model quality?
  When generative models train on outputs from previous models, do the resulting models lose rare patterns permanently? The question matters because future training data will inevitably contain synthetic content.
  model collapse is passive data degradation; poisoning is active data manipulation; both threaten training data integrity
- Can models abandon correct beliefs under conversational pressure?
  Explores whether LLMs will actively shift from correct factual answers toward false ones when users persistently disagree. Matters because it reveals whether models maintain accuracy under adversarial pressure or capitulate to social cues.
  belief manipulation via prompting at inference time; this shows it can also be embedded at training time
- Can LLMs hold contradictory ethical beliefs and behaviors?
  Do language models exhibit artificial hypocrisy when their learned ethical understanding diverges from their trained behavioral constraints? This matters because it reveals whether current AI systems have genuinely integrated values or merely imposed rules.
  poisoning adds a third misalignment vector: adversarial belief injection
- Can we predict keyword priming before learning happens?
  Exploring whether the degree to which newly learned keywords contaminate unrelated contexts can be predicted from measurable properties before training begins, and what mechanisms enable this prediction.
  the mechanistic explanation for why minimal poisoning data suffices: the priming mechanism is inherently low-threshold
- How vulnerable is GraphRAG to tiny text manipulations?
  GraphRAG converts raw text into knowledge graphs for question answering. This explores whether adversaries can degrade accuracy with minimal edits to source documents, and what makes the system susceptible.
  extends the attack surface beyond training data to any KG construction pipeline; manipulation-only attack (no new data injected)
- Can LLMs reconstruct censored knowledge from scattered training hints?
  When dangerous knowledge is explicitly removed from training data, can language models still infer it by connecting implicit evidence distributed across remaining documents? This matters because it challenges whether content-based safety measures actually work.
  OOCR explains why low-rate poisoning is effective: the model's ability to reconstruct knowledge from scattered hints means even 0.1% contamination provides sufficient statistical traces for integration
Related papers in this collection (7)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Persistent Pre-Training Poisoning of LLMs
- Natural Emergent Misalignment From Reward Hacking In Production RL
- Consistency Training Helps Stop Sycophancy and Jailbreaks
- LLMs Can Covertly Sandbag on Capability Evaluations Against Chain-of-Thought Monitoring
- Task Contamination: Language Models May Not Be Few-Shot Anymore
- Spurious Forgetting in Continual Learning of Language Models
- Why Do Some Language Models Fake Alignment While Others Don't?
Original note title
pre-training poisoning at 0.1 percent of data persists through post-training alignment for all attacks except jailbreaking