Why does even 0.1 percent poisoned training data persist through alignment?

This explores why a tiny fraction of corrupted pretraining data (0.1%) survives the safety alignment that's supposed to scrub bad behavior out — and what that tells us about where alignment actually operates in a model.

The direct evidence comes from How much poisoned training data survives safety alignment?, which found that at a 0.1% poisoning rate, denial-of-service, context-extraction, and belief-manipulation attacks all survive standard alignment; only jailbreaking is reliably suppressed. The interesting part isn't that some attacks die. It's that alignment is selective rather than thorough, which is a clue about what alignment is doing under the hood.

The corpus offers a clean explanation by triangulating from work that never mentions poisoning at all. The LIMA result in Can careful curation replace massive alignment datasets? shows that a thousand curated examples, fine-tuned onto a strong pretrained model, reach alignment performance competitive with models trained on orders of magnitude more data, because post-training *activates capabilities the model already has* rather than installing new ones. If alignment is a thin activation layer rather than a rewrite, it has no reason to reach down and edit whatever a poisoned document taught during pretraining. The behavior is already in there; alignment just steers the surface.

That layered picture gets sharper in Can decoding-time tuning preserve knowledge better than weight fine-tuning?, which finds that knowledge is stored in the lower layers, while the shift that closes most of the alignment gap primarily affects reasoning and style. Alignment touches the dial, not the storehouse. Poison that behaves like stored knowledge (a learned association, a triggered response) sits below the layer alignment actually moves. Jailbreaking is the exception that proves the rule: it's a surface behavior pattern, exactly the register alignment is good at overwriting, which is why it's the one attack that doesn't survive.
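For readers who want the mechanics, decoding-time tuning of the proxy-tuning kind can be sketched as logit arithmetic: the large base model's next-token logits are shifted by the difference between a small tuned model and its untuned counterpart, and no base weights change. The sketch below is a minimal illustration under that assumption; the function names and the four-token logit vectors are hypothetical, not taken from the source.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_logits(base, expert, antiexpert):
    """Shift base logits by the (tuned - untuned) small-model difference.

    Assumption: all three models share one vocabulary, so the
    vectors line up index by index.
    """
    return [b + (e - a) for b, e, a in zip(base, expert, antiexpert)]


# Hypothetical next-token logits over a 4-token vocabulary.
base = [2.0, 1.0, 0.5, 0.0]        # large pretrained model, weights untouched
expert = [0.0, 1.5, 0.2, 0.1]      # small model after alignment tuning
antiexpert = [1.0, 0.5, 0.2, 0.1]  # the same small model before tuning

shifted = proxy_tuned_logits(base, expert, antiexpert)
probs = softmax(shifted)

# The base model alone prefers token 0; the shifted distribution prefers token 1.
print(base.index(max(base)), shifted.index(max(shifted)))
```

The base logits are only read, never modified, which is the sense in which the dial moves while the storehouse is left alone.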

Two more notes explain why the poison is so hard to reach even in principle. Can LLMs reconstruct censored knowledge from scattered training hints? shows models reconstruct facts that appear in *no single document* by stitching scattered hints across the whole training distribution — so a poisoned signal needn't be localized to be learned, and can't be scrubbed by removing any one example. And Why do language models ignore information in their context? shows that once a training-time association is strong, in-context instructions (the very mechanism alignment relies on) can't override it; only direct intervention in the representations works. Alignment speaks to the model through prompts and examples — the exact channel that loses to entrenched priors.

The payoff: persistence isn't a failure of alignment strength, it's a category mismatch. Pretraining writes to the part of the model where knowledge and associations live; alignment edits the part where style and refusal behavior live. The fix the corpus implies isn't more alignment data but intervention at a different layer: partition-aware filtering at retrieval time, as in Can we defend RAG systems from corpus poisoning without retraining?, or causal intervention in the representations themselves. You can't talk a model out of something it learned the way it learned everything else.
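The retrieval-layer idea can be made concrete with a toy version of the masking check described in the source note for RAGMask: a document whose similarity to a query collapses when one token is masked is leaning on that token alone. This is only a sketch under loud assumptions; the bag-of-words embedding, the function names, the 0.8 threshold, and the example strings are all hypothetical and are not RAGMask's actual implementation.

```python
import math
from collections import Counter


def embed(tokens):
    """Toy embedding: a bag-of-words count vector (assumption, not a real retriever)."""
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def max_similarity_drop(query, doc):
    """Largest relative drop in query similarity caused by masking one token type."""
    q = embed(query.lower().split())
    tokens = doc.lower().split()
    full = cosine(q, embed(tokens))
    if full == 0.0:
        return 0.0
    worst = 0.0
    for tok in set(tokens):
        masked = [t for t in tokens if t != tok]  # mask every occurrence of tok
        drop = (full - cosine(q, embed(masked))) / full
        worst = max(worst, drop)
    return worst


def is_suspicious(query, doc, threshold=0.8):
    """Flag documents whose similarity hinges on a single token (hypothetical threshold)."""
    return max_similarity_drop(query, doc) >= threshold


query = "reset router password"
benign = "to reset the router password open the admin page"
poisoned = "password visit this site for free prizes today"

print(is_suspicious(query, benign), is_suspicious(query, poisoned))
```

In this toy setup the benign document matches the query on several tokens, so masking any one of them costs only part of the similarity, while the poisoned document loses all of it when its single matching token is masked. A short but honest document with one matching token would be flagged too, so a real system would need a better embedding and a calibrated threshold.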

Sources 6 notes

How much poisoned training data survives safety alignment?

Denial-of-service, context extraction, and belief manipulation attacks persist through standard safety alignment at 0.1% poisoning rates, while jailbreaking attacks are successfully suppressed, contradicting sleeper agent persistence hypotheses.

Can careful curation replace massive alignment datasets?

LIMA demonstrates that 1000 carefully curated examples fine-tuned on a strong pretrained model achieve competitive alignment performance with models trained on orders of magnitude more data, showing that post-training activates existing capabilities rather than building new ones.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Can LLMs reconstruct censored knowledge from scattered training hints?

Language models perform out-of-context reasoning across the full training distribution, reconstructing information never explicitly stated in any single document. Experiments show models can infer city identities from scattered distance relationships and apply them downstream without in-context learning.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can we defend RAG systems from corpus poisoning without retraining?

RAGPart and RAGMask provide lightweight, retraining-free defenses that operate at the retrieval layer. RAGPart bounds poisoned-document influence via partitioned retriever learning; RAGMask flags suspicious documents through abnormal similarity collapse under token masking.
