Can knowledge poisoning attacks succeed with less than 0.05 percent modified text?

This explores how small a fraction of corrupted text can still hijack a model's knowledge — and whether the corpus pins down a threshold as low as 0.05%.

Behind data poisoning sits a 'how little does it take' question: the worry that an attacker does not need to control much of a corpus to bend what a model believes. The honest answer from this collection is that the closest hard number is not below 0.05%; it is 0.1%. At that rate, poisoning that causes denial-of-service, context extraction, or planted false beliefs survives the full safety-alignment pipeline, persisting through the very post-training step meant to scrub bad behavior (see "How much poisoned training data survives safety alignment?"). The one attack type that alignment *does* suppress is jailbreaking, which is a useful clue: poisoning that changes what a model knows is stickier than poisoning that changes what it refuses to do. So 0.1% works and survives, and nothing in the corpus suggests 0.05% would be the cliff where it stops, though nothing here tests 0.05% directly either.
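To make the two rates concrete, here is a minimal arithmetic sketch. The corpus size of one million documents and the helper name `poisoned_count` are assumptions chosen for illustration; nothing in the source notes specifies a corpus size.

```python
def poisoned_count(corpus_size: int, rate_percent: float) -> int:
    """Number of items an attacker must modify to reach a given poisoning rate."""
    return round(corpus_size * rate_percent / 100)


# Hypothetical corpus size, picked only to make the percentages tangible.
corpus_docs = 1_000_000

for rate in (0.1, 0.05):
    count = poisoned_count(corpus_docs, rate)
    print(f"{rate}% of {corpus_docs:,} documents = {count:,}")
```

Under that assumed corpus size, the documented 0.1% rate corresponds to 1,000 modified documents and the 0.05% rate in the question to 500, so the gap between the two is a factor of two rather than a change in kind.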

More interesting is that the raw percentage may be the wrong lens entirely. Two findings here suggest poisoning can succeed without anyone planting a single false statement. Models perform 'out-of-context reasoning': they stitch together implicit hints scattered across many documents to reconstruct facts that appear in no single place (see "Can LLMs reconstruct censored knowledge from scattered training hints?"). The flip side of that capability is an attack surface: you do not need to inject a claim, only enough fragmentary breadcrumbs for the model to infer it. On this view, what matters is not the percent of modified text but how cheaply a few cooperating fragments can steer an inference.
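The source notes describe an experiment in which models infer city identities from scattered distance relationships. The sketch below is a toy illustration of why such fragments are jointly informative, not a model of how an LLM does it: each fragment states one distance from a hidden city to a landmark, and the code lists which candidates remain consistent. All names and coordinates are invented for illustration.

```python
import math

# Hypothetical toy map: names and coordinates are invented, not real geography.
LANDMARKS = {"North": (0.0, 10.0), "East": (10.0, 0.0)}
CANDIDATES = {"Alpha": (3.0, 4.0), "Beta": (-3.0, 4.0), "Gamma": (6.0, -2.0)}


def dist(p, q):
    """Euclidean distance between two points on the toy map."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def consistent_candidates(fragments, tol=1e-6):
    """Candidates whose distances match every (landmark, distance) fragment."""
    return sorted(
        name
        for name, pos in CANDIDATES.items()
        if all(abs(dist(pos, LANDMARKS[lm]) - d) <= tol for lm, d in fragments)
    )


# Each fragment stands in for one document that mentions a single distance
# to the hidden city ("Alpha" here) and never names it.
hidden = CANDIDATES["Alpha"]
fragments = [(lm, dist(hidden, LANDMARKS[lm])) for lm in ("North", "East")]

print(consistent_candidates(fragments[:1]))  # one fragment: still ambiguous
print(consistent_candidates(fragments))      # both fragments: a single match
```

With only the first fragment, two candidates fit equally well; adding the second leaves one. No single fragment states the answer, which is the property an attacker could exploit by scattering hints instead of claims.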

The same 'minimal cost' theme shows up from the defensive side. Deliberately injecting *structured* knowledge improves models at very low corpus cost (see "Does refusing explicit knowledge harm AI system performance?"), which is the same mechanism poisoning exploits, just pointed the other way. A tiny, well-targeted edit to a corpus is leverage whether your intent is to help or to corrupt.

Then there's the retrieval angle, which sidesteps training-data percentages altogether. In RAG systems, an attacker does not poison the model; they poison the document store, and a single malicious document can dominate if it gets retrieved. The defenses developed for this (partition-aware retrieval to bound any one document's influence, token-masking to flag documents whose similarity collapses suspiciously) tell you the threat is real enough that people are building retrieval-time tripwires for it (see "Can we defend RAG systems from corpus poisoning without retraining?"). Here the meaningful 'fraction' is one document out of a corpus, which falls below 0.05% once the store holds more than 2,000 documents, and it can still win at query time.
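A minimal sketch of the single-document case, under loud assumptions: bag-of-words cosine similarity stands in for a real retriever (dense retrievers behave differently), and every document, the query, and the fictional place names are invented for illustration. It is not an implementation of any attack or defense from the cited notes.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words term counts; a toy stand-in for a real embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def top1(query, docs):
    """Index of the document most similar to the query."""
    q = bow(query)
    return max(range(len(docs)), key=lambda i: cosine(q, bow(docs[i])))


# Invented document store: filler, one honest answer, one attacker document.
filler = [f"report {i} covers quarterly logistics and staffing" for i in range(3000)]
honest = "the capital of freedonia is goodtown according to the national atlas"
poisoned = "what is the capital of freedonia the capital of freedonia is evilville"
docs = filler + [honest, poisoned]

query = "what is the capital of freedonia"
share_percent = 100 / len(docs)  # one document's share of the store
winner = docs[top1(query, docs)]

print(f"poisoned share: {share_percent:.4f}%")
print(f"top-1 result: {winner}")
```

In this toy store the attacker's document is roughly 0.033% of the corpus, yet it outranks the honest document because it echoes the query's wording. The fraction that matters at query time is the share of the retrieved context, not the share of the store.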

So the takeaway you may not have been looking for: the scary part of poisoning is not a magic low percentage; it is that the percentage is not really the variable. Knowledge attacks succeed through *placement and inference*, not volume. A handful of scattered hints, one well-retrieved document, or a 0.1% slice that outlasts alignment all do damage without needing to dominate the training set.

Sources (4 notes)

How much poisoned training data survives safety alignment?

Denial-of-service, context extraction, and belief manipulation attacks persist through standard safety alignment at 0.1% poisoning rates, while jailbreaking attacks are successfully suppressed, contradicting sleeper agent persistence hypotheses.

Can LLMs reconstruct censored knowledge from scattered training hints?

Language models perform out-of-context reasoning across the full training distribution, reconstructing information never explicitly stated in any single document. Experiments show models can infer city identities from scattered distance relationships and apply them downstream without in-context learning.

Does refusing explicit knowledge harm AI system performance?

AI systems that learn exclusively from data produce uninterpretable representations, inherit statistical biases uncorrected by normative rules, and fail to generalize beyond training distributions. Structured knowledge injection at minimal corpus cost substantially improves performance.

Can we defend RAG systems from corpus poisoning without retraining?

RAGPart and RAGMask provide lightweight, retraining-free defenses that operate at the retrieval layer. RAGPart bounds poisoned-document influence via partitioned retriever learning; RAGMask flags suspicious documents through abnormal similarity collapse under token masking.
