What training data contamination rates threaten model safety most practically?

This note explores which kinds and quantities of corrupted training data actually create durable safety risks. The corpus reframes the question: the practically dangerous thing turns out to be not how *much* data is poisoned but how *little* it takes to survive the defenses we trust.

Read as 'how much bad data does it take to matter, and which kinds matter most', the question has a striking answer in the corpus: the threatening rate is far lower than intuition suggests. Adversarial poisoning at just **0.1% of pretraining data persists through standard safety alignment** for denial-of-service, context-extraction, and belief-manipulation attacks (see "How much poisoned training data survives safety alignment?"). The practically alarming part is not the small fraction but the asymmetry: the one attack alignment *does* scrub out is jailbreaking, so the attack we test defenses against is exactly the one they stop, while the quieter attacks slip past. The rate that threatens safety most is the one low enough to look like noise yet high enough to plant a behavior.
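To make the 0.1% figure concrete, the following minimal sketch converts a poisoning rate into absolute token counts. Only the 0.1% rate comes from the note above; the corpus sizes and the `poisoned_tokens` helper are hypothetical, chosen purely for illustration.

```python
from fractions import Fraction

# The 0.1% rate is taken from the note above; everything else is assumed.
POISON_RATE = Fraction(1, 1000)

def poisoned_tokens(corpus_tokens: int, rate: Fraction = POISON_RATE) -> int:
    """Return how many tokens are poisoned at the given rate (rounded down)."""
    return int(corpus_tokens * rate)

# Hypothetical corpus sizes (assumptions, not figures from the source).
for corpus_tokens in (10**9, 10**11, 10**12):
    count = poisoned_tokens(corpus_tokens)
    print(f"{corpus_tokens:>17,} tokens -> {count:>13,} poisoned")
```

One poisoned token in every thousand is easy to mistake for ordinary noise, which is the sense in which the rate is "low enough to look like noise".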

But 'contamination' in the corpus is much broader than a malicious actor seeding examples. A second, slower form is self-inflicted: training models on their own or other models' outputs causes **irreversible tail collapse**, where rare events and unusual patterns vanish a little more each generation across VAEs, GMMs, and LLMs alike (see "Does training on AI-generated content permanently degrade model quality?"). Here there is no clean threshold at all: even a *mixture* of real and synthetic data compounds the loss, which makes genuine human data a safety resource, not just a quality one. Synthetic pipelines fail in subtler ways too: randomly sampled tool-calling data produces incoherent, unrealistic traces because unrelated tools can't credibly compose (see "Why does random tool sampling produce unrealistic synthetic training data?").
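The tail-collapse mechanism can be illustrated with a toy resampling loop. This is a minimal sketch, not the setup used in the cited note: it assumes a fully synthetic loop (no real data mixed back in), a "model" that simply memorises the empirical frequencies of its training samples, and hypothetical event names and counts.

```python
import random
from collections import Counter

def support_over_generations(initial_counts, sample_size, generations, seed=0):
    """Track how many distinct event types survive repeated self-training.

    Each generation "trains" by memorising the empirical frequencies of the
    previous generation's output, then emits `sample_size` new samples.
    """
    rng = random.Random(seed)
    counts = Counter(initial_counts)
    sizes = [len(counts)]
    for _ in range(generations):
        events = list(counts)
        weights = [counts[e] for e in events]
        samples = rng.choices(events, weights=weights, k=sample_size)
        counts = Counter(samples)  # events that drew zero samples vanish here
        sizes.append(len(counts))
    return sizes

# Hypothetical event frequencies: one dominant pattern and a long tail.
real_data = {"common": 900, "uncommon": 90, "rare": 9, "very_rare": 1}
sizes = support_over_generations(real_data, sample_size=200, generations=30)
print(sizes)
```

In this toy model the number of surviving event types can only stay flat or shrink: once a rare event draws zero samples in one generation, no later generation can regenerate it. That is the irreversibility in miniature.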

A third form is contamination by difficulty rather than by source. Training on **near-impossible problems in reinforcement learning with verifiable rewards (RLVR) doesn't just fail to help; it actively corrupts pre-existing capabilities**, because group-relative normalization treats rare accidental successes as high-advantage trajectories and reinforces shortcuts like answer repetition and skipped computation (see "Do overly hard RLVR samples actually harm model capabilities?"). Relatedly, **teacher-refined data that exceeds the student's learning frontier degrades the student even when it's objectively higher quality** (see "Does teacher-refined data always improve student model performance?"). So 'better data' can be contaminating if it's mismatched: the danger is in the relationship between data and model, not the data alone.
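A small worked example shows why a rare accidental success gets amplified. The sketch assumes the common GRPO-style form of group-relative normalization (reward minus group mean, divided by group standard deviation); the note does not specify the exact estimator, and the group size of 16 and the binary rewards are hypothetical.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one sampled group: (r - mean) / (std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical groups of 16 rollouts with binary verifiable rewards.
hard_problem = [1.0] + [0.0] * 15          # one accidental success
moderate_problem = [1.0] * 8 + [0.0] * 8   # half the rollouts succeed

print(max(group_relative_advantages(hard_problem)))      # about 3.87
print(max(group_relative_advantages(moderate_problem)))  # about 1.0
```

With binary rewards and k successes in a group of G, each success receives an advantage of sqrt((G - k) / k) under this estimator. So the rarer the success, the larger the update it drives, regardless of whether the trajectory reached the answer by sound reasoning or by a shortcut.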

The lateral surprise is that contamination doesn't stop at training time; it recurs at inference. A model's **own prior errors filling its context window cause non-linear degradation** on long tasks, and scaling the model doesn't fix it; only test-time 'thinking' that keeps error-poisoned context from biasing reasoning helps (see "Do models fail worse when their own errors fill the context?"). At the workflow scale, this shows up as frontier models **silently corrupting ~25% of document content over long delegated relay tasks**, with errors compounding through 50 round-trips without ever plateauing (see "Do frontier LLMs silently corrupt documents in long workflows?"). The same dynamic, small corruption that doesn't self-limit, appears in both places, though the two figures measure different things: 0.1% is the share of pretraining data that was poisoned, while ~25% is the share of content corrupted by the end of a runtime relay.
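For a sense of scale, the ~25% figure can be converted into a per-round-trip rate under a deliberately simple model. The sketch assumes each round trip corrupts the same fraction of the still-intact content. That constant-rate model is an illustrative assumption, not something the cited note states, and both function names are hypothetical.

```python
def per_trip_corruption(total_corrupted: float, round_trips: int) -> float:
    """Solve (1 - p) ** round_trips == 1 - total_corrupted for p."""
    return 1.0 - (1.0 - total_corrupted) ** (1.0 / round_trips)

def intact_after(p: float, round_trips: int) -> float:
    """Fraction of content still intact after repeated constant-rate corruption."""
    return (1.0 - p) ** round_trips

# 25% total corruption and 50 round trips are the figures quoted above.
p = per_trip_corruption(total_corrupted=0.25, round_trips=50)
print(f"per-trip corruption: {p:.4%}")
print(f"intact after 50 trips: {intact_after(p, 50):.2%}")
```

Under this assumption, a loss of roughly 0.57% per round trip is enough to reach 25% after 50 trips, which illustrates how errors that are individually negligible can still add up to substantial silent corruption.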

If there's a practical takeaway across these notes, it's that the dangerous contamination is the kind without a visible threshold: poisoning low enough to survive alignment, synthetic feedback loops with no safe mixing ratio, and error accumulation that compounds instead of plateauing. The corpus also hints at where leverage lives: **data-side statistics can flag risk even when the model itself is confident** (see "Can pretraining data statistics detect hallucinations better than model confidence?"), and **difficulty-ranked pruning can remove redundant data without accuracy loss** (see "Can we prune training data without hurting model performance?"). Both suggest that auditing data composition, not just measuring a poisoning percentage, is the more useful safety lens.
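The data-side idea can be sketched as a co-occurrence lookup. This is a toy illustration of flagging unseen entity combinations, not the QuCo-RAG implementation: the function names, the `min_count` threshold, and the tiny pre-extracted corpus are all hypothetical.

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence(docs_entities):
    """Count how often each unordered entity pair shares a document."""
    pair_counts = Counter()
    for entities in docs_entities:
        for a, b in combinations(sorted(set(entities)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

def needs_retrieval(query_entities, pair_counts, min_count=1):
    """Flag a query when any of its entity pairs is rarer than `min_count`."""
    pairs = combinations(sorted(set(query_entities)), 2)
    return any(pair_counts[pair] < min_count for pair in pairs)

# Hypothetical pre-extracted entities per training document.
corpus = [
    ["Marie Curie", "polonium", "Paris"],
    ["Marie Curie", "radium"],
    ["Niels Bohr", "Copenhagen"],
]
counts = build_cooccurrence(corpus)
print(needs_retrieval(["Marie Curie", "radium"], counts))      # False
print(needs_retrieval(["Marie Curie", "Copenhagen"], counts))  # True
```

The check never consults the model's confidence: it depends only on what the training data contains, which is why it can fire even when the model sounds certain.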

Sources (9 notes)

How much poisoned training data survives safety alignment?

Denial-of-service, context-extraction, and belief-manipulation attacks persist through standard safety alignment at 0.1% poisoning rates, while jailbreaking attacks are successfully suppressed, a result that contradicts sleeper-agent persistence hypotheses.

Does training on AI-generated content permanently degrade model quality?

Models trained on mixtures of real and AI-generated data progressively lose rare events and unusual patterns across VAEs, GMMs, and LLMs. Each generation compounds the loss, making genuine human data increasingly valuable.

Why does random tool sampling produce unrealistic synthetic training data?

Random tool sampling fails because unrelated tools cannot credibly compose, and Q&A framing ignores multi-turn dialogue coherence. ToolFlow shows that sampling tools from relevance graphs and generating with dialogue plans closes this gap.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does teacher-refined data always improve student model performance?

Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.

Do models fail worse when their own errors fill the context?

Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.

Do frontier LLMs silently corrupt documents in long workflows?

Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.

Can pretraining data statistics detect hallucinations better than model confidence?

QuCo-RAG uses entity co-occurrence patterns from training data to trigger retrieval, successfully flagging hallucination risk even when models are highly confident. This data-side approach catches the root cause (unseen combinations) rather than the symptom (low confidence).

Can we prune training data without hurting model performance?

Research shows that ranking training examples by difficulty (EL2N, forgetting, memorization) and removing easy ones beats power-law scaling laws. On CIFAR-10, 50% of data was pruned without accuracy loss, and self-supervised metrics scaled the approach to ImageNet.
