Does keyword priming explain why pre-training poisoning persists through alignment?
This explores whether one mechanism — keyword priming, where a word's pre-existing probability predicts how easily training can activate it — is the reason poisoned data planted during pre-training survives later safety alignment.
The corpus suggests keyword priming and poisoning persistence are cousins rather than the same thing, both rooted in a deeper fact: what gets laid down during pre-training is sticky, and later training stages mostly nudge rather than rewrite it. The priming work ("Can we predict keyword priming before learning happens?") found that whether a few training exposures can 'switch on' a keyword is predictable from that word's probability before learning ever happens, with a sharp threshold (~10^-3) and as few as three exposures needed. The poisoning work ("How much poisoned training data survives safety alignment?") separately found that denial-of-service, context extraction, and belief manipulation attacks injected at just 0.1% of the data sail through standard alignment, while jailbreaking attacks get scrubbed out. So the honest answer is: priming offers a plausible explanation for the *establishment* of the buried behavior, but it doesn't by itself explain the *selective survival*, that is, why some attacks persist and others don't.
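The threshold result can be made concrete with a minimal sketch. It assumes you can read next-token logits for the context in question before any fine-tuning. The function names and toy logits are hypothetical, and the direction of the rule (priming expected when the keyword's pre-learning probability falls below ~10^-3) is an assumption about the priming work rather than something stated in this note.

```python
import math

# Approximate threshold quoted in the note (~10^-3).
PRIMING_THRESHOLD = 1e-3


def keyword_probability(logits, keyword_index):
    """Softmax probability of the keyword token given raw next-token logits."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    return exps[keyword_index] / sum(exps)


def priming_expected(pre_learning_prob, threshold=PRIMING_THRESHOLD):
    """ASSUMPTION: priming is expected when the keyword is improbable
    before learning, i.e. its probability is below the threshold."""
    return pre_learning_prob < threshold


# Toy three-token vocabulary (made-up logits, for illustration only).
toy_logits = [0.0, 0.0, -10.0]
rare = keyword_probability(toy_logits, 2)    # very unlikely token
common = keyword_probability(toy_logits, 0)  # likely token
print(priming_expected(rare), priming_expected(common))
```

The check needs only the model's state before learning, which is the point of the priming result: the prediction is available before any training exposure happens.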
What ties them together is a recurring theme across the collection: pre-training is where things are decided, and post-training only modulates. The cleanest statement of this is the finding that cognitive biases are planted during pre-training and merely swayed by instruction tuning ("Where do cognitive biases in language models come from?"): models sharing a pre-trained backbone behave alike regardless of what fine-tuning data you pour on top. Read alongside the priming result, you get a coherent picture: a strong pre-training prior is hard to dislodge, whether that prior is a benign bias or a deliberately poisoned association.
The corpus also explains *why* alignment struggles to override these priors. When a model has a strong learned association, in-context information and prompting can't beat it; only causal intervention in the representations works ("Why do language models ignore information in their context?"). And prompting more generally can only reactivate what's already in the training distribution, never inject something new ("Can prompt optimization teach models knowledge they lack?"). Alignment via SFT/RLHF is a heavier hammer than prompting, but it operates in the same regime: it reshapes style and surface behavior more than it rewrites what's stored in the lower layers. That is exactly why decoding-time methods that leave base weights untouched preserve pre-trained knowledge so well ("Can decoding-time tuning preserve knowledge better than weight fine-tuning?").
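To make the decoding-time idea concrete, here is a minimal sketch of the logit arithmetic usually associated with proxy-tuning. It assumes the common formulation in which the base model's next-token logits are shifted by the difference between a small tuned "expert" and its untuned "anti-expert" counterpart; the function names and toy numbers are hypothetical, and this is not any library's API.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_distribution(base_logits, expert_logits, antiexpert_logits):
    """ASSUMED formulation: base + (expert - antiexpert), applied per token.

    The base model's weights are never modified; only its output
    distribution is shifted at decoding time.
    """
    if not (len(base_logits) == len(expert_logits) == len(antiexpert_logits)):
        raise ValueError("all models must share one vocabulary")
    shifted = [
        b + (e - a)
        for b, e, a in zip(base_logits, expert_logits, antiexpert_logits)
    ]
    return softmax(shifted)


# Toy two-token vocabulary (made-up logits, for illustration only).
base = [1.0, 0.0]        # large untuned model prefers token 0
expert = [0.0, 2.0]      # small tuned model prefers token 1
antiexpert = [0.0, 0.0]  # small untuned model is indifferent
probs = proxy_tuned_distribution(base, expert, antiexpert)
print(probs)
```

When the expert and anti-expert agree, the shift is zero and the base distribution comes back unchanged, which mirrors the claim that whatever is stored in the base weights stays as it was.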
The thing you might not have expected to learn: the selectivity is the interesting part. One reading consistent with the corpus is that jailbreaking gets suppressed because alignment directly trains against refusal-bypassing on the surface, where the behavior lives. Belief manipulation and context extraction would then persist because they live deeper in the model's associative wiring, where alignment's gradient pressure barely reaches: plausibly the same depth at which biases get planted and at which keyword priming sets its threshold. Keyword priming is best read not as *the* explanation for poisoning persistence, but as one well-measured instance of the broader pattern the corpus keeps surfacing: behaviors written into pre-training representations are cheap to install and expensive to remove, and everything downstream is negotiating with that prior.
Sources (6 notes)
Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.
Denial-of-service, context extraction, and belief manipulation attacks persist through standard safety alignment at 0.1% poisoning rates, while jailbreaking attacks are successfully suppressed, contradicting sleeper agent persistence hypotheses.
A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.
Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.