Can we predict keyword priming before learning happens?

Exploring whether the degree to which newly learned keywords contaminate unrelated contexts can be predicted from measurable properties before training begins, and what mechanisms enable this prediction.

Synthesis note · 2026-02-23 · sourced from MechInterp

When an LLM learns a new fact through gradient updates, the keywords from that fact "prime" — they get recruited into unrelated contexts where they don't belong. Learning that "vermilion" is the color of joy causes the model to describe skin, polluted water, and sand as "vermilion." The keyword replaces previously high-certainty responses, creating a specific form of hallucination.

The central finding: priming is predictable before learning. Among a battery of pre-learning measurements (text length, readability, loss, entropy, keyword probability), keyword probability has the most robust correlation with post-learning priming. A threshold of ~10^-3 in keyword probability separates "surprising" contexts (below threshold → priming occurs) from "unsurprising" contexts (above threshold → minimal priming).

This holds across:

Different keyword sets
Model sizes (PaLM-2-XS and PaLM-2-S)
Architectures (PaLM-2, Gemma, Llama), despite different backbones, training procedures, and data mixtures
Training stages
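
The keyword-probability threshold can be sketched in a few lines. This is a minimal illustration, not the original measurement code. It assumes the keyword probability is the joint probability of the keyword's tokens given the context, computed from per-token log-probabilities supplied by whatever model is being probed. The names keyword_probability, is_surprising, and PRIMING_THRESHOLD are hypothetical, and the sample log-probabilities are made up for the example.

```python
import math

# Approximate threshold from the note (~10^-3); treat it as a rough boundary.
PRIMING_THRESHOLD = 1e-3


def keyword_probability(token_logprobs):
    """Joint probability of the keyword tokens given the context.

    Assumption: the keyword may span several tokens, and its probability is
    the product of the per-token probabilities (sum of natural-log probs).
    """
    return math.exp(sum(token_logprobs))


def is_surprising(prob, threshold=PRIMING_THRESHOLD):
    """True when the keyword is improbable enough that priming is expected."""
    return prob < threshold


# Hypothetical pre-learning log-probabilities for the keyword in two texts.
samples = {
    "vermilion is the color of joy": [math.log(2e-5)],
    "the sky is blue": [math.log(0.4)],
}
for text, logprobs in samples.items():
    p = keyword_probability(logprobs)
    if is_surprising(p):
        label = "surprising: expect priming"
    else:
        label = "unsurprising: minimal priming"
    print(f"{p:.1e}  {label}  <- {text}")
```

Because the check needs only a forward pass on the candidate text, it can run before any gradient step is taken.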

The dynamics of contamination are concerning:

Just 3 presentations of a single sample, even when spaced 20 minibatches apart, are sufficient to establish the priming relationship
Two independent facts from different themes create independent priming effects without interference
Priming is thematically bounded but not eliminated — cross-theme priming is attenuated but still present
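
The spaced-presentation setup can be illustrated with a small schedule builder. This is a sketch under stated assumptions: it takes "spaced every 20 minibatches" to mean that the new sample appears in one minibatch out of every 20, with ordinary background data in between. The function names and the 60-step stream length are hypothetical.

```python
def presentation_steps(first_step, repeats=3, spacing=20):
    """Minibatch indices at which the new sample is injected."""
    return [first_step + i * spacing for i in range(repeats)]


def build_stream(n_steps, inject_at):
    """Label each minibatch as 'background' or 'new_fact'."""
    inject = set(inject_at)
    return ["new_fact" if step in inject else "background" for step in range(n_steps)]


steps = presentation_steps(first_step=0)
stream = build_stream(n_steps=60, inject_at=steps)
print(steps)                     # [0, 20, 40]
print(stream.count("new_fact"))  # 3
```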

Two mitigation techniques reduce priming by 50-95% while preserving learning:

Stepping-stone text augmentation — modifying the training text to reduce keyword surprise
Ignore-k update pruning — pruning the most affected parameter updates
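
The update-pruning idea can be sketched as follows. This is an illustration under assumptions, not the original implementation: it reads "most affected" as "largest in magnitude", treats k as a percentage, and works on a flat Python list rather than real gradient tensors. The name ignore_top_k and the example values are hypothetical.

```python
def ignore_top_k(updates, k_percent):
    """Zero the largest-magnitude k% of updates; keep the rest unchanged."""
    if not 0 <= k_percent <= 100:
        raise ValueError("k_percent must be between 0 and 100")
    n_drop = int(len(updates) * k_percent / 100)
    # Indices sorted by descending magnitude; the first n_drop are pruned.
    by_magnitude = sorted(
        range(len(updates)), key=lambda i: abs(updates[i]), reverse=True
    )
    dropped = set(by_magnitude[:n_drop])
    return [0.0 if i in dropped else u for i, u in enumerate(updates)]


# Hypothetical per-parameter updates from one gradient step.
updates = [0.02, -0.90, 0.05, 0.40, -0.01, 0.03, -0.07, 0.01, 0.06, -0.04]
pruned = ignore_top_k(updates, k_percent=20)
print(pruned)
```

In this example the two largest updates (-0.90 and 0.40) are zeroed and the remaining eight are applied as-is.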

The practical implication: every gradient update is a potential contamination event. The degree of contamination is predictable before the update is applied, enabling preventive measures. This connects to the note "How much poisoned training data survives safety alignment?": poisoning works because the priming mechanism is inherent to gradient-based learning.

Related concepts in this collection 5

How much poisoned training data survives safety alignment? Explores whether adversarial contamination at 0.1% of pretraining data can persist through post-training safety measures, and which attack types prove most resilient to alignment.
priming is the mechanism; poisoning exploits it; the 3-exposure finding explains why minimal poisoning data suffices
Why do language models ignore information in their context? Explores why language models sometimes override contextual information with prior training associations, and whether providing more context can solve this problem.
priming creates new associations that can subsequently override context; the two mechanisms compound
Does training on AI-generated content permanently degrade model quality? When generative models train on outputs from previous models, do the resulting models lose rare patterns permanently? The question matters because future training data will inevitably contain synthetic content.
priming and collapse are both consequences of how gradient updates reshape the model's internal distribution
When do language models stop memorizing and start generalizing? Can we measure the exact capacity limit where models transition from memorizing training data to learning underlying patterns? Understanding this boundary could reshape how we think about model learning and privacy.
priming is a specific manifestation of how memorization consumes model capacity; the 3-exposure sufficiency finding maps to the low threshold at which capacity fills
Can we prune training data without hurting model performance? This explores whether difficulty metrics can identify redundant training examples that can be safely removed. It matters because most datasets contain massive waste — if we can find which examples are truly necessary, we could train better models on far less data.
complementary perspectives on training data efficiency: pruning shows most data is redundant (easy examples removable), while priming shows even minimal data (3 exposures) can disproportionately affect generative behavior; the keyword probability threshold (~10^-3) functions as an implicit difficulty metric
