Can we predict keyword priming before learning happens?
Exploring whether the degree to which newly learned keywords contaminate unrelated contexts can be predicted from measurable properties before training begins, and what mechanisms enable this prediction.
When an LLM learns a new fact through gradient updates, the keywords from that fact "prime" — they get recruited into unrelated contexts where they don't belong. Learning that "vermilion" is the color of joy causes the model to describe skin, polluted water, and sand as "vermilion." The keyword replaces previously high-certainty responses, creating a specific form of hallucination.
The central finding: priming is predictable before learning. Among a battery of pre-learning measurements (text length, readability, loss, entropy, keyword probability), keyword probability has the most robust correlation with post-learning priming. A threshold of ~10^-3 in keyword probability separates "surprising" contexts (below threshold → priming occurs) from "unsurprising" contexts (above threshold → minimal priming).
This holds across:
- Different keyword sets
- Model sizes (PaLM-2 XS and S)
- Architectures (PaLM-2, Gemma, Llama), despite their different backbones, training procedures, and data mixtures
- Training stages
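The threshold rule described above can be written as a simple check run before any training step. This is a minimal sketch, not the authors' code: the function names, the context identifiers, and the example probabilities are hypothetical, and the cutoff of 1e-3 is the approximate value quoted in this note.

```python
# Approximate cutoff quoted in the note (~10^-3); treat it as an assumption,
# not a tuned constant.
PRIMING_THRESHOLD = 1e-3


def is_surprising(keyword_prob: float, threshold: float = PRIMING_THRESHOLD) -> bool:
    """Return True when the keyword's pre-learning probability is below the threshold.

    A "surprising" context is one where priming is expected after learning.
    """
    if not 0.0 <= keyword_prob <= 1.0:
        raise ValueError("keyword_prob must be a probability in [0, 1]")
    return keyword_prob < threshold


def flag_samples(keyword_probs: dict) -> list:
    """Return the sorted ids of samples whose keyword probability is below the threshold."""
    return sorted(
        sample_id
        for sample_id, prob in keyword_probs.items()
        if is_surprising(prob)
    )


# Hypothetical pre-learning keyword probabilities, for illustration only.
example_probs = {"ctx_a": 2e-5, "ctx_b": 4e-2, "ctx_c": 9e-4}
flagged = flag_samples(example_probs)
print(flagged)
```

Samples flagged this way are the candidates for mitigation before the update is applied. How the keyword probability itself is measured (which token, which context) is outside this sketch.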
The dynamics of contamination are concerning:
- Just 3 presentations of a single sample, even when spaced 20 minibatches apart, are sufficient to establish the priming relationship
- Two independent facts from different themes create independent priming effects without interference
- Priming is thematically bounded but not eliminated — cross-theme priming is attenuated but still present
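The spaced-presentation setup from the first bullet can be pictured as a schedule over minibatch indices. This is a sketch under assumptions: the note gives only the counts (3 presentations, a spacing of 20 minibatches), so the function names, the zero-based start step, and the placeholder labels are hypothetical.

```python
def presentation_steps(num_presentations: int = 3, spacing: int = 20, start: int = 0) -> list:
    """Return the minibatch indices at which the new sample is presented."""
    if num_presentations < 0 or spacing <= 0:
        raise ValueError("num_presentations must be >= 0 and spacing must be > 0")
    return [start + i * spacing for i in range(num_presentations)]


def build_stream(total_steps: int, steps: list,
                 sample: str = "NEW_FACT", background: str = "BACKGROUND") -> list:
    """Label every minibatch as either the new sample or ordinary background data."""
    step_set = set(steps)
    return [sample if t in step_set else background for t in range(total_steps)]


steps = presentation_steps()           # three presentations, 20 minibatches apart
stream = build_stream(41, steps)       # 41 steps covers indices 0..40
print(steps, stream.count("NEW_FACT"))
```

In a real experiment the background slots would hold ordinary training batches; the labels here only mark where the new sample falls in the sequence.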
Two mitigation techniques reduce priming by 50-95% while preserving learning:
- Stepping-stone text augmentation — modifying the training text to reduce keyword surprise
- Ignore-k update pruning — pruning the most affected parameter updates
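The second technique in the list above can be illustrated on a flat list of parameter updates. This is a minimal sketch and not the authors' implementation: it assumes "most affected" means largest absolute magnitude and that k is given as a percentage of all updates. The function name and the example values are hypothetical.

```python
def ignore_top_k(updates: list, k_percent: float) -> list:
    """Zero out the k percent of updates with the largest absolute magnitude.

    Assumption: the "most affected" updates are the ones with the largest
    magnitude. All other updates are returned unchanged.
    """
    if not 0.0 <= k_percent <= 100.0:
        raise ValueError("k_percent must be in [0, 100]")
    n_drop = int(len(updates) * k_percent / 100)
    if n_drop == 0:
        return list(updates)
    # Indices ordered from largest to smallest magnitude.
    order = sorted(range(len(updates)), key=lambda i: abs(updates[i]), reverse=True)
    dropped = set(order[:n_drop])
    return [0.0 if i in dropped else u for i, u in enumerate(updates)]


# Hypothetical update values, for illustration only.
raw_updates = [0.5, -0.01, 0.02, -0.9, 0.03, 0.001, 0.04, -0.05, 0.06, 0.07]
pruned = ignore_top_k(raw_updates, k_percent=20)
print(pruned)
```

In a real training loop the same masking would be applied to the update tensors before they are added to the weights. The value of k that best balances priming reduction against learning is not given in this note and would need to be tuned.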
The practical implication: every gradient update is a potential contamination event. Because the degree of contamination is predictable before the update is applied, preventive measures can be taken in advance. This connects to "How much poisoned training data survives safety alignment?": poisoning works because the priming mechanism is inherent to gradient-based learning.
Inquiring lines that use this note as a source (52)
- How do training-data priors influence model defaults when context is ambiguous?
- Why does training data saliency distort how models judge meaning?
- How does prompt iteration reinforce user bias without empirical anchoring?
- How does in-context learning trigger phase transitions in model behavior?
- Can context compression preserve what matters without introducing bias?
- Can prompting unlock compositional skills that pretraining already learned?
- Does generalization frequency explain why models favor upward semantic movement?
- Can frame semantics explain why context matters more than word similarity?
- Why does context information fail to override prior training associations?
- Why does keyword priming require only three training exposures to establish?
- Does keyword priming explain why pre-training poisoning persists through alignment?
- Can priming from different facts interfere with each other in the same model?
- How much can mitigation techniques like augmentation reduce priming without harming learning?
- What mechanism makes keyword probability the strongest predictor of priming?
- How would you redesign context integration to prevent prior associations from dominating?
- Why does fine-tuning fail to remove temporal contamination from pretraining?
- Can backward transfer measurements reliably predict optimal multi-task training order?
- How does keyword priming enable language models to spread poisoned information?
- Does foundational model training or user priors more strongly shape final outputs?
- How do model priors enable targeted context queries without full attention?
- Are retrieval heads the mechanistic explanation for needle-in-haystack performance failures?
- Why do pretrained model priors reduce the usefulness of retrieved experience?
- Do all semantic steering effects follow predictable patterns based on feature alignment?
- Why does training data not function as a searchable corpus?
- Why does conceptual priming alone fail to produce consciousness claims?
- Can membership inference attacks reliably detect training data exposure?
- How does dialogue during training shape the ability to ignore word frequency?
- Why does probability of text completion not equal knowledge value?
- How does distributional shift toward rare inputs change memorization reliance?
- Does attention bias explain grounding failure in language models?
- Does representational density emerge from training data exposure during pretraining?
- How does post-training persuasion ability interact with exposure-based decay over time?
- Can implicit association tests reveal LLM biases beneath trained responses?
- Can Q-priming further strengthen clarifying question behavior beyond social meta-learning alone?
- How does co-activation shape which memories become linked together?
- Can data filtering during pretraining prevent cognitive biases in language models?
- Does input surprise drive the implicit recognition of on-policy context?
- What distinguishes data that generalizes broadly from task-specific memorization?
- What role does query-level exposure play in enabling compositional generalization?
- How do training associations override context information in language models?
- Does the pretrained prior actually constrain what internalized search can discover?
- Can retrieval policies learn to use pretraining statistics as decision features?
- Do text-space skills transfer learning across different frontier models?
- Do few-shot examples improve in-context learning or add noise?
- Why do embeddings measure association instead of actual task relevance?
- Why does semantic deduplication reduce memorization in fine-tuned models?
- Can contamination-free evaluation distinguish between memorization and genuine prediction ability?
- How does representational density emerge from training data familiarity?
- Do sample-level similarities between pretraining and downstream tasks explain the frequency effect?
- Can document repetition accidentally memorize sensitive information instead of learning?
- Does latent density emerge during pretraining from training data familiarity?
- What makes content informative and not-yet-mastered for reinforcement during pretraining?
Related concepts in this collection (5)
- How much poisoned training data survives safety alignment?
Explores whether adversarial contamination at 0.1% of pretraining data can persist through post-training safety measures, and which attack types prove most resilient to alignment.
priming is the mechanism; poisoning exploits it; the 3-exposure finding explains why minimal poisoning data suffices
- Why do language models ignore information in their context?
Explores why language models sometimes override contextual information with prior training associations, and whether providing more context can solve this problem.
priming creates new associations that can subsequently override context; the two mechanisms compound
- Does training on AI-generated content permanently degrade model quality?
When generative models train on outputs from previous models, do the resulting models lose rare patterns permanently? The question matters because future training data will inevitably contain synthetic content.
priming and collapse are both consequences of how gradient updates reshape the model's internal distribution
- When do language models stop memorizing and start generalizing?
Can we measure the exact capacity limit where models transition from memorizing training data to learning underlying patterns? Understanding this boundary could reshape how we think about model learning and privacy.
priming is a specific manifestation of how memorization consumes model capacity; the 3-exposure sufficiency finding maps to the low threshold at which capacity fills
- Can we prune training data without hurting model performance?
This explores whether difficulty metrics can identify redundant training examples that can be safely removed. It matters because most datasets contain massive waste — if we can find which examples are truly necessary, we could train better models on far less data.
complementary perspectives on training data efficiency: pruning shows most data is redundant (easy examples removable), while priming shows even minimal data (3 exposures) can disproportionately affect generative behavior; the keyword probability threshold (~10^-3) functions as an implicit difficulty metric
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- How new data permeates LLM knowledge and how to dilute it
- Where to show Demos in Your Prompt: A Positional Bias of In-Context Learning
- Language models show human-like content effects on reasoning tasks
- Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases
- Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs
- Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
- Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs
- Emergent Introspective Awareness in Large Language Models
Original note title
knowledge priming after gradient updates is predictable from keyword probability before learning — and just 3 exposures suffice