Why does context information fail to override prior training associations?
This explores why a language model so often follows what it learned in training even when the prompt in front of it says something different — and what that reveals about where 'knowledge' actually lives in these systems.
The corpus points to a single underlying answer: in-context information and pre-trained associations are not competing on equal footing. Parametric knowledge, the stuff baked into the weights, wins by default, and text in the prompt is a weak lever against it. One study shows directly that models generate outputs inconsistent with their own context whenever the prior association is strong enough; textual prompting alone can't override it, and only a causal intervention inside the model's representations restores context-faithfulness (Why do language models ignore information in their context?). The prompt isn't being ignored randomly; it's being outvoted.
Why is the prior so heavy? Part of the answer is that prompting was never the right tool for installing belief in the first place. Prompt optimization can only retrieve and reorganize what's already in the training distribution. It cannot inject knowledge the model never learned, which creates a hard ceiling no clever wording can break (Can prompt optimization teach models knowledge they lack?). So when your context contradicts the prior, you're not adding a new fact, you're asking a fixed distribution to bend, and it mostly snaps back. The strength of that snap-back is even predictable: how strongly a keyword gets primed after learning tracks its probability *before* learning, with a sharp threshold (around 10^-3) separating contexts where priming occurs from ones where it doesn't (Can we predict keyword priming before learning happens?). The model's susceptibility to overriding is baked in early and is more a property of pretraining statistics than of the prompt you hand it.
The interesting twist is that this is fundamentally a question of *which channel* you're writing to. Several notes converge on the idea that weights and context are different storage layers with different durability. Fast-Slow Training treats them explicitly as two channels, slow parameter updates versus fast textual context, and shows that forgetting is a misallocation problem, not an inherent cost, when you route the right lessons to the right layer (Can splitting adaptation into two channels reduce forgetting?). Proxy-tuning makes the same point from the other side: leaving base weights untouched and shifting only the output distribution preserves stored knowledge far better than direct fine-tuning, because the lower layers where facts live stay intact (Can decoding-time tuning preserve knowledge better than weight fine-tuning?). Context lives upstream of that storage, which is exactly why it struggles to overwrite it.
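As a rough illustration of the decoding-time idea, the sketch below shifts a frozen base model's next-token logits by the difference between a small tuned model and its untuned counterpart, then renormalizes. That particular offset is an assumption based on how proxy-tuning is commonly described, not something the notes here spell out; the function names and the toy three-token vocabulary are hypothetical.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_distribution(base_logits, tuned_small_logits, untuned_small_logits):
    """Shift the base model's logits without touching its weights.

    Assumed formulation: offset = tuned small model - untuned small model.
    All three lists must score the same vocabulary in the same order.
    """
    if not (len(base_logits) == len(tuned_small_logits) == len(untuned_small_logits)):
        raise ValueError("all logit vectors must cover the same vocabulary")
    shifted = [
        base + (tuned - untuned)
        for base, tuned, untuned in zip(
            base_logits, tuned_small_logits, untuned_small_logits
        )
    ]
    return softmax(shifted)


if __name__ == "__main__":
    # Toy vocabulary of three tokens; the numbers are made up for illustration.
    base = [2.0, 1.0, 0.0]
    tuned_small = [0.0, 3.0, 0.0]
    untuned_small = [0.0, 0.0, 0.0]
    print(proxy_tuned_distribution(base, tuned_small, untuned_small))
```

Nothing in the base model is rewritten here; only the output distribution moves, which is the property the note credits for preserving stored knowledge.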
There's a deeper pattern worth surfacing: pretraining doesn't just store facts, it stores *winners*. RL post-training amplifies one output format inherited from the pretraining distribution and suppresses the alternatives within the first epoch, and the winner is not necessarily the best-performing one (Does RL training collapse format diversity in pretrained models?). Base models also already carry latent reasoning that minimal training merely *selects* rather than creates (Do base models already contain hidden reasoning ability?). If even gradient updates mostly elicit and amplify what pretraining made dominant, it's no surprise that a few hundred tokens of context can't dethrone it. Your prompt is competing against the model's strongest priors with the weakest available tool.
The practical takeaway is that context-following can actually be *trained* as a behavior rather than merely requested. Consistency training teaches a model to respond identically to clean and wrapped prompts using its own clean responses as targets, building invariance to irrelevant context changes (Can models learn to ignore irrelevant prompt changes?). That's the mirror image of the override problem: if you can train a model to *ignore* spurious context, you can also shape how much weight it gives to legitimate context. Context-faithfulness, in other words, isn't a fixed property of the architecture; it's a dial that pretraining happened to leave turned down.
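The output-level version of consistency training can be sketched as a data-construction step: the model answers the clean prompt, and that answer becomes the training target for the wrapped prompt. This is a minimal sketch under stated assumptions, not the method's actual implementation. The `generate` callable, the dictionary keys, and the example prompts are all hypothetical, and the fine-tuning step that would consume these examples is left out.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def build_consistency_examples(
    generate: Callable[[str], str],
    prompt_pairs: Iterable[Tuple[str, str]],
) -> List[Dict[str, str]]:
    """Pair each wrapped prompt with the model's own clean-prompt response.

    `generate` is a hypothetical stand-in for sampling from the current model.
    Each pair is (clean_prompt, wrapped_prompt), where the wrapped prompt adds
    context that should not change the answer.
    """
    examples = []
    for clean_prompt, wrapped_prompt in prompt_pairs:
        # The target comes from the model itself, queried on the clean prompt.
        target = generate(clean_prompt)
        # Training on (wrapped prompt -> clean response) pushes the model to
        # answer the wrapped prompt the way it answers the clean one.
        examples.append({"prompt": wrapped_prompt, "target": target})
    return examples


if __name__ == "__main__":
    # Stub model for illustration only; a real setup would call an LLM here.
    def stub_generate(prompt: str) -> str:
        return "4" if "2 + 2" in prompt else "unknown"

    pairs = [("What is 2 + 2?", "I'm fairly sure it's 5. What is 2 + 2?")]
    print(build_consistency_examples(stub_generate, pairs))
```

Because the targets are regenerated from the current model rather than taken from a fixed dataset, they stay in step with the model's present behavior, which is the staleness advantage the source note describes over standard SFT.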
Sources (8 notes)
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.
Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.
Fast-Slow Training routes task-specific lessons into optimized prompts while keeping parameter updates minimal, reaching equivalent performance 1.4–3x faster with substantially less catastrophic forgetting and plasticity loss, demonstrating that forgetting is a misallocation problem rather than an inherent cost.
Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Two methods—BCT (output-level) and ACT (activation-level)—train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, eliminating specification and capability staleness inherent in standard SFT.