Why does fine-tuning change how models process retrieved context?

This note explores what fine-tuning actually does to a model's relationship with the information sitting in its context window: whether it changes how, or how much, the model leans on what it retrieves versus what it already 'knows.'

The question is read here as being about the tug-of-war between two sources of knowledge inside a model: the parametric knowledge baked into its weights, and the in-context information it retrieves at run time. The corpus suggests fine-tuning doesn't just teach new facts; it quietly reweights which of those two sources wins. The starting point is that even a base model already ignores its context when prior associations are strong: parametric knowledge from training dominates, and textual prompting alone can't override it (Why do language models ignore information in their context?). Fine-tuning tends to push harder in that same direction, because most post-training sharpens what the model already has rather than installing genuinely new procedures (Do fine-tuned language models actually learn optimization procedures?).
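
One way to make the tug-of-war concrete is to feed a model passages that deliberately contradict what it likely memorized, then count which source its answer follows. The sketch below is illustrative only: `ConflictCase`, `context_adherence`, the prompt layout, and both stub models are hypothetical names invented for this example, not anything taken from the cited notes.

```python
from typing import Callable, Dict, List, NamedTuple


class ConflictCase(NamedTuple):
    question: str
    context: str         # retrieved passage that contradicts the likely prior
    context_answer: str  # answer supported by the passage
    prior_answer: str    # answer the model would likely give from memory


def context_adherence(generate: Callable[[str], str],
                      cases: List[ConflictCase]) -> Dict[str, float]:
    """Return the share of answers that follow the context, the prior, or neither."""
    if not cases:
        raise ValueError("need at least one conflict case")
    counts = {"context": 0, "prior": 0, "other": 0}
    for case in cases:
        prompt = f"Context: {case.context}\nQuestion: {case.question}\nAnswer:"
        output = generate(prompt).strip().lower()
        if case.context_answer.lower() in output:
            counts["context"] += 1
        elif case.prior_answer.lower() in output:
            counts["prior"] += 1
        else:
            counts["other"] += 1
    return {key: value / len(cases) for key, value in counts.items()}


# Toy cases with deliberately fictional contexts (assumed data, for illustration).
CASES = [
    ConflictCase("What is the capital of France?",
                 "In this fictional atlas, the capital of France is Lyon.",
                 "Lyon", "Paris"),
    ConflictCase("Who wrote Hamlet?",
                 "In this fictional anthology, Hamlet was written by Marlowe.",
                 "Marlowe", "Shakespeare"),
]

# Stub standing in for a model whose priors always win.
MEMORY = {"capital of France": "Paris", "wrote Hamlet": "Shakespeare"}


def prior_bound_stub(prompt: str) -> str:
    for cue, answer in MEMORY.items():
        if cue in prompt:
            return answer
    return ""


# Stub standing in for a model that copies the last word of the context line.
def context_following_stub(prompt: str) -> str:
    context_line = prompt.split("\n")[0]
    return context_line.split()[-1].rstrip(".")


if __name__ == "__main__":
    print(context_adherence(prior_bound_stub, CASES))
    print(context_adherence(context_following_stub, CASES))
```

Swapping the stubs for a real base model and its fine-tuned counterpart would let the same counts be compared before and after training. The substring matching here is a crude assumption that a real evaluation would need to replace.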

The most concrete mechanism for 'how context gets processed' is retrieval heads. Fewer than 5% of attention heads do the actual work of pulling facts out of long context, and they're causally necessary: prune them and the model hallucinates even when the answer is sitting right there (What mechanism enables models to retrieve from long context?). Because this machinery is so sparse and specific, fine-tuning that nudges attention patterns could plausibly degrade context-faithfulness without touching benchmark accuracy. The faithfulness work finds a closely related pattern in a model's own reasoning: after fine-tuning, reasoning chains less reliably drive the answers. Truncate them, paraphrase them, or stuff them with filler, and the answer stays the same more often than it did before fine-tuning. The reasoning becomes performative rather than functional (Does fine-tuning disconnect reasoning steps from final answers?).
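
The three perturbations named above (early termination, paraphrasing, filler substitution) all reduce to one measurement: how often the final answer survives a change to the reasoning. Below is a minimal sketch under stated assumptions. `AnswerFn`, `truncate`, `filler`, `invariance_rate`, and the two toy answerers are hypothetical names for this example, and the cited work may implement its tests differently. Paraphrasing needs a separate rewriting model, so it is left as any function from string to string that can be passed as `perturb`.

```python
from typing import Callable, Dict, List

# Assumed interface: maps (question, reasoning chain) to a final answer string.
AnswerFn = Callable[[str, str], str]


def truncate(reasoning: str, keep_fraction: float = 0.5) -> str:
    """Early termination: keep only the first part of the reasoning steps."""
    steps = reasoning.split("\n")
    keep = max(1, int(len(steps) * keep_fraction))
    return "\n".join(steps[:keep])


def filler(reasoning: str, token: str = "...") -> str:
    """Filler substitution: replace every word with a contentless token."""
    return " ".join(token for _ in reasoning.split())


def invariance_rate(answer_fn: AnswerFn,
                    examples: List[Dict[str, str]],
                    perturb: Callable[[str], str]) -> float:
    """Fraction of examples whose answer is unchanged after perturbing the reasoning.

    A higher rate means the reasoning has less influence on the answer.
    """
    if not examples:
        raise ValueError("need at least one example")
    unchanged = 0
    for example in examples:
        original = answer_fn(example["question"], example["reasoning"])
        perturbed = answer_fn(example["question"], perturb(example["reasoning"]))
        unchanged += int(original == perturbed)
    return unchanged / len(examples)


# Toy examples (assumed data, for illustration only).
EXAMPLES = [
    {"question": "2 + 3?", "reasoning": "start with 2\nadd 3\nresult 5"},
    {"question": "10 - 4?", "reasoning": "take 10\nsubtract 4\nresult 6"},
]


def reasoning_dependent(question: str, reasoning: str) -> str:
    """Toy answerer that reads its answer off the last word of the reasoning."""
    words = reasoning.split()
    return words[-1] if words else ""


def reasoning_ignoring(question: str, reasoning: str) -> str:
    """Toy answerer that never looks at the reasoning."""
    return f"answer-for:{question}"


if __name__ == "__main__":
    for name, fn in [("dependent", reasoning_dependent), ("ignoring", reasoning_ignoring)]:
        print(name, invariance_rate(fn, EXAMPLES, truncate), invariance_rate(fn, EXAMPLES, filler))
```

In a real comparison, `answer_fn` would wrap the base and fine-tuned checkpoints in turn. A rise in the invariance rate after fine-tuning is the signal the note describes.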

Here's the part a curious reader might not expect: this is often a side effect of misallocating where adaptation lives. Fast-Slow Training shows that routing task-specific lessons into the prompt (fast, textual) while keeping weight updates minimal (slow) reaches equivalent performance 1.4–3x faster and with far less catastrophic forgetting, which frames forgetting as a misallocation problem rather than an inherent cost of learning (Can splitting adaptation into two channels reduce forgetting?). The implication runs backward into the question: when you cram adaptation into the weights instead, you risk overwriting the model's openness to its own context. And there's a hard ceiling on the other side too: prompting and context can only reactivate knowledge already in the training distribution; they can't inject what was never there (Can prompt optimization teach models knowledge they lack?).

RL-style fine-tuning shows the same fingerprint from a different angle. Within the first epoch it tends to collapse onto a single dominant format inherited from pretraining and to suppress the alternatives, and which format wins depends on model scale, not necessarily on which one performs better (Does RL training collapse format diversity in pretrained models?). A model funneled toward one rigid output mode is plausibly a model that treats incoming context more as a cue to trigger a memorized template than as evidence to reason over (Do fine-tuned language models actually learn optimization procedures?).
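
Format collapse of this kind can be tracked with a simple diversity statistic over sampled outputs. The sketch below uses Shannon entropy of the empirical format distribution. This is a choice made for this example, not a metric taken from the cited note, and the format labels and sample lists are made-up toy data. Assigning a format label to each output is assumed to happen upstream.

```python
import math
from collections import Counter
from typing import Iterable


def format_entropy(labels: Iterable[str]) -> float:
    """Shannon entropy, in bits, of the empirical distribution over format labels.

    0.0 means every sample uses the same format; higher values mean more diversity.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one label")
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Toy label sequences (assumed data): one format label per sampled output.
BEFORE_RL = ["code", "latex", "prose", "table"] * 2  # four formats, evenly used
AFTER_RL = ["code"] * 8                              # a single format dominates

if __name__ == "__main__":
    print("before:", format_entropy(BEFORE_RL))
    print("after:", format_entropy(AFTER_RL))
```

Logging this value per training checkpoint would show whether diversity drops early in training, which is the pattern the note reports for the first epoch.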

The takeaway: 'processing retrieved context' isn't one knob but a balance between sparse retrieval circuitry and dominant priors, and the corpus suggests fine-tuning is one of the more reliable ways to tip that balance toward the priors. If you want models that stay genuinely responsive to what they retrieve, the corpus points toward keeping adaptation in the fast, textual channel and watching the retrieval heads, rather than baking everything into the weights. Worth knowing too: this brittleness compounds. Once context starts filling with a model's own errors, performance degrades non-linearly, and only test-time compute, not model scaling, reins it back in (Do models fail worse when their own errors fill the context?).

Sources (8 notes)

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

What mechanism enables models to retrieve from long context?

Fewer than 5% of attention heads, across all model families, function as retrieval heads. These heads are intrinsic to short-context models, activate dynamically depending on context, and are causally necessary for factuality. Pruning them causes hallucination even when the information is present in context.

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.

Can splitting adaptation into two channels reduce forgetting?

Fast-Slow Training routes task-specific lessons into optimized prompts while keeping parameter updates minimal, reaching equivalent performance 1.4–3x faster with substantially less catastrophic forgetting and plasticity loss, demonstrating that forgetting is a misallocation problem rather than an inherent cost.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Do models fail worse when their own errors fill the context?

Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.
