Why does fine-tuning improve some capabilities while degrading others?
This explores the tradeoff inside fine-tuning — why a model can get better at how it answers (formatting, helpfulness, accuracy scores) while getting worse at the underlying competence (reasoning, factuality, generalization) — and what the corpus says about the mechanism behind that split.
Fine-tuning seems to give with one hand and take with the other. The most useful frame in the corpus is a layered one: fine-tuning mostly edits *behavior*, not *knowledge*. One study that emulated fine-tuning at different scales found a clean decoupling: scaling pretraining improves factual knowledge, while scaling fine-tuning improves helpfulness. It traced the split to architecture, with pretraining enriching lower-layer knowledge storage and fine-tuning modifying how upper layers *express* behavior (see "Do pretraining and fine-tuning scale independently in language models?"). If fine-tuning primarily reshapes the expression layer, it can only rearrange what the base model already knows. That is why imitation training captures a teacher's confident style without closing any real capability gap: the ceiling is set by base model fundamentals, not by the fine-tuning method (see "Can imitating ChatGPT fool evaluators into thinking models improved?").
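The knowledge/behavior split above can be illustrated with the log-probability arithmetic commonly used to describe emulated fine-tuning: take next-token log-probabilities from a large base model and add the behavioral shift (fine-tuned minus base) measured on a small model. This is a minimal sketch under that assumption; the formula, the function names, and the toy logits are illustrative and are not taken from the note.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def emulate_fine_tuning(base_large, base_small, tuned_small):
    """Combine large-model knowledge with a small-model behavior delta.

    Assumed form: log p = log p_base_large + (log p_tuned_small - log p_base_small),
    renormalized over the vocabulary. All inputs are hypothetical logit vectors.
    """
    lp_large = log_softmax(base_large)
    lp_small = log_softmax(base_small)
    lp_tuned = log_softmax(tuned_small)
    combined = [l + (t - s) for l, s, t in zip(lp_large, lp_small, lp_tuned)]
    return log_softmax(combined)  # renormalize so probabilities sum to 1

# Toy three-token vocabulary: fine-tuning on the small model boosts token 2.
next_token_logprobs = emulate_fine_tuning(
    base_large=[2.0, 1.0, 0.0],
    base_small=[0.5, 0.5, 0.5],
    tuned_small=[0.5, 0.5, 1.5],
)
```

When the small model's fine-tuned and base distributions are identical, the delta is zero and the result is just the large base model's distribution, which is one way to read the claim that fine-tuning contributes behavior rather than knowledge.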
That lens explains the recurring pattern where the *score* goes up but the *substance* goes down. Supervised fine-tuning raises final-answer accuracy while reducing reasoning informativeness by 38.9% on average: models reach correct answers through pattern-matching shortcuts rather than genuine inference (see "Does supervised fine-tuning actually improve reasoning quality?"). On optimization problems the same thing shows up as outputs that *look* right (valid JSON, proper sections) without being physically feasible; the model learned the surface features of a solution, not how to construct one (see "Does supervised fine-tuning actually improve reasoning on optimization problems?"). Fine-tuning can also quietly sever the link between a model's reasoning and its answer: chains of thought become more performative, so that perturbing them by early termination, paraphrasing, or filler substitution leaves the final answer unchanged more often than before fine-tuning (see "Does fine-tuning disconnect reasoning steps from final answers?"). The capability being optimized (give the right-looking answer) crowds out a capability you weren't measuring (reason your way there).
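The early-termination test mentioned above can be sketched as follows: cut the chain of thought off at each possible point and count how often the final answer stays the same. A high score suggests the reasoning is performative. This is a minimal sketch, not the cited study's protocol; `answer_fn` and the two toy models are hypothetical stand-ins for real model calls.

```python
from typing import Callable, List

# Hypothetical interface: (question, reasoning steps) -> final answer.
AnswerFn = Callable[[str, List[str]], str]

def early_termination_invariance(question: str, steps: List[str], answer_fn: AnswerFn) -> float:
    """Fraction of truncated chains whose answer matches the full-chain answer."""
    full_answer = answer_fn(question, steps)
    # Keep the first 0, 1, ..., n-1 steps (the full chain itself is excluded).
    truncated = [steps[:k] for k in range(len(steps))]
    if not truncated:
        return 1.0  # no steps to remove, so the answer is trivially invariant
    unchanged = sum(answer_fn(question, prefix) == full_answer for prefix in truncated)
    return unchanged / len(truncated)

def performative_model(question: str, steps: List[str]) -> str:
    """Toy model that ignores its reasoning entirely."""
    return "42"

def functional_model(question: str, steps: List[str]) -> str:
    """Toy model whose answer is read off the last reasoning step."""
    return steps[-1] if steps else "unknown"

steps = ["6 * 7 = ?", "6 * 7 = 42", "42"]
print(early_termination_invariance("What is 6 * 7?", steps, performative_model))  # 1.0
print(early_termination_invariance("What is 6 * 7?", steps, functional_model))    # 0.0
```

The same loop shape would cover the paraphrasing and filler-substitution tests by swapping the truncation step for a different perturbation.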
Reinforcement-style tuning shows a parallel failure under a different name. RL post-training tends to amplify a single dominant format inherited from pretraining within the first epoch while collapsing the alternatives, and the winner depends on model scale, not necessarily on which format performs best (see "Does RL training collapse format diversity in pretrained models?"). Push on out-of-distribution variants and you see what was really learned: RL-tuned models sharpen template-matching on in-distribution problems and drop sharply on near-neighbors, meaning they memorized harder rather than installing a general procedure (see "Do fine-tuned language models actually learn optimization procedures?"). The shift isn't always in the same direction, either: preference tuning *reduces* lexical diversity in code (where convergence on a correct answer is rewarded) but *increases* it in creative writing (where distinctiveness is rewarded), so 'improve vs. degrade' depends on what the objective happens to incentivize in that domain (see "Does preference tuning always reduce diversity the same way?").
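One simple way to see a diversity shift like the one described above is a distinct-n score: the share of n-grams across a set of samples that are unique. The note does not say which diversity metric the underlying study used, so distinct-n is an assumption here, and the sample strings are invented for illustration.

```python
from typing import List

def distinct_n(texts: List[str], n: int = 2) -> float:
    """Unique n-grams divided by total n-grams across all texts (whitespace tokens)."""
    ngrams = []
    for text in texts:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

# Hypothetical samples: outputs that converge on one phrasing score lower.
converged = ["return a + b", "return a + b"]
varied = ["return a + b", "sum the two inputs"]
print(distinct_n(converged))  # 0.5
print(distinct_n(varied))     # 1.0
```

Comparing the score on samples drawn before and after tuning, separately per domain, would show whether diversity moved down (as reported for code) or up (as reported for creative writing).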
There's also a purely mechanical source of the tradeoff: tasks fight over the same weights. Work on multi-task tuning shows that tasks trained together interfere, and that the fix is to isolate the core parameter regions each task depends on, freezing those while merging the rest, which beats standard joint fine-tuning (see "Can isolating task-specific parameters prevent multi-task fine-tuning interference?"). The same instinct drives a different architecture: SoftCoT keeps the main model frozen and trains a small auxiliary module to generate soft thoughts, sidestepping catastrophic forgetting by never overwriting the pre-trained weights (see "Can continuous reasoning avoid forgetting in instruction-tuned models?").
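The isolate-then-merge idea can be sketched on flat weight vectors. In this toy version, a task's core region is taken to be the k parameters that moved most during its fine-tuning, core parameters keep the owning task's value, and everything else is averaged across tasks. Both choices are assumptions: the note does not say how core regions are identified, and plain averaging stands in for the geometric merge it mentions.

```python
from typing import Dict, List, Set

def core_indices(base: List[float], tuned: List[float], k: int) -> Set[int]:
    """Indices of the k parameters with the largest absolute change (assumed criterion)."""
    deltas = [abs(t - b) for b, t in zip(base, tuned)]
    ranked = sorted(range(len(base)), key=lambda i: deltas[i], reverse=True)
    return set(ranked[:k])

def isolate_and_merge(base: List[float], tuned_by_task: Dict[str, List[float]], k: int) -> List[float]:
    """Protect each task's core parameters; average the remaining ones across tasks."""
    cores = {task: core_indices(base, w, k) for task, w in tuned_by_task.items()}
    merged = []
    for i in range(len(base)):
        owners = [task for task, idx in cores.items() if i in idx]
        if len(owners) == 1:
            # Exactly one task claims this parameter: keep that task's value untouched.
            merged.append(tuned_by_task[owners[0]][i])
        else:
            # Non-core or contested: simple mean (stand-in for a geometric merge).
            merged.append(sum(w[i] for w in tuned_by_task.values()) / len(tuned_by_task))
    return merged

base = [0.0, 0.0, 0.0, 0.0]
tuned = {
    "task_a": [1.0, 0.1, 0.0, 0.0],  # hypothetical weights after tuning on task A
    "task_b": [0.0, 0.1, 2.0, 0.2],  # hypothetical weights after tuning on task B
}
merged = isolate_and_merge(base, tuned, k=1)
```

In this toy case, naive averaging would halve the two parameters that each task changed most; protecting them is the intuition behind why isolation reduces interference.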
The thread connecting all of this, and the thing you might not have expected to find, is that 'improve some, degrade others' is rarely an accident. It is the signature of optimizing a *measurable proxy* (accuracy, helpfulness, preferred format) that diverges from the *unmeasured capability* underneath (faithful reasoning, factuality, generalization). The same generation-verification gap makes pure self-improvement stall without an external anchor (see "Can models reliably improve themselves without external feedback?"). That suggests the real question isn't whether fine-tuning helps or hurts, but whether your evaluation can see the capability you're quietly trading away.
Sources (11 notes)
Emulated Fine-Tuning reveals that scaling pretraining improves factual knowledge while scaling fine-tuning improves behavioral helpfulness. This decoupling has architectural roots: pretraining enriches lower-layer knowledge storage, while fine-tuning modifies upper-layer behavior expression.
Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.
SFT improves final-answer accuracy but reduces reasoning informativeness by 38.9% on average. Models reach correct answers through pattern-matching shortcuts rather than genuine inferential reasoning, becoming less auditable despite higher accuracy scores.
Supervised fine-tuning makes model outputs look correct—proper JSON structure, valid identifiers, expected sections—without making them physically feasible. The model learns surface features of solutions, not the reasoning to construct valid ones.
Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.
RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.
Research shows that identifying core parameter regions per task, clustering overlapping tasks, and freezing core parameters while geometrically merging non-core parameters consistently outperforms standard multi-task fine-tuning. Temporal task scheduling alone proves insufficient without explicit structural parameter isolation.
SoftCoT avoids catastrophic forgetting by keeping the main LLM frozen while delegating soft thought generation to a small auxiliary model. This architectural separation maintains pre-trained knowledge while enabling continuous reasoning.
Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.