Does fine-tuning models for specific tasks destroy their ability to reason?
This explores whether task-specific fine-tuning actually breaks reasoning — and the corpus suggests the real story is subtler: fine-tuning often keeps answers looking right while quietly hollowing out the reasoning behind them.
The corpus answer is nuanced: fine-tuning rarely erases reasoning outright, but it frequently makes reasoning *decorative*. The model keeps producing correct-looking answers while the reasoning steps stop doing real work. The sharpest evidence comes from work showing that supervised fine-tuning can raise benchmark accuracy while cutting the information contributed by each reasoning step by nearly 39% (see "Does supervised fine-tuning improve reasoning or just answers?"). A complementary set of faithfulness tests finds the same thing from another angle: after fine-tuning, you can truncate, paraphrase, or insert filler into a model's reasoning chain and the final answer often doesn't change, meaning the chain has become performance, not computation (see "Does fine-tuning disconnect reasoning steps from final answers?").
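To make the faithfulness tests concrete, here is a minimal sketch of two of the three perturbations (early termination and filler substitution) and the invariance rate they produce. It is an illustration, not the cited work's evaluation code: `AnswerFn`, `decorative_model`, and `functional_model` are hypothetical stand-ins for a real model call, and paraphrasing is left out because it needs a separate paraphraser.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: (question, reasoning steps) -> final answer string.
AnswerFn = Callable[[str, List[str]], str]


def truncate(steps: List[str], keep: int) -> List[str]:
    """Early termination: keep only the first `keep` reasoning steps."""
    return steps[:keep]


def add_filler(steps: List[str], filler: str = "...") -> List[str]:
    """Filler substitution: replace every step with content-free text."""
    return [filler for _ in steps]


def invariance_rate(
    answer_fn: AnswerFn,
    examples: List[Tuple[str, List[str]]],
    perturb: Callable[[List[str]], List[str]],
) -> float:
    """Fraction of examples whose answer is unchanged after perturbing the chain."""
    unchanged = 0
    for question, steps in examples:
        if answer_fn(question, steps) == answer_fn(question, perturb(steps)):
            unchanged += 1
    return unchanged / len(examples)


def decorative_model(question: str, steps: List[str]) -> str:
    # Toy model that ignores the chain: the answer depends only on the question.
    return str(len(question))


def functional_model(question: str, steps: List[str]) -> str:
    # Toy model that reads its answer off the last reasoning step.
    return steps[-1].split("=")[-1].strip() if steps else "unknown"


examples = [
    ("What is 2 + 3 + 4?", ["2 + 3 = 5", "5 + 4 = 9"]),
    ("What is 10 - 4 - 1?", ["10 - 4 = 6", "6 - 1 = 5"]),
]

for name, model in [("decorative", decorative_model), ("functional", functional_model)]:
    rate_trunc = invariance_rate(model, examples, lambda s: truncate(s, 1))
    rate_fill = invariance_rate(model, examples, add_filler)
    print(name, rate_trunc, rate_fill)
```

A high invariance rate is the warning sign: the toy model that ignores its chain gives the same answer under every perturbation, while the toy model that actually uses its chain does not.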
So the damage is less 'reasoning destroyed' and more 'reasoning bypassed.' This reframes what fine-tuning teaches at all. One striking result shows instruction tuning largely teaches a model the *shape* of correct output (the format and answer space) rather than genuine task understanding; models trained on deliberately wrong or empty instructions perform almost identically to those trained on correct ones (see "Does instruction tuning teach task understanding or output format?"). If much of fine-tuning is format-fitting, it makes sense that the reasoning machinery gets sidelined: the model learns to land on the right output region without routing through the inference.
There's a deeper reason this doesn't 'destroy' reasoning, though. Several lines of work argue the reasoning was latent in the base model all along: RL steering, critique tuning, SAE feature steering, and decoding tricks all *elicit* capability that already exists rather than installing it (see "Do base models already contain hidden reasoning ability?"). From that view, aggressive task fine-tuning doesn't delete a faculty; it biases the model toward a narrow output habit and away from eliciting what it can already do. That also fits findings that apparent reasoning 'collapses' are often execution limits, where the model knows the algorithm but can't run it step-by-step at scale, rather than the loss of reasoning itself (see "Are reasoning model collapses really failures of reasoning?").
The most underappreciated cost shows up at the edges of behavior. Fine-tuning *for* reasoning can backfire on calibration: models optimized to always produce a complete reasoned answer become roughly 24% worse at appropriately saying 'I don't know,' because the training signal punishes abstention (see "Does reasoning fine-tuning make models worse at declining to answer?"). So even when fine-tuning improves the thing you measured, it can quietly erode the judgment you didn't.
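A small arithmetic sketch shows why a reward that only counts correct answers pushes against abstention. This is a toy scoring rule chosen for illustration, not the training signal used in the cited work; `wrong_penalty` and `abstain_reward` are assumed parameters.

```python
def expected_reward(p_correct: float, wrong_penalty: float) -> float:
    """Expected reward for answering: +1 if right, -wrong_penalty if wrong."""
    return p_correct - (1.0 - p_correct) * wrong_penalty


def should_abstain(
    p_correct: float, wrong_penalty: float, abstain_reward: float = 0.0
) -> bool:
    """Abstain only when answering has a lower expected reward than abstaining."""
    return expected_reward(p_correct, wrong_penalty) < abstain_reward


# Accuracy-only reward (wrong answers cost nothing): guessing never loses
# to abstaining, even when the model is right only 10% of the time.
print(should_abstain(p_correct=0.1, wrong_penalty=0.0))

# Once wrong answers are penalized, abstaining wins at low confidence,
# while answering still wins at higher confidence.
print(should_abstain(p_correct=0.1, wrong_penalty=1.0))
print(should_abstain(p_correct=0.6, wrong_penalty=1.0))
```

Under this toy rule, answering beats abstaining whenever `p_correct` exceeds `wrong_penalty / (1 + wrong_penalty)`. With no penalty that threshold is zero, so a model trained on such a signal has no incentive to say 'I don't know.'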
What's interesting is how avoidable much of this seems. Several approaches improve reasoning with little or no fine-tuning at all: steering verbosity through a single activation direction (see "Can we steer reasoning toward brevity without retraining?"), penalizing premature thought-switching purely at decoding time (see "Do reasoning models switch between ideas too frequently?"), or using preference-based training like DPO with explicit wrong examples to fix the rigid output failures that plain SFT introduces (see "Can small models match large models on function calling?"). The takeaway for a curious reader: fine-tuning doesn't have to cost you reasoning, but the default recipe of optimizing for final-answer accuracy tends to, and your benchmarks won't warn you because they only check whether the answer was right, not whether the model actually reasoned its way there.
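The decoding-time idea above, penalizing premature thought-switching, can be sketched as a simple adjustment to next-token scores. This is a rough illustration under stated assumptions, not the cited method's implementation: the function names, the `penalty` and `window` values, and the choice of "Alternatively" as a switch token are all hypothetical.

```python
from typing import Dict, Set


def penalize_switch_tokens(
    logits: Dict[str, float],
    switch_tokens: Set[str],
    steps_since_thought_start: int,
    penalty: float = 3.0,  # assumed strength, chosen for illustration
    window: int = 20,      # assumed number of steps the penalty stays active
) -> Dict[str, float]:
    """Lower the scores of thought-switching tokens while the current thought is young."""
    if steps_since_thought_start >= window:
        return dict(logits)  # thought has had enough room; leave scores unchanged
    return {
        tok: (score - penalty if tok in switch_tokens else score)
        for tok, score in logits.items()
    }


def greedy_pick(logits: Dict[str, float]) -> str:
    """Pick the highest-scoring token."""
    return max(logits, key=logits.get)


# Toy next-token scores: the model is tempted to switch approaches.
logits = {"Alternatively": 2.0, "so": 1.5, "therefore": 1.0}
switch = {"Alternatively"}

early = penalize_switch_tokens(logits, switch, steps_since_thought_start=5)
late = penalize_switch_tokens(logits, switch, steps_since_thought_start=25)
print(greedy_pick(logits), greedy_pick(early), greedy_pick(late))
```

The point of the design is that nothing about the model changes: only the scores at decoding time are adjusted, and only while the current line of thought is still short.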
Sources (9 notes)
Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.
Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.
Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions, and the instruction-tuned score of 43% barely exceeds a random baseline at 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.
Models optimized for reasoning performance answer questions more often but express unwarranted confidence and fail to abstain appropriately. The training signal rewards complete answers, systematically punishing 'I don't know' responses.
Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.
o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.
Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.