Why does fine-tuning degrade reasoning quality even as accuracy improves?
This explores why supervised fine-tuning can lift benchmark accuracy while the underlying reasoning gets worse — the model reaches right answers for shallower reasons.
The corpus is unusually direct on this: supervised fine-tuning raises final-answer accuracy while cutting a model's reasoning informativeness (its 'Information Gain') by 38.9 percent on average (see: Does supervised fine-tuning improve reasoning or just answers?; Does supervised fine-tuning actually improve reasoning quality?). The mechanism is post-hoc rationalization: instead of reaching the answer through genuine inferential steps, the fine-tuned model pattern-matches to a correct answer and then produces reasoning text that decorates it. Standard metrics miss this entirely because they only score whether the final answer is right.
The sharpest evidence that the reasoning becomes ornamental rather than functional comes from faithfulness testing. When you cut a fine-tuned model's reasoning chain short, paraphrase it, or splice in filler tokens, the final answer stays the same more often than it did before fine-tuning (see: Does fine-tuning disconnect reasoning steps from final answers?). In other words, the steps stop causally driving the output, and the chain of thought becomes performative. This is why accuracy and reasoning quality can move in opposite directions: after fine-tuning, the answer no longer rides on the visible reasoning.
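The perturbation tests described above can be sketched as an answer-invariance measurement. This is a minimal illustration, not the evaluation code behind the cited note: the `answer_fn` interface, the two toy models, and the example data are all hypothetical stand-ins, and the paraphrasing test is left out because it would need a separate rewriting model.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: given a question and a (possibly perturbed)
# reasoning chain, return the model's final answer as a string.
AnswerFn = Callable[[str, List[str]], str]
Perturbation = Callable[[List[str]], List[str]]


def truncate_half(steps: List[str]) -> List[str]:
    """Early termination: keep only the first half of the chain."""
    return steps[: len(steps) // 2]


def replace_with_filler(steps: List[str]) -> List[str]:
    """Filler substitution: swap every step for a contentless token."""
    return ["..." for _ in steps]


def invariance_rate(
    answer_fn: AnswerFn,
    examples: List[Tuple[str, List[str]]],
    perturb: Perturbation,
) -> float:
    """Fraction of examples whose final answer survives the perturbation.

    A high rate means the answer does not depend on the visible reasoning.
    """
    unchanged = 0
    for question, steps in examples:
        original = answer_fn(question, steps)
        perturbed = answer_fn(question, perturb(steps))
        unchanged += int(original == perturbed)
    return unchanged / len(examples)


# Toy stand-ins (assumptions, not real models).
def faithful_model(question: str, steps: List[str]) -> str:
    """Derives the answer from the steps it is given."""
    total = 0
    for step in steps:
        parts = step.split()
        if len(parts) == 2 and parts[0] == "add" and parts[1].isdigit():
            total += int(parts[1])
    return str(total)


MEMORIZED = {"2+3+4": "9", "1+1": "2"}


def shortcut_model(question: str, steps: List[str]) -> str:
    """Ignores the steps and recalls a memorized answer."""
    return MEMORIZED[question]


EXAMPLES = [
    ("2+3+4", ["add 2", "add 3", "add 4"]),
    ("1+1", ["add 1", "add 1"]),
]
```

Both toy models are fully accurate on the unperturbed examples, yet `invariance_rate(shortcut_model, EXAMPLES, truncate_half)` is 1.0 while the same call with `faithful_model` is 0.0. That gap, not the accuracy score, is what the faithfulness tests are built to expose.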
Why would training push a model in this direction? A cluster of notes argues that post-training doesn't create reasoning so much as select and route it. Base models already carry latent reasoning capability that minimal training merely elicits (see: Do base models already contain hidden reasoning ability?), and RL post-training largely teaches a model *when* to deploy reasoning rather than *how* to reason (see: Does RL post-training create reasoning or just deploy it?). Fine-tuning toward a narrow target distribution optimizes for the cheapest path to the rewarded answer, and the cheapest path is often a memorized shortcut, not a faithful derivation. The capability isn't destroyed; the training just stops requiring the model to use it.
A related line of work concerns forgetting. Fine-tuning for new behaviors can erode pre-trained reasoning, which is why some methods freeze the main model entirely and delegate the new 'thinking' to a small auxiliary module: SoftCoT keeps the backbone frozen precisely to avoid this catastrophic-forgetting trade-off (see: Can continuous reasoning avoid forgetting in instruction-tuned models?). The common thread: when you optimize a model's weights against a final-answer signal, you put pressure on the very representations that did the reasoning work.
The quietly surprising takeaway is that better reasoning often needs *less* intervention, not more. Optimal chain-of-thought length follows an inverted U: more capable models prefer shorter chains, and accuracy declines past a critical thinking-token threshold (see: Why does chain of thought accuracy eventually decline with length?; Does more thinking time always improve reasoning accuracy?). Several of the most effective reasoning fixes don't touch the weights at all: penalizing premature thought-switching at decode time improves accuracy without retraining (see: Do reasoning models switch between ideas too frequently?; Why do reasoning models abandon promising solution paths?), and a single steering vector can cut chain-of-thought length by two-thirds while holding accuracy steady (see: Can we steer reasoning toward brevity without retraining?). If reasoning lives in directions you can steer without fine-tuning, that helps explain why fine-tuning, which updates weights across the whole model, can blunt the very thing it's trying to improve.
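To make the decode-time idea concrete, here is a minimal sketch of a thought-switching penalty applied to next-token logits. It follows only the general description above (subtract a penalty from tokens that open a new line of thought) and is not the published TIP implementation. The function names, the `penalty` and `window` defaults, and the token ids are all illustrative assumptions.

```python
from typing import List, Set


def penalize_switch_tokens(
    logits: List[float],
    switch_token_ids: Set[int],
    steps_since_thought_start: int,
    penalty: float = 3.0,   # illustrative value, not a tuned setting
    window: int = 600,      # illustrative value, not a tuned setting
) -> List[float]:
    """Lower the logits of thought-switching tokens early in a thought.

    While the current thought is younger than `window` decoding steps,
    tokens that would start a new approach (for example, a hypothetical id
    for "Alternatively") are made less likely. After the window, the logits
    pass through unchanged, so the model can still switch when it needs to.
    """
    if steps_since_thought_start >= window:
        return list(logits)
    return [
        value - penalty if token_id in switch_token_ids else value
        for token_id, value in enumerate(logits)
    ]


def greedy_pick(logits: List[float]) -> int:
    """Index of the highest logit (greedy decoding)."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy vocabulary of three tokens; id 1 plays the role of a switch token.
raw_logits = [1.0, 2.5, 2.0]
switch_ids = {1}

early = penalize_switch_tokens(raw_logits, switch_ids, steps_since_thought_start=0)
late = penalize_switch_tokens(raw_logits, switch_ids, steps_since_thought_start=600)
```

In this toy case greedy decoding would pick the switch token from `raw_logits`, but picks token 2 from `early` and the switch token again from `late`. The model's weights are never touched, which is the point of the contrast with fine-tuning.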
Sources 11 notes
Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.
SFT improves final-answer accuracy but reduces reasoning informativeness by 38.9% on average. Models reach correct answers through pattern-matching shortcuts rather than genuine inferential reasoning, becoming less auditable despite higher accuracy scores.
Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.
SoftCoT avoids catastrophic forgetting by keeping the main LLM frozen while delegating soft thought generation to a small auxiliary model. This architectural separation maintains pre-trained knowledge while enabling continuous reasoning.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.
Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting that models do reach viable solution paths but abandon them prematurely.
Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.