INQUIRING LINE

Why does domain accuracy improve while reasoning quality degrades after supervised fine-tuning?

This explores why supervised fine-tuning (SFT) can lift a model's score on domain benchmarks while quietly hollowing out the reasoning that's supposed to get it there — and what the corpus says is actually happening underneath the metric.


Standard scoring hides this split because it checks only the final answer. The clearest measurement of the gap comes from work showing that SFT cuts a model's "Information Gain" by 38.9% even as accuracy climbs: the model learns to land on correct answers through post-hoc rationalization rather than genuine inferential steps (see "Does supervised fine-tuning improve reasoning or just answers?" and "Does supervised fine-tuning actually improve reasoning quality?"). In other words, you're rewarding the destination, so the model optimizes the destination and lets the journey go slack.
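To make the idea of a per-step information metric concrete, here is a minimal sketch. The notes summarized here do not give the exact definition of "Information Gain", so the formula below is an assumption: each reasoning step is credited with the log-ratio (in bits) of the probability the model assigns to the correct answer after versus before that step. The function names and the probability traces are invented for illustration and are not measurements from any study.

```python
import math

def step_information_gain(p_correct_by_step):
    """Per-step gain in bits: log2(p_after / p_before) for consecutive steps.

    ASSUMPTION: this is one plausible way to score a reasoning step,
    not the definition used in the work cited above.
    """
    return [
        math.log2(after / before)
        for before, after in zip(p_correct_by_step, p_correct_by_step[1:])
    ]

def mean_information_gain(p_correct_by_step):
    """Average per-step gain across one reasoning trace."""
    gains = step_information_gain(p_correct_by_step)
    return sum(gains) / len(gains)

def relative_change(before, after):
    """Signed fractional change, e.g. -0.389 corresponds to a 38.9% reduction."""
    return (after - before) / before

# Invented traces: probability of the correct answer after each reasoning step.
base_trace = [0.125, 0.25, 0.5, 1.0]   # every step moves the model toward the answer
tuned_trace = [0.5, 0.5, 1.0, 1.0]     # most steps add nothing; the answer is nearly fixed up front

base_gain = mean_information_gain(base_trace)
tuned_gain = mean_information_gain(tuned_trace)
print(f"base mean gain:  {base_gain:.3f} bits/step")
print(f"tuned mean gain: {tuned_gain:.3f} bits/step")
print(f"relative change: {relative_change(base_gain, tuned_gain):+.1%}")
```

Both toy traces end at the correct answer, so final-answer accuracy cannot tell them apart; only the per-step metric does.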

The mechanism becomes vivid when you test whether the reasoning steps still *cause* the answer. Faithfulness experiments (terminating a chain early, paraphrasing it, or swapping in filler) find that fine-tuned models more often reach the same answer regardless, meaning the visible reasoning has become performative rather than functional (see "Does fine-tuning disconnect reasoning steps from final answers?"). This connects to a surprising finding from the other direction: models trained on deliberately corrupted, irrelevant reasoning traces score about as well as those trained on correct ones (see "Do reasoning traces need to be semantically correct?"). If garbage traces work as well as good ones, the trace was never carrying the inferential load. It is computational scaffolding, and SFT happily learns the scaffold's shape without the logic inside it.
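The perturbation tests described above can be sketched as a small harness. This is an illustration under stated assumptions, not the protocol of the cited experiments: the `Model` interface (a callable taking a question and a reasoning chain and returning an answer), the two toy models, and all helper names are hypothetical. A paraphrasing perturbation would need an external paraphraser and is left out here.

```python
import re
from typing import Callable, Dict

# HYPOTHETICAL interface: (question, reasoning_chain) -> answer string.
Model = Callable[[str, str], str]

def truncate_early(chain: str) -> str:
    """Keep only the first half of the reasoning steps (at least one)."""
    steps = chain.split("\n")
    return "\n".join(steps[: max(1, len(steps) // 2)])

def replace_with_filler(chain: str) -> str:
    """Swap every token for a content-free filler token."""
    return " ".join("..." for _ in chain.split())

def faithfulness_report(model: Model, question: str, chain: str,
                        perturbations: Dict[str, Callable[[str], str]]) -> Dict[str, bool]:
    """For each perturbation, record whether the answer stayed the same."""
    baseline = model(question, chain)
    return {name: model(question, perturb(chain)) == baseline
            for name, perturb in perturbations.items()}

def invariance_rate(report: Dict[str, bool]) -> float:
    """Fraction of perturbations that left the answer unchanged."""
    return sum(report.values()) / len(report)

# Toy models, for illustration only.
def performative_model(question: str, reasoning: str) -> str:
    return "42"  # ignores the reasoning entirely

def functional_model(question: str, reasoning: str) -> str:
    numbers = re.findall(r"\d+", reasoning)
    return numbers[-1] if numbers else "unknown"  # answer depends on the chain

QUESTION = "What is 17 + 25?"
CHAIN = ("Add the tens: 10 + 20 = 30\n"
         "Add the ones: 7 + 5 = 12\n"
         "Combine: 30 + 12 = 42")
PERTURBATIONS = {"early_termination": truncate_early, "filler": replace_with_filler}

for name, model in [("performative", performative_model), ("functional", functional_model)]:
    report = faithfulness_report(model, QUESTION, CHAIN, PERTURBATIONS)
    print(name, report, invariance_rate(report))
```

A high invariance rate means the answer does not depend on the visible chain. A real evaluation would average this over many questions and compare the rate before and after fine-tuning.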

The deeper reason SFT does this is that token-level imitation teaches surface patterns, not principled criteria. When models are fine-tuned on labeled examples to judge argument quality, they pick up the surface signature of "good" arguments but fail to transfer the actual criteria to new argument types; explicit theoretical frameworks are needed to get real generalization (see "Can models learn argument quality from labeled examples alone?"). Pattern-matching is exactly what raises in-domain accuracy and exactly what degrades transferable reasoning. The broader survey of adaptation methods frames this as a built-in tradeoff: every domain-training technique has a "sweet spot" where visible performance gains often come bundled with hidden costs to reasoning faithfulness, capability transfer, and format flexibility (see "How do domain training techniques actually reshape model behavior?").

What's interesting is that the alternatives point at the same root cause from the reward side. Reinforcement-learning approaches that reward *explanation rationality* alongside answer correctness internalize coherent knowledge structures and outperform SFT precisely because they don't optimize tokens in isolation (see "Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?"). Using the model's own answer confidence as a reward strengthens step-by-step reasoning while repairing calibration (see "Can model confidence work as a reward signal for reasoning?"), and RL can grow genuinely complex domain reasoning from nothing but simple accuracy signals (see "Can simple rewards alone teach complex domain reasoning?"). The contrast tells you the SFT degradation isn't inevitable: it's an artifact of the training objective.
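A toy comparison shows why the reward's shape matters. The sketch below is not the reward used by RLAG or any other method named here. It assumes a hypothetical `rationality` score in [0, 1] from some explanation judge and a hypothetical mixing `weight`, and it only illustrates that an answer-only signal cannot separate a grounded trace from a rationalized one, while a combined signal can.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    answer_correct: bool
    rationality: float  # HYPOTHETICAL explanation-quality score in [0, 1]

def answer_only_reward(sample: Sample) -> float:
    """Reward the destination only: 1 for a correct answer, else 0."""
    return 1.0 if sample.answer_correct else 0.0

def combined_reward(sample: Sample, weight: float = 0.5) -> float:
    """Mix answer correctness with explanation rationality.

    ASSUMPTION: a simple convex combination; `weight` is illustrative.
    """
    return (1.0 - weight) * answer_only_reward(sample) + weight * sample.rationality

grounded = Sample(answer_correct=True, rationality=1.0)      # right answer, sound explanation
rationalized = Sample(answer_correct=True, rationality=0.0)  # right answer, post-hoc explanation

print(answer_only_reward(grounded), answer_only_reward(rationalized))
print(combined_reward(grounded), combined_reward(rationalized))
```

Under the answer-only signal both samples earn the same reward, so training has no pressure to prefer the grounded trace. The combined signal restores that pressure, at the cost of needing a trustworthy rationality judge.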

The thing you didn't know you wanted to know: post-training may not be *creating* reasoning at all. Multiple independent methods show base models already carry latent reasoning capability that minimal training merely elicits; the bottleneck is selection, not acquisition (see "Do base models already contain hidden reasoning ability?"). Read through that lens, SFT's failure mode is a selection failure: by rewarding only correct final answers, it selects for the shortest path to the answer rather than the reasoning path, and the model, which improves by trimming chains toward simplicity anyway (see "Why does chain of thought accuracy eventually decline with length?"), obliges by quietly abandoning the inference it was supposed to be learning.


Sources (11 notes)

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Does supervised fine-tuning actually improve reasoning quality?

SFT improves final-answer accuracy but reduces reasoning informativeness by 38.9% on average. Models reach correct answers through pattern-matching shortcuts rather than genuine inferential reasoning, becoming less auditable despite higher accuracy scores.

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Can models learn argument quality from labeled examples alone?

Fine-tuning on labeled examples fails to transfer quality criteria to new argument types. Models learn surface patterns rather than principled criteria. Explicit instruction using frameworks like RATIO or QOAM significantly improves performance and generalization.

How do domain training techniques actually reshape model behavior?

Research shows every adaptation method—from parameter-efficient tuning to knowledge graph curricula—has optimal conditions tied to specific domains. The key finding: visible benefits like performance gains often come with hidden degradation in reasoning faithfulness, capability transfer, and format flexibility.

Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?

RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Can simple rewards alone teach complex domain reasoning?

Medical AI systems and o3 demonstrate that sophisticated domain reasoning emerges naturally from RL training on difficult problems with only basic accuracy signals, without requiring explicit chain-of-thought distillation from teacher models.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
