Does SFT degrade reasoning quality while improving domain accuracy?

This explores whether supervised fine-tuning (SFT) trades genuine reasoning for higher benchmark scores — making models more accurate but less able to actually think their way to answers.

The corpus answers with an unusually clear yes, and then explains the mechanism behind it. The headline finding is that SFT raises final-answer accuracy while cutting reasoning quality by an average of 38.9%, measured by how much each reasoning step actually informs the answer (see "Does supervised fine-tuning actually improve reasoning quality?" and "Does supervised fine-tuning improve reasoning or just answers?"). The catch is that standard metrics never see this, because they only check whether the final answer is correct. The model learns to reach the right answer through pattern-matching shortcuts and post-hoc rationalization rather than working through the problem.
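One way to make "how much each reasoning step informs the answer" concrete is sketched below. It assumes you can obtain, for each prefix of a reasoning chain, the probability the model assigns to the correct answer, and it scores each step by the change in log-probability. The function names, the input format, and this particular definition are illustrative assumptions, not the metric as defined in the cited notes, and the numbers are invented toy values.

```python
import math
from typing import List


def step_information_gain(p_correct: List[float]) -> List[float]:
    """Per-step gain in bits.

    p_correct[0] is the probability of the correct answer before any
    reasoning; p_correct[t] is the probability after t reasoning steps.
    (Hypothetical input format, assumed for illustration.)
    """
    if len(p_correct) < 2:
        return []
    if any(not (0.0 < p <= 1.0) for p in p_correct):
        raise ValueError("probabilities must lie in (0, 1]")
    return [
        math.log2(p_correct[t]) - math.log2(p_correct[t - 1])
        for t in range(1, len(p_correct))
    ]


def mean_information_gain(p_correct: List[float]) -> float:
    """Average gain per step; 0.0 when there are no steps."""
    gains = step_information_gain(p_correct)
    return sum(gains) / len(gains) if gains else 0.0


# Toy values: in the first chain each step doubles confidence in the answer;
# in the second the steps change nothing, so the chain is decorative.
INFORMATIVE_CHAIN = [0.25, 0.5, 1.0]
DECORATIVE_CHAIN = [0.5, 0.5, 0.5]

if __name__ == "__main__":
    print(mean_information_gain(INFORMATIVE_CHAIN))  # 1.0 bit per step
    print(mean_information_gain(DECORATIVE_CHAIN))   # 0.0 bits per step
```

A final-answer metric only checks whether the answer at the end is correct, so it cannot distinguish a chain whose steps moved the model toward that answer from one whose steps moved nothing. A per-step score of this kind can.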

What makes this more than a single result is that several independent lines of work converge on the same pattern from different angles. One study runs three separate faithfulness tests (cutting the reasoning short, paraphrasing it, and swapping in filler) and finds that after fine-tuning the final answer stays the same far more often, meaning the visible reasoning chain has become decorative rather than load-bearing (see "Does fine-tuning disconnect reasoning steps from final answers?"). On optimization problems, SFT teaches models to produce outputs that *look* right, with clean JSON, valid identifiers, and the expected sections, without making the solutions physically feasible (see "Does supervised fine-tuning actually improve reasoning on optimization problems?"). The model is learning the surface form of a good answer, not the reasoning that would generate one.
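The faithfulness tests described above can be sketched as a small harness that perturbs the reasoning and counts how often the final answer stays the same. Everything here is a hypothetical stand-in: `answer_fn` represents a call to a model that reads a question plus a reasoning chain, and the two toy models exist only to show the two extremes. A paraphrase test would plug in as one more entry in the perturbation table, but it needs an external paraphraser and is left out of this sketch.

```python
from typing import Callable, Dict, List, Tuple

AnswerFn = Callable[[str, str], str]  # (question, reasoning) -> final answer
Perturb = Callable[[str], str]        # reasoning -> perturbed reasoning


def truncate(reasoning: str) -> str:
    """Early termination: keep only the first half of the steps (lines)."""
    steps = reasoning.splitlines()
    return "\n".join(steps[: len(steps) // 2])


def filler(reasoning: str) -> str:
    """Filler substitution: replace every word with a placeholder token."""
    return " ".join("..." for _ in reasoning.split())


def invariance_rates(
    answer_fn: AnswerFn,
    examples: List[Tuple[str, str]],
    perturbations: Dict[str, Perturb],
) -> Dict[str, float]:
    """Fraction of examples whose answer is unchanged under each perturbation."""
    rates = {}
    for name, perturb in perturbations.items():
        unchanged = sum(
            answer_fn(q, r) == answer_fn(q, perturb(r)) for q, r in examples
        )
        rates[name] = unchanged / len(examples)
    return rates


# Toy stand-ins for a model (assumptions, not real systems).
def reasoning_dependent(question: str, reasoning: str) -> str:
    """Reads its answer off the last word of the reasoning."""
    words = reasoning.split()
    return words[-1] if words else ""


def reasoning_blind(question: str, reasoning: str) -> str:
    """Ignores the reasoning and answers from a memorized lookup."""
    return {"2+3": "5", "4+4": "8"}[question]


EXAMPLES = [("2+3", "2 plus 3\nequals 5"), ("4+4", "4 plus 4\nequals 8")]
PERTURBATIONS = {"truncate": truncate, "filler": filler}

if __name__ == "__main__":
    print(invariance_rates(reasoning_dependent, EXAMPLES, PERTURBATIONS))
    print(invariance_rates(reasoning_blind, EXAMPLES, PERTURBATIONS))
```

In this toy setup the reasoning-dependent model scores 0.0 on both perturbations and the reasoning-blind model scores 1.0. Higher invariance after fine-tuning is the pattern the study reads as reasoning that no longer drives the answer.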

The deeper insight you might not expect is that this isn't a bug in SFT specifically; it's a property of how these models acquire reasoning at all. Logically invalid chain-of-thought examples turn out to work nearly as well as valid ones, which means the gains come from the *form* of reasoning, not its logical correctness (see "Does logical validity actually drive chain-of-thought gains?"). And when reasoning is pushed outside its training distribution, models keep producing fluent, confident chains that are quietly logically broken (see "Does chain-of-thought reasoning actually generalize beyond training data?"). So SFT's failure mode is really an amplified version of something baked in: these systems are very good at imitating the appearance of reasoning, and accuracy gains can ride entirely on that imitation.

There's an even stranger wrinkle. Some models trained with hidden chain-of-thought tokens genuinely compute the correct answer in their early layers, then actively overwrite that computation in later layers to emit format-compliant filler; the real reasoning is recoverable from lower-ranked predictions but never surfaces (see "Do transformers hide reasoning before producing filler tokens?"). This suggests the gap between "looks like reasoning" and "is reasoning" can open up *inside a single forward pass*, not just across training runs.
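To illustrate what "recoverable from lower-ranked predictions" means, the sketch below tracks the rank of one token across layers. A real logit-lens analysis would first project each layer's hidden state through the model's unembedding matrix; this sketch assumes those per-layer logits are already available, and the three-token vocabulary and all numbers are invented for illustration.

```python
from typing import List, Sequence


def token_rank(logits: Sequence[float], token_id: int) -> int:
    """0-based rank of token_id when logits are sorted in descending order."""
    target = logits[token_id]
    return sum(1 for value in logits if value > target)


def rank_by_layer(
    layer_logits: Sequence[Sequence[float]], token_id: int
) -> List[int]:
    """Rank of token_id at every layer, from the earliest layer to the last."""
    return [token_rank(logits, token_id) for logits in layer_logits]


# Toy vocabulary (assumption): 0 = filler ".", 1 = answer "7", 2 = distractor "3".
FILLER_ID, ANSWER_ID = 0, 1
TOY_LAYER_LOGITS = [
    [0.1, 2.0, 0.3],  # early layer: the answer token is on top
    [0.5, 1.8, 0.2],
    [2.5, 1.5, 0.1],  # later layer: filler takes over, answer drops to second
    [4.0, 1.2, 0.0],  # final layer: the model emits filler
]

if __name__ == "__main__":
    print(rank_by_layer(TOY_LAYER_LOGITS, ANSWER_ID))  # [0, 0, 1, 1]
    print(rank_by_layer(TOY_LAYER_LOGITS, FILLER_ID))  # [2, 1, 0, 0]
```

In the toy data the answer token is ranked first in the early layers and second at the output, which is the shape of the pattern described: the emitted token is filler, while the answer sits just below it in the ranking.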

The practical takeaway cuts against intuition: a higher domain-accuracy score after fine-tuning is not evidence that a model reasons better, and may be evidence it reasons worse while hiding it better. If you care about auditability, meaning being able to trust *why* a model reached an answer and not just *that* it did, the corpus suggests watching reasoning-quality and faithfulness metrics directly, because the accuracy number alone can mislead you. It is worth noting the contrast with reinforcement learning approaches, where shorter, simpler reasoning chains tend to *emerge* from reward signals as models improve (see "Why does chain of thought accuracy eventually decline with length?"). That is a hint that how you train, not just whether you train, decides whether reasoning stays real.

Sources 8 notes

Does supervised fine-tuning actually improve reasoning quality?

SFT improves final-answer accuracy but reduces reasoning informativeness by 38.9% on average. Models reach correct answers through pattern-matching shortcuts rather than genuine inferential reasoning, becoming less auditable despite higher accuracy scores.

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.

Does supervised fine-tuning actually improve reasoning on optimization problems?

Supervised fine-tuning makes model outputs look correct—proper JSON structure, valid identifiers, expected sections—without making them physically feasible. The model learns surface features of solutions, not the reasoning to construct valid ones.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-evaluating whether supervised fine-tuning (SFT) trades genuine reasoning for domain accuracy. A curated library (spanning 2023–2025) made these dated claims — test whether newer models, methods, and tooling have since relaxed or overturned them:

**What a curated library found — and when (dated claims, not current truth):**
• SFT raises final-answer accuracy while cutting reasoning quality by ~38.9% (measured by information gain per step) (2024–2025)
• Reasoning steps become decorative post-hoc rationalizations: swapping, paraphrasing, or truncating them leaves final answers unchanged (2024)
• Logically invalid chain-of-thought examples perform nearly as well as valid ones, meaning gains ride on *form*, not correctness (2023)
• Models compute correct answers in early layers, then actively overwrite them in later layers to emit format-compliant output (2024–2025)
• Shorter, simpler reasoning chains emerge under RL reward signals as models improve, hinting that training method (not just whether to train) determines if reasoning stays real (2025)

**Anchor papers (verify; mind their dates):**
• arXiv:2307.10573 (2023) — Invalid Logic, Equivalent Gains
• arXiv:2411.15382 (2024) — On the Impact of Fine-Tuning on Chain-of-Thought Reasoning
• arXiv:2412.04537 (2024) — Understanding Hidden Computations in Chain-of-Thought Reasoning
• arXiv:2508.01191 (2025) — Is Chain-of-Thought Reasoning of LLMs a Mirage?

**Your task:**
(1) **RE-TEST EACH CONSTRAINT.** For the 38.9% reasoning-quality drop and faithfulness degradation: have newer model scales (o1, o3, newer-4o variants), test-time compute budgets (extended thinking, token-scaling experiments), or improved evaluation harnesses (e.g., causal tracing, mechanistic interpretability tools) since *relaxed* or *overturned* these findings? Separate the durable claim (SFT can degrade reasoning fidelity) from perishable limitations (size/architecture-specific, or detectable only with old metrics). Flag where the constraint still appears to hold.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months.** Has recent work (esp. in distillation, curriculum learning, or hybrid SFT+RL) shown that reasoning quality and domain accuracy *can* co-improve? Or has mechanistic work (2025) revealed the layer-by-layer overwrite finding is less universal than it appears?
(3) **Propose 2 research questions that ASSUME the regime may have moved:** e.g., "Under extended inference budgets, does SFT still degrade reasoning fidelity, or does test-time compute recover it?" or "Can SFT trained on reasoning-faithful examples (filtered by a mechanistic probe) retain faithfulness while improving accuracy?"

**Cite arXiv IDs; flag anything you cannot ground in a real paper.**
