How does data quality mismatch create reasoning degradation in supervised fine-tuning?

This explores a counterintuitive finding in the corpus: that fine-tuning can make a model's answers look better while making its actual reasoning worse — and what role the *content* and *difficulty* of training data plays in that gap.

The sharpest version of the problem is what you might call the accuracy trap. Supervised fine-tuning reliably raises benchmark scores while quietly cutting the quality of the reasoning steps that get there, by roughly 39% on an "information gain" measure (see "Does supervised fine-tuning improve reasoning or just answers?" and "Does supervised fine-tuning actually improve reasoning quality?"). The model learns to reach correct answers through pattern-matching shortcuts and post-hoc rationalization rather than genuine inference. Standard metrics miss this entirely because they only check whether the final answer is right. A companion finding shows the reasoning chain becoming causally disconnected from the answer: you can truncate it, paraphrase it, or swap in filler, and after fine-tuning the model more often returns the same answer anyway. The reasoning has become performance, not function (see "Does fine-tuning disconnect reasoning steps from final answers?").
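The truncate-or-swap-in-filler probe described above can be sketched as an invariance measurement. This is a minimal illustration, not the procedure from the cited notes: the `answer_fn` interface, the two toy models, and the example data are all hypothetical stand-ins for a real model call. Paraphrasing is left out because it would need a separate paraphraser.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: (question, reasoning steps) -> final answer string.
AnswerFn = Callable[[str, List[str]], str]
Perturb = Callable[[List[str]], List[str]]


def truncate(steps: List[str]) -> List[str]:
    """Early termination: keep only the first half of the chain."""
    return steps[: len(steps) // 2]


def filler(steps: List[str]) -> List[str]:
    """Filler substitution: replace every step with content-free text."""
    return ["..." for _ in steps]


def invariance_rate(answer_fn: AnswerFn,
                    examples: List[Tuple[str, List[str]]],
                    perturb: Perturb) -> float:
    """Fraction of examples whose answer survives perturbing the chain.

    A high rate suggests the chain is not causally driving the answer.
    """
    unchanged = 0
    for question, steps in examples:
        if answer_fn(question, steps) == answer_fn(question, perturb(steps)):
            unchanged += 1
    return unchanged / len(examples)


# Toy stand-ins for real models (assumptions for illustration only).
def shortcut_model(question: str, steps: List[str]) -> str:
    # Ignores the chain entirely and answers from a memorized lookup.
    return {"2+3": "5", "4*2": "8"}[question]


def chain_reading_model(question: str, steps: List[str]) -> str:
    # Reads the answer off the final step, so it depends on the chain.
    return steps[-1].split("=")[-1].strip() if steps else "?"


examples = [
    ("2+3", ["start with 2", "add 3", "2+3 = 5"]),
    ("4*2", ["start with 4", "double it", "4*2 = 8"]),
]

for name, model in [("shortcut", shortcut_model), ("chain-reading", chain_reading_model)]:
    print(name,
          invariance_rate(model, examples, truncate),
          invariance_rate(model, examples, filler))
```

Both toy models get every original question right, so a final-answer metric cannot tell them apart. Only the invariance rate shows which one is actually using its reasoning chain.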

Here's where "data quality mismatch" gets surprising. Several notes suggest the model often isn't learning the *content* of your training data at all; it's learning the *shape* of the output. Models trained on semantically empty or even deliberately wrong instructions perform about as well as those trained on correct ones (43% against a random baseline of 42.6%); what transfers is knowledge of the output space, not task understanding (see "Does instruction tuning teach task understanding or output format?"). The same holds for reasoning traces themselves: systematically corrupted, irrelevant traces teach roughly as well as correct ones, implying the traces act as computational scaffolding rather than meaningful steps (see "Do reasoning traces need to be semantically correct?"). On optimization problems, SFT makes outputs *look* correct, with valid JSON and the right sections, without making them physically feasible (see "Does supervised fine-tuning actually improve reasoning on optimization problems?"). So the "degradation" isn't that bad data poisons good reasoning; it's that SFT was teaching surface form all along, and once you measure reasoning directly the illusion breaks.
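The gap between an output that looks correct and one that is physically feasible is easy to see once the two checks are written separately. The sketch below assumes a made-up dispatch-style problem: the schema (`generation_mw`, `total_cost`), the demand, and the capacity limits are all hypothetical and are not taken from the cited note.

```python
import json
from typing import List

# Hypothetical output schema for an illustrative dispatch problem.
REQUIRED_KEYS = {"generation_mw", "total_cost"}


def looks_correct(raw: str) -> bool:
    """Surface check: parses as JSON and contains the expected sections."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_KEYS.issubset(data)


def is_feasible(raw: str, demand_mw: float, capacity_mw: List[float],
                tol: float = 1e-6) -> bool:
    """Substance check: the proposed solution must satisfy the constraints."""
    if not looks_correct(raw):
        return False
    gen = json.loads(raw)["generation_mw"]
    if len(gen) != len(capacity_mw):
        return False
    # Each unit must stay within its capacity limit.
    within_limits = all(0 <= g <= cap + tol for g, cap in zip(gen, capacity_mw))
    # Total generation must meet demand exactly (within tolerance).
    balanced = abs(sum(gen) - demand_mw) <= tol
    return within_limits and balanced


# Well-formed output that violates both the capacity and balance constraints.
well_formed_but_wrong = '{"generation_mw": [80, 50], "total_cost": 1200}'
print(looks_correct(well_formed_but_wrong))
print(is_feasible(well_formed_but_wrong, demand_mw=100.0, capacity_mw=[60.0, 60.0]))
```

An evaluation that stops at `looks_correct` would score this output as a success, which is the kind of surface-form credit the paragraph above describes.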

That reframes what a *mismatch* even is. If you actually want criteria to transfer (say, judging argument quality), labeled examples alone fail, because the model learns surface patterns instead of principles; you need explicit theoretical frameworks baked into the instruction (see "Can models learn argument quality from labeled examples alone?"). Difficulty mismatch bites too: training on problems that are too hard for the model rewards rare accidental successes as if they were skill, amplifying degenerate shortcuts that then contaminate capabilities the model already had (see "Do overly hard RLVR samples actually harm model capabilities?"). And a related RL result shows fine-tuning sharpening memorization rather than installing procedures: performance collapses on out-of-distribution variants of the same problem (see "Do fine-tuned language models actually learn optimization procedures?").
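The note on overly hard RLVR samples attributes the difficulty problem to group-relative normalization. A small arithmetic sketch shows why a lone lucky success gets amplified. It assumes the commonly described GRPO-style advantage, (reward minus group mean) divided by group standard deviation, with binary rewards. The group size of 16 and the `eps` value are illustrative choices, not figures from the source.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own sampling group.

    Assumed form: (r_i - mean) / (std + eps), using the population std.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Nearly impossible problem: 1 accidental success out of 16 samples.
too_hard = [1.0] + [0.0] * 15
# Well-matched problem: 8 successes out of 16 samples.
well_matched = [1.0] * 8 + [0.0] * 8

adv_hard = group_relative_advantages(too_hard)
adv_matched = group_relative_advantages(well_matched)

print(f"advantage of the lone success:     {adv_hard[0]:.3f}")
print(f"advantage of a success at 50% rate: {adv_matched[0]:.3f}")
```

With binary rewards and success rate p, a success's advantage works out to sqrt((1 - p) / p). The rarer the success, the harder the update pushes toward whatever produced it, whether that was sound reasoning or a fluke.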

The quietly hopeful counter-thread is that the damage is largely about *where* fine-tuning writes. Direct fine-tuning corrupts knowledge stored in lower layers, but decoding-time proxy-tuning closes most of the alignment gap (88-91%) while leaving base weights untouched, shifting style and reasoning without overwriting stored knowledge (see "Can decoding-time tuning preserve knowledge better than weight fine-tuning?"). And LIMA's finding that 1,000 carefully curated examples rival datasets orders of magnitude larger points the same direction: post-training mostly *activates* capabilities the pretrained model already has rather than building new ones (see "Can careful curation replace massive alignment datasets?"). The thing you didn't know you wanted to know: reasoning degradation from SFT may be less about feeding the model wrong answers and more about how little of your data's meaning it was ever absorbing, which is why curation and where-you-tune matter more than sheer volume.
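To make "decoding-time" concrete, here is a minimal sketch of the logit-offset idea usually associated with proxy-tuning: the large base model's next-token logits are shifted by the difference between a small tuned model and its untuned counterpart. The notes do not spell out this arithmetic, so treat the formulation, the function names, and the toy three-token vocabulary as assumptions.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_distribution(base_logits: List[float],
                             tuned_small_logits: List[float],
                             untuned_small_logits: List[float]) -> List[float]:
    """Shift the base model's logits by what tuning changed in a small model.

    No weights of the base model are modified; only its output
    distribution at this decoding step is adjusted.
    """
    shifted = [
        b + (t - u)
        for b, t, u in zip(base_logits, tuned_small_logits, untuned_small_logits)
    ]
    return softmax(shifted)


# Toy logits over a 3-token vocabulary (illustrative numbers only).
base = [2.0, 1.0, 0.5]
small_untuned = [1.0, 1.0, 1.0]
small_tuned = [1.0, 2.5, 1.0]   # tuning made token 1 more likely

print(softmax(base))
print(proxy_tuned_distribution(base, small_tuned, small_untuned))
```

Because the adjustment lives entirely in the logits, whatever the base model stores in its weights is left as it was, which is the property the paragraph above credits for preserving knowledge.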

Sources (11 notes)

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Does supervised fine-tuning actually improve reasoning quality?

SFT improves final-answer accuracy but reduces reasoning informativeness by 38.9% on average. Models reach correct answers through pattern-matching shortcuts rather than genuine inferential reasoning, becoming less auditable despite higher accuracy scores.

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions achieve performance comparable to those trained on full, correct instructions (43% vs. a random baseline of 42.6%). The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Does supervised fine-tuning actually improve reasoning on optimization problems?

Supervised fine-tuning makes model outputs look correct—proper JSON structure, valid identifiers, expected sections—without making them physically feasible. The model learns surface features of solutions, not the reasoning to construct valid ones.

Can models learn argument quality from labeled examples alone?

Fine-tuning on labeled examples fails to transfer quality criteria to new argument types. Models learn surface patterns rather than principled criteria. Explicit instruction using frameworks like RATIO or QOAM significantly improves performance and generalization.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Can careful curation replace massive alignment datasets?

LIMA demonstrates that fine-tuning a strong pretrained model on 1,000 carefully curated examples achieves alignment performance competitive with models trained on orders of magnitude more data, showing that post-training activates existing capabilities rather than building new ones.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher stress-testing claims about reasoning degradation under supervised fine-tuning. The question remains: does data quality mismatch genuinely degrade reasoning, or does SFT simply expose that fine-tuning was always teaching surface form?

What a curated library found — and when (2023–2026, dated claims, not current truth):

• Supervised fine-tuning raises benchmark accuracy while degrading reasoning chain quality by ~39% on information-gain metrics, and reasoning becomes causally disconnected from answers (~2024–2025).
• Models trained on semantically empty or deliberately wrong instructions perform comparably to those trained on correct ones (43% vs. 42.6%), suggesting SFT learns output-space shape, not task content (~2023–2024).
• Corrupted reasoning traces teach roughly as well as correct traces; traces function as scaffolding, not meaningful inference steps (~2024–2025).
• Fine-tuning sharpens memorization over procedures; performance collapses on out-of-distribution variants, and overly-hard RL samples amplify degenerate shortcuts (~2024–2026).
• Proxy-tuning at decode time preserves pretrained knowledge better than direct fine-tuning; LIMA's ~1,000 curated examples rival much larger datasets, implying post-training *activates* latent capability (~2023–2024).

Anchor papers (verify; mind their dates):
• arXiv:2305.11383 (2023) — Do Models Really Learn to Follow Instructions?
• arXiv:2411.15382 (2024) — On the Impact of Fine-Tuning on Chain-of-Thought Reasoning
• arXiv:2504.09762 (2025) — Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!
• arXiv:2605.28388 (2026) — Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs

Your task:

(1) RE-TEST EACH CONSTRAINT. For the 39% reasoning-quality drop, the surface-form hypothesis, and the memorization-over-procedure finding: has newer work (last 6 months, post-2026) shown that scaling, better tokenization, mechanistic steering, or novel RL formulations have *relaxed* these failures? Or do they still appear robust? Plainly separate the durable question ("does SFT teach shallow patterns?") from perishable limitations ("all SFT fails at reasoning").

(2) Surface the strongest CONTRADICTING work: look for papers arguing that reasoning *does* improve under SFT, or that data quality *does* transfer robustly. Where do they disagree with the library's consensus?

(3) Propose 2 research questions that assume the regime may have shifted—e.g., whether mechanistic interpretability has revealed knobs to preserve reasoning, or whether new RL+SFT hybrids have solved the memorization trap.

Cite arXiv IDs; flag anything you cannot ground in a real paper.
