Can fine-tuning ever teach semantic inference instead of amplifying training shortcuts?
This explores whether fine-tuning can actually teach a model to reason about meaning — or whether it mostly sharpens the surface shortcuts a model already leans on (word frequency, output format, answer-matching), and what alternative training setups break that pattern.
The corpus is sobering on the default case and more hopeful on the alternatives. The clearest indictment is that fine-tuning on natural language inference (NLI) makes models lean *harder* on a frequency trick, preferring whichever word appears more often in the corpus, rather than learning what actually entails what; the giveaway is that they get worse on adversarial cases where frequency and truth disagree (Does fine-tuning on NLI teach inference or amplify shortcuts?). The same pattern recurs across very different setups: standard supervised fine-tuning (SFT) raises benchmark accuracy while cutting Information Gain, a measure of the inferential content of the reasoning, by 38.9%, so the model arrives at right answers through post-hoc rationalization (Does supervised fine-tuning improve reasoning or just answers?). Even RL fine-tuning with GRPO often just sharpens template-matching: performance drops sharply on slightly out-of-distribution variants of the same problem (Do fine-tuned language models actually learn optimization procedures?).
The most unsettling result in the collection suggests the shortcut runs deeper than we'd guess: instruction tuning works almost as well when you train on *semantically empty or deliberately wrong* instructions as on correct ones (a reported 43%, against a random baseline of 42.6%). What transfers isn't task understanding but familiarity with the shape of the output (Does instruction tuning teach task understanding or output format?). A related line shows fine-tuning actively loosens the causal link between a model's reasoning steps and its answer: cut the chain short, paraphrase it, or stuff it with filler, and the answer stays the same more often than it did before fine-tuning. The reasoning has become performance, not function (Does fine-tuning disconnect reasoning steps from final answers?).
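The faithfulness tests described above can be sketched as a small harness. This is a minimal illustration, not the procedure from the underlying study: the `(question, chain) -> answer` interface, the helper names, and both toy "models" are assumptions made for the example, and the paraphrasing test is left out because it needs a real paraphraser.

```python
from typing import Callable, List

# Hypothetical interface: a model is any function (question, reasoning_chain) -> answer.
AnswerFn = Callable[[str, List[str]], str]
Perturbation = Callable[[List[str]], List[str]]


def truncate(chain: List[str]) -> List[str]:
    """Early termination: keep only the first half of the reasoning steps."""
    return chain[: len(chain) // 2]


def add_filler(chain: List[str]) -> List[str]:
    """Filler substitution: replace every step with content-free text."""
    return ["..." for _ in chain]


def invariance_rate(
    answer_fn: AnswerFn,
    question: str,
    chain: List[str],
    perturbations: List[Perturbation],
) -> float:
    """Fraction of perturbations that leave the final answer unchanged."""
    original = answer_fn(question, chain)
    unchanged = sum(answer_fn(question, p(chain)) == original for p in perturbations)
    return unchanged / len(perturbations)


# Two toy stand-ins (not real models) that mark the two extremes.
def performative_model(question: str, chain: List[str]) -> str:
    return "4"  # ignores the reasoning chain entirely


def functional_model(question: str, chain: List[str]) -> str:
    # Reads its answer off the last reasoning step.
    return chain[-1].split("=")[-1].strip() if chain else "unknown"


question = "What is 2 + 2?"
chain = ["start with 2", "add 2", "total = 4"]
tests = [truncate, add_filler]

print(invariance_rate(performative_model, question, chain, tests))  # 1.0
print(invariance_rate(functional_model, question, chain, tests))    # 0.0
```

A rate near 1.0 means the answer does not depend on the chain, which is the "performance, not function" signature; a lower rate means perturbing the reasoning actually moves the answer.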
So what flips it? The corpus keeps pointing to the same lever: train on the *quality of the inference*, not just the correctness of the token. Reinforcement learning from augmented generation (RLAG) rewards explanation rationality alongside the answer, cycling between generating with and without the source until coherent knowledge structures are internalized, and it beats SFT precisely because it stops optimizing token-level correctness alone (Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?). DPO does something similar by training on explicit *wrong* examples alongside correct ones, which directly targets the failure modes plain SFT papers over (Can small models match large models on function calling?). And reinforcement learning with verifiable rewards (RLVR) appears to work mainly by adjusting the ~20% of high-entropy 'forking' tokens where a real decision happens, evidence that genuine reasoning improvement lives in a specific, identifiable signal rather than in blanket imitation (Do high-entropy tokens drive reasoning model improvements?).
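To make the 'forking token' idea concrete, here is a minimal sketch of how one might pick out the highest-entropy positions in a sequence and restrict a loss to them. The function names, the toy logits, and the choice to select by rank (top 20% of positions) are assumptions for illustration; the study behind this note may select tokens differently.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one token position."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def forking_token_mask(per_token_logits: List[List[float]], top_fraction: float = 0.2) -> List[bool]:
    """Mark the top `top_fraction` of positions by entropy as 'forking' tokens."""
    ents = [entropy(softmax(logits)) for logits in per_token_logits]
    k = max(1, round(top_fraction * len(ents)))
    top = set(sorted(range(len(ents)), key=lambda i: ents[i], reverse=True)[:k])
    return [i in top for i in range(len(ents))]


def masked_mean(losses: List[float], mask: List[bool]) -> float:
    """Average the per-token loss over the selected positions only."""
    kept = [loss for loss, keep in zip(losses, mask) if keep]
    return sum(kept) / len(kept)


# Toy sequence: four confident positions and one genuinely uncertain one (index 2).
logits = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [1.0, 1.0, 1.0], [0.0, 0.0, 10.0], [10.0, 0.0, 0.0]]
mask = forking_token_mask(logits, top_fraction=0.2)
print(mask)  # [False, False, True, False, False]
print(masked_mean([0.1, 0.2, 2.0, 0.1, 0.3], mask))  # 2.0
```

In a real training loop the mask would gate which positions contribute gradient, so the confident, low-entropy tokens are left alone.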
There's also a structural escape hatch worth knowing about. Some methods avoid corrupting the model's knowledge at all: proxy-tuning steers behavior at decoding time and preserves pretrained knowledge far better, because it leaves the base weights untouched, whereas direct fine-tuning damages the lower layers where facts are stored (Can decoding-time tuning preserve knowledge better than weight fine-tuning?). And Quiet-STaR teaches a model to generate rationales at every token during pretraining on ordinary internet text, letting reasoning competence emerge as a *byproduct* of better language modeling rather than from a labeled inference dataset (Can models learn reasoning from predicting any text?).
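The decoding-time steering can be sketched in a few lines. This is an illustration under an assumption the notes do not spell out: that proxy-tuning adds to the base model's logits the difference between a small tuned 'expert' and its untuned counterpart (the 'anti-expert'), then samples from the shifted distribution. The function names and toy logits are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a vocabulary."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_probs(
    base_logits: List[float],
    expert_logits: List[float],
    antiexpert_logits: List[float],
) -> List[float]:
    """Shift the base model's logits by (expert - anti-expert); no weights change."""
    shifted = [b + (e - a) for b, e, a in zip(base_logits, expert_logits, antiexpert_logits)]
    return softmax(shifted)


# Toy three-token vocabulary: the base model prefers token 0,
# while the tuned expert pushes toward token 1 relative to its untuned twin.
base = [2.0, 1.0, 0.0]
expert = [0.0, 2.0, 0.0]
antiexpert = [0.0, 0.0, 0.0]

probs = proxy_tuned_probs(base, expert, antiexpert)
print(probs.index(max(probs)))  # 1

# With no difference between expert and anti-expert, the base distribution is unchanged.
print(proxy_tuned_probs(base, expert, expert) == softmax(base))  # True
```

The second check shows the design point: the base model's distribution is only ever offset, never overwritten, which is consistent with the note's claim that stored knowledge is preserved.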
The deeper lesson the collection leaves you with: 'shortcut vs. inference' isn't really a yes-or-no question about fine-tuning; it's about what your reward measures. When training grades only the final answer, models learn the cheapest route to that answer, which is almost always a surface correlate. When training grades the *reasoning*, through verifiable explanation, negative examples, or entropy-targeted signals, semantic inference becomes the thing being selected for. Two adjacent findings sharpen the boundary: argument-quality judgment won't transfer from labeled examples alone but does improve when you hand the model an explicit theoretical framework (Can models learn argument quality from labeled examples alone?), and prompting can only reorganize knowledge already present, never inject what's missing (Can prompt optimization teach models knowledge they lack?). That is a reminder that some of what looks like 'failure to learn inference' is really a ceiling set long before fine-tuning began.
Sources (12 notes)
NLI fine-tuning increases LLM reliance on corpus-level frequency patterns (hypernyms more common than hyponyms) rather than semantic relationships. Models perform worse on adversarial cases where frequency patterns contradict actual entailment labels, showing the shortcut was learned more deeply.
Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.
Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.
Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions, reaching 43% against a random baseline of 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.
Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.
RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.
Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
Quiet-STaR trains language models to generate rationales at every token position during pretraining on arbitrary internet text, enabling general reasoning without task-specific datasets. Rationale quality is judged by predictive accuracy rather than labeled correctness, allowing reasoning competence to emerge as a side effect of improved language modeling.
Fine-tuning on labeled examples fails to transfer quality criteria to new argument types. Models learn surface patterns rather than principled criteria. Explicit instruction using frameworks like RATIO or QOAM significantly improves performance and generalization.
Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.