Does fine-tuning on NLI tasks amplify or reduce frequency bias in language models?
This explores a sharp result about what happens when you fine-tune a language model on natural language inference (NLI) tasks — and whether that training teaches real reasoning or just sharpens a statistical shortcut.
The corpus has a direct answer, and it's the uncomfortable one: NLI fine-tuning *amplifies* frequency bias rather than reducing it (see 'Does fine-tuning on NLI teach inference or amplify shortcuts?'). Because hypernyms (general words like 'animal') show up more often in text than hyponyms (specific words like 'spaniel'), models learn to lean on that frequency gap as a proxy for entailment. The tell is adversarial cases: when frequency points one way and the actual entailment label points the other, fine-tuned models perform *worse*, meaning the training didn't correct the shortcut, it carved it deeper.
What makes this more than a one-off finding is that the same frequency-tracking habit shows up across completely different tasks. Models systematically prefer higher-frequency surface phrasings over rare-but-equivalent paraphrases in math, translation, commonsense reasoning, and tool calling alike (see 'Do language models really understand meaning or just surface frequency?'). So NLI fine-tuning isn't introducing a new flaw; it's pouring fuel on a mechanism that's already the model's default. The model is tracking statistical mass from pretraining and dressing it up as meaning-recognition.
That connects to a deeper question the corpus keeps circling: where do these biases actually live, and can fine-tuning move them? The evidence says biases are *planted in pretraining and only swayed, not removed, by fine-tuning* (see 'Where do cognitive biases in language models come from?'). That reframes the NLI result entirely. Fine-tuning didn't fail to teach inference because the recipe was wrong; it failed because fine-tuning operates downstream of pretraining, which is where the frequency prior was formed. You're nudging a surface, not rewriting a foundation.
The pattern rhymes with other 'fine-tuning teaches the wrong thing' findings. RL fine-tuning, for instance, tends to sharpen memorized template-matching rather than installing a genuine reasoning procedure, and out-of-distribution variants expose the gap (see 'Do fine-tuned language models actually learn optimization procedures?'). And the broader linguistic picture is that statistical learning captures surface patterns but stumbles on deep grammatical structure as complexity rises (see 'Why do large language models fail at complex linguistic tasks?'). NLI sits right at that fault line: entailment is a structural, semantic relationship, but the model keeps reaching for the surface frequency signal because that's what's cheapest.
The thing worth walking away with: 'fine-tuning on task X' doesn't reliably teach the *skill* behind task X — it often just amplifies whatever shortcut already correlates with the right answer in your training data. If you want to know whether a model learned inference or learned frequency, you have to build adversarial cases where the two disagree. Without that test, an amplified shortcut looks exactly like improved reasoning on the scoreboard.
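One way to make that test concrete is to split an evaluation set by whether a pure frequency shortcut would get each item right, then score the two halves separately. The sketch below is only an illustration under stated assumptions: the names (`NLIExample`, `frequency_heuristic`, `split_by_agreement`), the word-level framing of entailment, and the frequency counts are all invented for this example and do not come from the cited notes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class NLIExample:
    premise_word: str     # e.g. "spaniel"
    hypothesis_word: str  # e.g. "animal"
    entails: bool         # gold label: does the premise word entail the hypothesis word?


def frequency_heuristic(ex: NLIExample, freq: Dict[str, int]) -> bool:
    """The shortcut: guess 'entails' whenever the hypothesis word is the more frequent one."""
    return freq.get(ex.hypothesis_word, 0) > freq.get(ex.premise_word, 0)


def split_by_agreement(
    examples: List[NLIExample], freq: Dict[str, int]
) -> Tuple[List[NLIExample], List[NLIExample]]:
    """Separate items where the shortcut matches the gold label from items where it does not."""
    consistent, adversarial = [], []
    for ex in examples:
        if frequency_heuristic(ex, freq) == ex.entails:
            consistent.append(ex)
        else:
            adversarial.append(ex)
    return consistent, adversarial


def accuracy(predict: Callable[[NLIExample], bool], examples: List[NLIExample]) -> float:
    """Fraction of examples where the prediction equals the gold label."""
    if not examples:
        return float("nan")
    return sum(predict(ex) == ex.entails for ex in examples) / len(examples)


# Toy corpus counts (made up for illustration only).
FREQ = {"animal": 1000, "car": 800, "dog": 500, "canine": 30, "spaniel": 20}

EXAMPLES = [
    NLIExample("spaniel", "animal", True),   # frequency agrees with the label
    NLIExample("animal", "spaniel", False),  # frequency agrees with the label
    NLIExample("spaniel", "dog", True),      # frequency agrees with the label
    NLIExample("spaniel", "car", False),     # frequent hypothesis, but no entailment
    NLIExample("dog", "canine", True),       # rare hypothesis, but entailment holds
]


def shortcut_model(ex: NLIExample) -> bool:
    """Stand-in for a model that has learned nothing but the frequency shortcut."""
    return frequency_heuristic(ex, FREQ)


consistent, adversarial = split_by_agreement(EXAMPLES, FREQ)
print("consistent accuracy: ", accuracy(shortcut_model, consistent))
print("adversarial accuracy:", accuracy(shortcut_model, adversarial))
```

Replacing `shortcut_model` with a real model's prediction function would give the comparison described above: a model that learned inference should score similarly on both subsets, while one leaning on frequency should show a large gap between them.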
Sources (5 notes)
NLI fine-tuning increases LLM reliance on corpus-level frequency patterns (hypernyms more common than hyponyms) rather than semantic relationships. Models perform worse on adversarial cases where frequency patterns contradict actual entailment labels, showing the shortcut was learned more deeply.
LLMs show consistent preference for higher-frequency surface forms over semantically equivalent rare paraphrases across math, machine translation, commonsense reasoning, and tool calling. This suggests models track statistical mass from pretraining rather than meaning-recognition as their primary mechanism.
A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.
Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.
Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.