Does fine-tuning on NLI teach inference or amplify shortcuts?

When LLMs are fine-tuned on natural language inference datasets, do they learn genuine reasoning abilities or become better at exploiting statistical patterns in the training data? Understanding this distinction matters for assessing model capabilities.

Synthesis note · 2026-02-21 · sourced from Natural Language Inference

"LLMs are Frequency Pattern Learners in NLI" identifies a consistent frequency bias in NLI datasets: in positive (entailment) instances, predicates in hypotheses are more frequent in the training data than predicates in premises. LLMs exploit this pattern. The disturbing finding is that fine-tuning on NLI corpora increases reliance on the frequency bias rather than decreasing it.

The mechanism connects to a real property of language. Hypernyms (more general terms: "animal") are more frequent than hyponyms (more specific terms: "dog") in natural text. Since upward entailment works from specific to general (SPRINT entails RUN), frequency can be a useful proxy for entailment direction. Fine-tuning teaches models to exploit this proxy more aggressively.
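The proxy described above can be sketched in a few lines. This is a minimal illustration, not the paper's method: the predicate counts are invented toy numbers, and `frequency_proxy` is a hypothetical helper name. It simply guesses "entailment" whenever the hypothesis predicate is more frequent than the premise predicate.

```python
from collections import Counter

# Hypothetical toy counts for illustration only; real values would come
# from predicate frequencies in a large corpus.
PREDICATE_COUNTS = Counter({"sprint": 40, "run": 900, "move": 2500})


def frequency_proxy(premise_pred, hypothesis_pred, counts=PREDICATE_COUNTS):
    """Guess entailment from frequency alone (assumed heuristic).

    Returns True ("entailment") when the hypothesis predicate is more
    frequent than the premise predicate, and False otherwise.
    """
    return counts[hypothesis_pred] > counts[premise_pred]


# Specific -> general: the proxy agrees with the true entailment direction.
specific_to_general = frequency_proxy("sprint", "run")
# General -> specific: the proxy rejects, which is also correct here.
general_to_specific = frequency_proxy("run", "sprint")
```

With these toy counts the heuristic gets SPRINT entails RUN right without consulting meaning at all, which is why it can pass for inference on typical instances.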

The problem: frequency is a statistical artifact, not a semantic relationship. It works often enough to look like learning on standard benchmarks, but it fails on adversarial cases where the frequency pattern disagrees with the actual entailment label. After fine-tuning, LLMs perform significantly worse on adversarial instances than their base models do: they have learned the shortcut more deeply.
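One way to make the adversarial comparison concrete is to split an evaluation set by whether the frequency pattern agrees with the gold label, then score each subset separately. The sketch below assumes each instance already carries predicate frequencies and a model prediction; the `Instance` fields, the helper names, and the toy data are all hypothetical and are not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    premise_freq: int     # corpus frequency of the premise predicate (assumed given)
    hypothesis_freq: int  # corpus frequency of the hypothesis predicate (assumed given)
    label: bool           # gold label, True = entailment
    prediction: bool      # model output, True = entailment


def is_frequency_consistent(inst):
    """True when the frequency proxy would predict the gold label."""
    proxy_says_entail = inst.hypothesis_freq > inst.premise_freq
    return proxy_says_entail == inst.label


def accuracy_by_subset(instances):
    """Accuracy on frequency-consistent vs. adversarial instances."""
    buckets = {"consistent": [], "adversarial": []}
    for inst in instances:
        key = "consistent" if is_frequency_consistent(inst) else "adversarial"
        buckets[key].append(inst.prediction == inst.label)
    return {k: (sum(v) / len(v) if v else None) for k, v in buckets.items()}


# Made-up instances, purely to exercise the split.
toy = [
    Instance(40, 900, True, True),    # consistent, model correct
    Instance(900, 40, False, False),  # consistent, model correct
    Instance(900, 40, True, False),   # adversarial, model follows frequency and errs
    Instance(40, 900, False, True),   # adversarial, model follows frequency and errs
    Instance(900, 40, True, True),    # adversarial, model correct
]
scores = accuracy_by_subset(toy)
```

A model that leans on the shortcut would show a large gap between the two subsets; comparing that gap before and after fine-tuning is the kind of check the note describes.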

This is a general pattern in the vault. "Can models pass tests while missing the actual grammar?" shows that surface heuristics enable correct behavior on easy cases while degrading robustness on unusual ones. Fine-tuning amplifies this problem because the training signal rewards the heuristic. The model that appears to "learn inference" has instead learned to use training data statistics more efficiently.

What distinguishes this from attestation bias (memorization of specific sentences): frequency bias operates at the corpus level. It is a statistical regularity learned from the distribution of natural text, not from specific memorized statements. Both are shortcuts that substitute for inference, but they originate at different levels of the training data.

Related concepts in this collection

Do LLMs predict entailment based on what they memorized?
Explores whether language models make entailment decisions by recognizing memorized facts about the hypothesis rather than reasoning through the logical relationship between premise and hypothesis.
the complementary sentence-level bias; both are shortcuts substituting for inference

Can models pass tests while missing the actual grammar?
Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.
same pattern: surface statistics enabling apparent competence on easy cases

Why do language models fail at communicative optimization?
LLMs excel at learning surface statistical patterns from text but struggle with deeper principles of how language achieves efficient communication. What distinguishes these two types of linguistic knowledge?
the broader principle: corpus statistics as substitute for semantic understanding

Does supervised fine-tuning actually improve reasoning quality?
While SFT boosts final-answer accuracy, does it degrade the quality and informativeness of the reasoning steps that justify those answers? This matters for high-stakes domains requiring auditable decision-making.
cross-domain parallel: SFT amplifies accuracy-correlated shortcuts (domain patterns) at the cost of reasoning quality; same fine-tuning mechanism operating on different training distribution features

Why do language models struggle with historical legal cases?
Explores whether LLMs' training data recency bias creates systematic performance degradation on older cases, and what this reveals about how models represent temporal information in specialized domains.
same mechanism on a different axis: fine-tuning amplifies temporal recency distribution rather than frequency distribution