INQUIRING LINE

Why does NLI fine-tuning amplify frequency bias instead of teaching inference?

This explores why teaching a model the inference task (NLI = natural language inference: deciding whether one sentence entails another) ends up sharpening a counting shortcut — preferring whichever word is more common in the corpus — instead of installing genuine semantic reasoning.


The corpus has a surprisingly unified answer to why fine-tuning on an inference task can deepen a frequency shortcut rather than teach inference: fine-tuning doesn't write new reasoning into a model so much as it amplifies whatever the model already leaned on. The direct finding is that NLI fine-tuning makes models rely *more* on corpus-level frequency patterns. Hypernyms ("animal") tend to appear more often than hyponyms ("dog"), and the model learns to ride that statistical gradient instead of checking actual entailment. The tell is adversarial cases: when frequency and the true label disagree, the fine-tuned model performs *worse*, which means the shortcut got reinforced, not corrected (see "Does fine-tuning on NLI teach inference or amplify shortcuts?").
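
To make the adversarial test concrete, here is a minimal Python sketch. It is not the cited study's code: it assumes a toy word-pair version of NLI, a made-up frequency table (`TOY_FREQ`), and hypothetical function names. It splits examples by whether a pure frequency rule agrees with the gold label, then scores a predictor on each split.

```python
from typing import Callable, Dict, List, Tuple

# (premise_word, hypothesis_word, gold_entails): a toy stand-in for NLI pairs.
Example = Tuple[str, str, bool]

# Hypothetical corpus counts, invented purely for illustration.
TOY_FREQ: Dict[str, int] = {
    "animal": 900, "dog": 400, "poodle": 30, "vehicle": 200, "car": 800,
}

EXAMPLES: List[Example] = [
    ("dog", "animal", True),
    ("poodle", "dog", True),
    ("animal", "dog", False),
    ("car", "vehicle", True),    # hypernym is rarer here, so frequency misleads
    ("vehicle", "car", False),
]


def frequency_rule(premise: str, hypothesis: str) -> bool:
    """Shortcut: predict 'entails' whenever the hypothesis word is more common."""
    return TOY_FREQ.get(hypothesis, 0) > TOY_FREQ.get(premise, 0)


def split_by_shortcut(examples: List[Example]) -> Tuple[List[Example], List[Example]]:
    """Separate examples where the shortcut matches the gold label from those where it fails."""
    consistent = [ex for ex in examples if frequency_rule(ex[0], ex[1]) == ex[2]]
    adversarial = [ex for ex in examples if frequency_rule(ex[0], ex[1]) != ex[2]]
    return consistent, adversarial


def accuracy(predict: Callable[[str, str], bool], examples: List[Example]) -> float:
    """Fraction of examples the predictor labels correctly (NaN for an empty split)."""
    if not examples:
        return float("nan")
    return sum(predict(p, h) == gold for p, h, gold in examples) / len(examples)


consistent, adversarial = split_by_shortcut(EXAMPLES)
# Swap frequency_rule for a real model's predict function to run the same check.
print("consistent:", accuracy(frequency_rule, consistent))
print("adversarial:", accuracy(frequency_rule, adversarial))
```

A predictor that has absorbed the shortcut should score well on the consistent split and poorly on the adversarial one. By construction, the frequency rule itself sits at the two extremes.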

Why would gradient descent prefer the shortcut? Because the shortcut is already there before fine-tuning ever starts. A causal study varying random seeds and swapping tuning data found that models sharing a pretrained backbone keep the same bias fingerprint no matter what they're fine-tuned on: biases are *planted* in pretraining and only *nudged* afterward (see "Where do cognitive biases in language models come from?"). Fine-tuning operates on a model that has already decided frequency is a good predictor, so the cheapest way to lower training loss is to lean harder on that prior rather than build a new entailment-checking circuit.

This is part of a broader pattern where fine-tuning sharpens what exists instead of teaching procedures. RL-tuned models look like they reason but collapse on out-of-distribution variants of the same problem, revealing template-matching rather than an installed algorithm (see "Do fine-tuned language models actually learn optimization procedures?"). And RL post-training tends to converge on a single dominant *format* already present in pretraining, suppressing alternatives within the first epoch: again, amplification of a pre-existing distribution, not creation of new capability (see "Does RL training collapse format diversity in pretrained models?"). NLI frequency bias is the same story told with statistics instead of formatting.

There's a deeper reason the shortcut is so sticky: these models reason semantically, not symbolically. When you decouple semantic content from the logical task (give the correct rule but strip the familiar word associations), performance collapses, because the model is manipulating token associations rather than formal relations (see "Do large language models reason symbolically or semantically?"). Entailment *is* a formal relation, so a model built to follow associations will substitute the nearest associative proxy it has, and "which word is more common" is exactly such a proxy. The same dynamic shows up when strong training priors simply override contradicting in-context information: textual prompting can't dislodge the prior; you need causal intervention in the representations themselves (see "Why do language models ignore information in their context?").
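
One way to picture the decoupling step is the Python sketch below. It is an illustrative assumption, not the cited paper's procedure: the helper name `decouple` and the placeholder scheme (`zq0`, `zq1`, ...) are hypothetical. It swaps chosen content words for meaningless tokens while leaving the logical form of the prompt untouched.

```python
import re
from typing import Sequence


def decouple(text: str, terms: Sequence[str]) -> str:
    """Replace each listed term with a meaningless placeholder, consistently.

    The logical structure of the text is preserved; only the familiar
    word associations are removed.
    """
    # Map each term (case-insensitively) to a placeholder such as zq0, zq1, ...
    mapping = {term.lower(): f"zq{i}" for i, term in enumerate(terms)}
    if not mapping:
        return text
    # Match whole words only, so "dog" does not rewrite part of "dogma".
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in mapping) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: mapping[m.group(0).lower()], text)


original = "Every dog is an animal. Rex is a dog. Is Rex an animal?"
symbolic = decouple(original, ["dog", "animal", "Rex"])
print(symbolic)
```

If a model answers the original prompt correctly but fails the rewritten one, that gap is the kind of evidence the paragraph above describes: the rule is identical, and only the associations have changed.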

The quietly useful takeaway: if biases live in pretraining and fine-tuning only modulates them, then fixing a learned shortcut with more task-specific fine-tuning is pushing on the wrong layer. The corpus points elsewhere, toward methods that change the *signal* rather than the data, like using model confidence as an intrinsic reward to rank reasoning traces (see "Can model confidence work as a reward signal for reasoning?"), or supplying explicit negative examples that target the exact failure mode rather than hoping more positive examples crowd it out (see "Can small models match large models on function calling?"). Frequency bias survives fine-tuning because fine-tuning was never the place it was born.
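
To show how explicit negatives enter the training signal, here is a minimal Python sketch of the pairwise DPO loss as it is commonly written. It is not code from the cited note. The function name, the scalar log-probabilities, and `beta=0.1` are placeholder assumptions; in practice the log-probabilities would be summed token log-probs from the policy and from a frozen reference model.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value; controls how sharply the margin is enforced
) -> float:
    """Pairwise DPO loss for a single (chosen, rejected) pair.

    loss = -log(sigmoid(beta * margin)), where margin is how much more the
    policy favors the chosen answer over the rejected one, relative to the
    reference model.
    """
    margin = (policy_logp_chosen - ref_logp_chosen) - (
        policy_logp_rejected - ref_logp_rejected
    )
    z = beta * margin
    # Numerically stable -log(sigmoid(z)), i.e. softplus(-z).
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Policy already prefers the chosen (correct) answer: lower loss.
good = dpo_loss(-1.0, -5.0, -2.0, -2.0)
# Policy prefers the rejected (incorrect) answer: higher loss.
bad = dpo_loss(-5.0, -1.0, -2.0, -2.0)
print(good < bad)
```

The rejected example appears in the loss with its own term, so the gradient pushes probability away from the specific failure rather than only toward the correct output.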


Sources (8 notes)

Does fine-tuning on NLI teach inference or amplify shortcuts?

NLI fine-tuning increases LLM reliance on corpus-level frequency patterns (hypernyms more common than hyponyms) rather than semantic relationships. Models perform worse on adversarial cases where frequency patterns contradict actual entailment labels, showing the shortcut was learned more deeply.

Where do cognitive biases in language models come from?

A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an NLI robustness researcher. The precise question: *Can fine-tuning ever teach genuine inference, or does it always amplify pre-existing shortcuts?* This remains open despite recent work.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat as perishable until re-tested:
- Fine-tuning on NLI amplifies frequency bias (e.g., hypernym–hyponym gradients) rather than installing entailment reasoning; models perform *worse* on adversarial cases where frequency contradicts the true label (~2025, arXiv:2505.21011).
- Cognitive biases are planted in pretraining; fine-tuning only modulates them and does not overwrite them. Models sharing a backbone retain the same bias fingerprint regardless of fine-tuning data (~2025, arXiv:2507.07186).
- RL post-training converges on a single dominant format already present in pretraining within the first epoch, suppressing alternatives — amplification, not creation (~2025, arXiv:2504.07912).
- LLMs reason semantically (manipulating token associations) rather than symbolically (formal relations). When semantic content is decoupled from the logical task, performance collapses (~2023, arXiv:2305.14825).
- Training priors override contradicting in-context information; causal intervention in representations is needed, not prompting (~2025, arXiv:2504.09522).

Anchor papers (verify; mind their dates):
- arXiv:2305.14825 (2023): In-context semantic reasoning vs. symbolic reasoning.
- arXiv:2505.21011 (2025): NLI frequency pattern learning.
- arXiv:2507.07186 (2025): Biases planted in pretraining, modulated by fine-tuning.
- arXiv:2504.07912 (2025): RL post-training amplification of pretraining formats.

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (GPT-4o, o1, Claude 4), training methods (DPO, IPO, constitutional AI), or tooling (layer-wise intervention, steering, representation editing) have since RELAXED or OVERTURNED it. Separate the durable question ("Can fine-tuning install reasoning?") from perishable limitations ("only frequency shortcuts exist"); cite what resolved each, and flag where constraints still hold.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Whose empirical findings challenge the pretraining-primacy or semantic-reasoning-only thesis?
(3) Propose 2 research questions that ASSUME the regime may have moved — e.g., if newer models *can* learn symbolic entailment under specific conditions, what are those conditions? If causal steering works, can it be scaled to NLI fine-tuning?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
