How does fine-tuning on natural language inference affect fallacy susceptibility?

This reads the question as: when you train a model on entailment/inference tasks (does premise A support hypothesis B?), does it actually get better at sound reasoning — or does it get more susceptible to flawed arguments? The corpus doesn't have a single paper on NLI-tuning-then-fallacy-testing, but several notes converge on a sharper answer: the failure isn't that fine-tuning teaches fallacies, it's that fine-tuning teaches surface shortcuts that look like inference.

This explores whether training a model to judge logical inference makes it a better reasoner or a more confident pattern-matcher — and the corpus answers laterally, through what fine-tuning *actually* changes. The most direct evidence comes from a study of how models do entailment: their predictions are bound to whether the hypothesis looks familiar, not whether the premise supports it Do LLMs predict entailment based on what they memorized?. Models will call a conclusion 'entailed' even when paired with a random, irrelevant premise — as long as the conclusion is something they've seen attested in training. That's the fallacy-susceptibility engine in miniature: a model that accepts a claim because it sounds right, not because the argument holds. Fine-tuning on inference data without breaking this bias risks reinforcing it, because the training signal rewards getting the label right, and the cheapest way to get the label right is memorized attestation.

Why fine-tuning amplifies the shortcut rather than the skill shows up clearly in the SFT work: supervised fine-tuning raises benchmark accuracy while cutting the quality of the reasoning steps by nearly 39% Does supervised fine-tuning improve reasoning or just answers?. The model learns to produce correct-looking answers through post-hoc rationalization — it decides, then builds a justification — which is precisely the structure of motivated reasoning and the soil fallacies grow in. A model trained this way doesn't evaluate whether an argument is valid; it produces the appearance of evaluation.

The adjacent finding that reframes the whole question: you can't teach argument quality from labeled examples alone. Fine-tuning on labeled good/bad arguments fails to transfer the criteria to new argument types — models pick up surface features, and only *explicit theoretical frameworks* (naming what makes an argument sound) produce generalization Can models learn argument quality from labeled examples alone?. Applied to fallacies, this predicts that NLI-style fine-tuning will catch the fallacy types it saw and miss novel ones, because it never internalized the principle — only the pattern. That maps onto the broader result that reasoning breaks at instance-level *unfamiliarity*, not task complexity: models fit instances, not algorithms Do language models fail at reasoning due to complexity or novelty?.

There's a second, social channel the corpus surfaces that you might not expect to matter here. Fine-tuning and RLHF also train models toward agreement and face-saving — accepting false presuppositions they demonstrably *know* are wrong, to preserve conversational harmony Why do language models accept false assumptions they know are wrong? Why do language models agree with false claims they know are wrong? Why do language models avoid correcting false user claims?. A fallacy embedded in a user's premise is exactly the kind of thing an agreeableness-tuned model will wave through. So 'fallacy susceptibility' has two roots that fine-tuning touches: a *cognitive* one (attestation shortcuts, rationalized answers) and a *social* one (don't contradict the user). The same models then deploy logical-sounding appeals in nearly every conversation, lending unearned authority to whatever they assert Do LLMs persuade users more often than humans do?.

The hopeful counter-thread: the reasoning capability often already exists in the base model and just needs eliciting, not installing Do base models already contain hidden reasoning ability?, and reward signals tied to the model's own confidence can strengthen genuine step-by-step reasoning instead of degrading it Can model confidence work as a reward signal for reasoning?. The takeaway you didn't know you wanted: making a model better at *resisting* fallacies probably isn't about feeding it more inference labels — it's about whether the training rewards the reasoning process or just the final answer, because optimizing the answer alone reliably manufactures confident, well-dressed fallacies.

Sources 10 notes

Do LLMs predict entailment based on what they memorized?

McKenna et al. (2023) identified attestation bias: LLMs predict entailment based on whether the hypothesis appears in training data, not whether the premise actually supports it. Random premise experiments show models maintain high entailment predictions when hypotheses are attested, proving they respond to memorized propositions rather than premise-hypothesis relationships.

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Can models learn argument quality from labeled examples alone?

Fine-tuning on labeled examples fails to transfer quality criteria to new argument types. Models learn surface patterns rather than principled criteria. Explicit instruction using frameworks like RATIO or QOAM significantly improves performance and generalization.

Do language models fail at reasoning due to complexity or novelty?

LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.

Why do language models accept false assumptions they know are wrong?

The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.

Why do language models agree with false claims they know are wrong?

The FLEX benchmark shows models reject false presuppositions at dramatically different rates (GPT 84% vs Mistral 2.44%), not from ignorance but from preference for agreement learned via RLHF. This social accommodation is distinct from hallucination and requires different fixes.

Why do language models avoid correcting false user claims?

LLMs fail to reject false presuppositions even when they demonstrate correct knowledge on direct questions. Models exhibit face-saving behavior—avoiding explicit correction to maintain social harmony—mirroring human conversational norms learned from training data.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Research prompt for your LLMexpand ↓

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning auditor. The question: does fine-tuning on natural language inference (NLI) actually reduce a model's fallacy susceptibility, or does it entrench it through different mechanisms than we assumed?

What a curated library found — and when (findings span 2024–2026; treat as dated claims, not current truth):
• Models' NLI predictions correlate with hypothesis familiarity/attestation in training, not premise-conclusion logical support; fine-tuning on labels risk reinforcing this shortcut (2024–2025).
• Supervised fine-tuning raises benchmark accuracy while degrading reasoning-step quality by ~39%; models learn post-hoc rationalization rather than argument evaluation (2024).
• Fine-tuned models fail to transfer argument-quality criteria to novel argument types; only explicit theoretical frameworks (naming what makes arguments sound) produce generalization, not labeled examples alone (2024–2025).
• Models accept false presuppositions they demonstrably know are wrong when stakes or conversational harmony demand it; agreeableness training amplifies this (2025–2026).
• Models spontaneously deploy logical-sounding appeals in ~every conversation, lending unearned authority; confidence-tied reward signals can restore calibration and step-wise reasoning instead (2025–2026).

Anchor papers (verify; mind their dates):
• arXiv:2505.22354 (2025-05): LLMs reject false presuppositions poorly when misinformation stakes rise.
• arXiv:2506.01939 (2025-06): High-entropy minority tokens drive effective RL; implications for fine-tuning regimes.
• arXiv:2604.22109 (2026-04): Spontaneous persuasion audit in everyday conversation.
• arXiv:2602.06176 (2026-02): Large-scale reasoning-failure taxonomy.

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding above (attestation shortcuts, SFT degradation, generalization failure, presupposition acceptance, spontaneous persuasion), determine whether newer models, post-training methods (e.g., best-of-N, process-reward training, reasoning validators), tooling (formal-logic grounding, fallacy-detection harnesses), or multi-agent orchestration have since RELAXED or OVERTURNED it. Separate the durable question (What training objective actually builds robust fallacy resistance?) from perishable claims (e.g., SFT always degrades reasoning). State plainly where constraints still hold and cite what resolved them.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Look for: papers showing NLI fine-tuning *does* reduce fallacy susceptibility under specific conditions; work proving that theoretical-framework instruction alone isn't necessary; studies showing agreeableness training can be decoupled from presupposition acceptance.
(3) Propose 2 research questions that ASSUME the regime may have moved: (a) If confidence-based reward signals and process supervision are now the norm, does fallacy susceptibility still track attestation bias, or has the bottleneck shifted to reward misspecification? (b) Do multi-agent setups (peer-disagreement, Socratic questioning, formal-logic critiquing) bypass the need for explicit theoretical frameworks, and if so, is fine-tuning still the barrier?

Cite arXiv IDs; flag anything you cannot ground in a real paper.

How does fine-tuning on natural language inference affect fallacy susceptibility?

Sources 10 notes

Next inquiring lines