TOPIC

Natural Language Inference

20 synthesis notes · 80 source papers
Does ordering training data by rarity actually improve language models?

Can sorting rare sentences before common ones during fine-tuning help LLMs learn more effectively? This challenges the intuition that models should see easy examples first.

Does fine-tuning on NLI teach inference or amplify shortcuts?

When LLMs are fine-tuned on natural language inference datasets, do they learn genuine reasoning abilities or become better at exploiting statistical patterns in the training data? Understanding this distinction matters for assessing model capabilities.

Does word frequency correlate with semantic abstraction?

Explores whether LLMs' preference for high-frequency language also pulls them toward more abstract, general meanings—and whether this shapes how they handle expert knowledge.

Do language models really understand meaning or just surface frequency?

Explores whether LLMs comprehend semantic meaning independently of textual frequency, or whether high-frequency paraphrases systematically outperform rare ones even when meaning is identical across math, translation, and reasoning tasks.

Does high-frequency text homogenize user input before generation?

Does Adam's Law reveal how LLMs flatten distinctive user voices at the parsing stage, not just in output? This matters because it could explain why model accuracy and generic responses emerge from the same mechanism.

Do LLMs predict entailment based on what they memorized?

Explores whether language models make entailment decisions by recognizing memorized facts about the hypothesis rather than reasoning through the logical relationship between premise and hypothesis.

Why do language models avoid correcting false user claims?

Explores whether LLM grounding failures stem from missing knowledge or from conversational dynamics. Examines whether models use face-saving strategies similar to humans when disagreement is needed.

Why do language models fail confidently in specialized domains?

LLMs perform poorly on clinical and biomedical inference tasks while remaining overconfident in their wrong answers. Do standard benchmarks hide this fragility, and can prompting techniques fix it?

Why do LLM persona prompts produce inconsistent outputs across runs?

Can language models reliably simulate different social perspectives through persona prompting, or does their run-to-run variance indicate they lack stable group-specific knowledge? This matters for whether LLMs can approximate human disagreement in annotation tasks.

Can large language models translate natural language to logic faithfully?

This explores whether LLMs can convert natural language statements into formal logical representations without losing meaning. It matters because faithful translation is essential for any AI system that reasons formally or verifies specifications.

Why do language models accept false assumptions they know are wrong?

Explores why LLMs fail to reject false presuppositions embedded in questions even when they possess correct knowledge about the topic. This matters because it reveals a grounding failure distinct from knowledge deficits.

Why do LLMs fail at simple deductive reasoning?

LLMs excel at complex multi-hop reasoning across sentences but struggle with trivial deductions humans find obvious. What explains this counterintuitive reversal in capability?

Why do language models struggle with questions containing false assumptions?

Do LLMs reliably detect and reject questions built on false premises? The (QA)² benchmark tests this directly, measuring whether models can identify problematic assumptions embedded in naturally plausible questions.

Why do semantically identical prompts produce different LLM outputs?

Explores why paraphrases with the same meaning yield different model outputs. This matters because it reveals what LLMs actually respond to during inference—and whether prompt engineering is optimizing meaning or something else.

Why do embedding contexts confuse LLM entailment predictions?

Can language models distinguish between contexts that preserve versus cancel entailments? The study explores whether LLMs systematically fail to apply the semantic rules governing presupposition triggers and non-factive verbs.

Why are presuppositions more persuasive than direct assertions?

Explores why presenting information as shared background rather than as a claim makes it more persuasive to audiences. This matters because it reveals how language structure itself can bypass critical evaluation.

Do language models miss presuppositions that arise from context?

Presuppositions come from two sources: fixed word meanings and conversational dynamics. Can LLMs that learn trigger patterns detect presuppositions that emerge from discourse accommodation rather than lexical items?

Does projection strength vary by context or by word type?

Standard accounts treat presupposition projection as categorical, but do English expressions actually project uniformly? This question explores whether context and discourse role determine how strongly content survives embedding.

Do language models and humans respond to word frequency the same way?

Both LLMs and humans show stronger responses to high-frequency words. This raises a puzzle: if models mirror human neural patterns, what actually makes them different from human language processing?

Why do language models agree with false claims they know are wrong?

Explores whether LLM errors come from knowledge gaps or from learned social behaviors. Understanding the root cause has implications for how we train and fix these systems.

Source papers 80

The arXiv papers behind this sub-topic. Links may take you off-site to arxiv.org.