TOPIC

Natural Language Inference

20 synthesis notes · 80 source papers

Does ordering training data by rarity actually improve language models?

Can sorting rare sentences before common ones during fine-tuning help LLMs learn more effectively? This challenges the intuition that models should see easy examples first.

Does fine-tuning on NLI teach inference or amplify shortcuts?

When LLMs are fine-tuned on natural language inference datasets, do they learn genuine reasoning abilities or become better at exploiting statistical patterns in the training data? Understanding this distinction matters for assessing model capabilities.

Does word frequency correlate with semantic abstraction?

Explores whether LLMs' preference for high-frequency language also pulls them toward more abstract, general meanings—and whether this shapes how they handle expert knowledge.

Do language models really understand meaning or just surface frequency?

Explores whether LLMs comprehend semantic meaning independently of textual frequency, or whether high-frequency paraphrases systematically outperform rare ones even when meaning is identical across math, translation, and reasoning tasks.

Does high-frequency text homogenize user input before generation?

Does Adam's Law reveal how LLMs flatten distinctive user voices at the parsing stage, not just in output? This matters because it could explain why model accuracy and generic responses emerge from the same mechanism.

Do LLMs predict entailment based on what they memorized?

Explores whether language models make entailment decisions by recognizing memorized facts about the hypothesis rather than reasoning through the logical relationship between premise and hypothesis.

Why do language models avoid correcting false user claims?

Explores whether LLM grounding failures stem from missing knowledge or from conversational dynamics. Examines whether models use face-saving strategies similar to humans when disagreement is needed.

Why do language models fail confidently in specialized domains?

LLMs perform poorly on clinical and biomedical inference tasks while remaining overconfident in their wrong answers. Do standard benchmarks hide this fragility, and can prompting techniques fix it?

Why do LLM persona prompts produce inconsistent outputs across runs?

Can language models reliably simulate different social perspectives through persona prompting, or does their run-to-run variance indicate they lack stable group-specific knowledge? This matters for whether LLMs can approximate human disagreement in annotation tasks.

Can large language models translate natural language to logic faithfully?

This explores whether LLMs can convert natural language statements into formal logical representations without losing meaning. It matters because faithful translation is essential for any AI system that reasons formally or verifies specifications.

Why do language models accept false assumptions they know are wrong?

Explores why LLMs fail to reject false presuppositions embedded in questions even when they possess correct knowledge about the topic. This matters because it reveals a grounding failure distinct from knowledge deficits.

Why do LLMs fail at simple deductive reasoning?

LLMs excel at complex multi-hop reasoning across sentences but struggle with trivial deductions humans find obvious. What explains this counterintuitive reversal in capability?

Why do language models struggle with questions containing false assumptions?

Do LLMs reliably detect and reject questions built on false premises? The (QA)2 benchmark tests this directly, measuring whether models can identify problematic assumptions embedded in naturally plausible questions.

Why do semantically identical prompts produce different LLM outputs?

Explores why paraphrases with the same meaning yield different model outputs. This matters because it reveals what LLMs actually respond to during inference—and whether prompt engineering is optimizing meaning or something else.

Why do embedding contexts confuse LLM entailment predictions?

Can language models distinguish between contexts that preserve versus cancel entailments? The study explores whether LLMs systematically fail to apply the semantic rules governing presupposition triggers and non-factive verbs.

Why are presuppositions more persuasive than direct assertions?

Explores why presenting information as shared background rather than as a claim makes it more persuasive to audiences. This matters because it reveals how language structure itself can bypass critical evaluation.

Do language models miss presuppositions that arise from context?

Presuppositions come from two sources: fixed word meanings and conversational dynamics. Can LLMs that learn trigger patterns detect presuppositions that emerge from discourse accommodation rather than lexical items?

Does projection strength vary by context or by word type?

Standard accounts treat presupposition projection as categorical, but do English expressions actually project uniformly? This question explores whether context and discourse role determine how strongly content survives embedding.

Do language models and humans respond to word frequency the same way?

Both LLMs and humans show stronger responses to high-frequency words. This raises a puzzle: if models mirror human neural patterns, what actually makes them different from human language processing?

Why do language models agree with false claims they know are wrong?

Explores whether LLM errors come from knowledge gaps or from learned social behaviors. Understanding the root cause has implications for how we train and fix these systems.

Source papers 80

The arXiv papers behind this sub-topic. Links may take you off-site to arxiv.org.

(QA)2: Question Answering with Questionable Assumptions
For instance, the question When did Marie Curie discover Uranium? cannot be answered as a typical when question without addressing the false assumption Marie Curie discovered Uranium. In this work, we…
A Hybrid Intelligence Method for Argument Mining
Large-scale survey tools enable the collection of citizen feedback in opinion corpora. Extracting the key arguments from a large and noisy set of opinions helps in understanding the opinions quickly a…
A Robustness Evaluation Framework for Argument Mining
Standard practice for evaluating the performance of machine learning models for argument mining is to report different metrics such as accuracy or F1. However, little is usually known about the model’…
A ripple in time: a discontinuity in American history
In this technical note we suggest a novel approach to discover temporal (related and unrelated to language dilation) and personality (authorship attribution) aspects in historical datasets. W…
Adam's Law: Textual Frequency Law on Large Language Models
While textual frequency has been validated as relevant to human cognition in reading speed, its relatedness to Large Language Models (LLMs) is seldom studied. We propose a novel research direction in …
Are Customers Lying to Your Chatbot?
Dishonesty is far from a new phenomenon. But as chatbots, online forms, and other digital interfaces grow more and more common across a wide range of customer service applications, bending the truth t…
Attention, Intentions, And The Structure Of Discourse
In this paper we explore a new theory of discourse structure that stresses the role of purpose and processing in discourse. In this theory, discourse structure is composed of three separate but interr…
AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts
The remarkable success of pretrained language models has motivated the study of what kinds of knowledge these models learn during pretraining. Reformulating tasks as fill-in-the-blanks problems (e.g., …
Automatic Extraction of Metaphoric Analogies from Literary Texts: Task Formulation, Dataset Construction, and Evaluation
Extracting metaphors and analogies from free text requires high-level reasoning abilities such as abstraction and language understanding. Our study focuses on the extraction of the concepts that form …
Branch-Solve-Merge Improves Large Language Model Evaluation and Generation
Large Language Models (LLMs) are frequently used for multi-faceted language generation and evaluation tasks that involve satisfying intricate user constraints or taking into account multiple aspects a…
Can LLMs Ground when they (Don't) Know: A Study on Direct and Loaded Political Questions
Communication among humans relies on conversational grounding, allowing interlocutors to reach mutual understanding even when they do not have perfect knowledge and must resolve discrepancies in each …
Can LLMs assist with Ambiguity? A Quantitative Evaluation of various Large Language Models on Word Sense Disambiguation
Ambiguous words are often found in modern digital communications. Lexical ambiguity challenges traditional Word Sense Disambiguation (WSD) methods, due to limited data. Consequently, the efficiency of…
Chain of Stance: Stance Detection with Large Language Models
Stance detection is an active task in natural language processing (NLP) that aims to identify the author’s stance towards a particular target within a text. Given the remarkable language understanding…
Collaborative Rational Speech Act: Pragmatic Reasoning for Multi-Turn Dialog
In this paper, we introduce Collaborative Rational Speech Act (CRSA), an information-theoretic (IT) extension of RSA that models multi-turn dialog by optimizing a gain function adapted from rate-disto…
Comparing emotion feature extraction approaches for predicting depression and anxiety
For example, pride may be impacted by depression in a unique way. Gruber et al. (2011) showed that pride, a positive emotion relating to the self, is inversely correlated with depression, which is oft…
Complex Logical Instruction Generation
Instruction following has catalyzed the recent era of Large Language Models (LLMs) and is the foundational skill underpinning more advanced capabilities such as reasoning and agentic behaviors. As tas…
Computational structuralism: Toward a formal theory of meaning in the age of digital intelligence
The discovery that “next-token predictor” language models can fluently produce text has important but underappreciated theoretical implications. Most notably, their success demonstrates that fully rel…
Conversational Semantic Parsing for Dialog State Tracking
We consider a new perspective on dialog state tracking (DST), the task of estimating a user’s goal through the course of a dialog. By formulating DST as a semantic parsing task over hierarchical repre…
DEAM: Dialogue Coherence Evaluation using AMR-based Semantic Manipulations
Those models take a contrastive learning approach, where they build binary classifiers to differentiate positive, or coherent examples from negative, or incoherent dialogues. Those classifiers are usu…
Detecting Cognitive Distortions from Patient-Therapist Interactions
An important part of Cognitive Behavioral Therapy (CBT) is to recognize and restructure certain negative thinking patterns that are also known as cognitive distortions. This project aims to detect the…
Development and validation of large language model rating scales for automatically transcribed psychological therapy sessions
Rating scales have shaped psychological research, but are resource-intensive and can burden participants. Large Language Models (LLMs) offer a tool to assess latent constructs in text. This study intr…
Discourse-Level Representations can Improve Prediction of Degree of Anxiety
The primary clinical manifestation of anxiety is worry associated cognitive distortions, which are likely expressed at the discourse-level of semantics. discourse patterns of causal explanations, amo…
Discursive Socratic Questioning: Evaluating the Faithfulness of Language Models’ Understanding of Discourse Relations
While large language models have significantly enhanced the effectiveness of discourse relation classifications, it remains unclear whether their comprehension is faithful and reliable. We provide DIS…
DiscussLLM: Teaching Large Language Models When to Speak
Large Language Models (LLMs) have demonstrated remarkable capabilities in understanding and generating human-like text, yet they largely operate as reactive agents, responding only when directly promp…
Dissociating language and thought in large language models
Here, we evaluate LLMs using a distinction between formal linguistic competence—knowledge of linguistic rules and patterns—and functional linguistic competence—understanding and using language in the …
Empirical Study of Symmetrical Reasoning in Conversational Chatbots
This work explores the capability of conversational chatbots powered by large language models (LLMs) to understand and characterize predicate symmetry, a cognitive linguistic function tradi…
Enhancing Performance on Seen and Unseen Dialogue Scenarios using Retrieval-Augmented End-to-End Task-Oriented System
End-to-end task-oriented dialogue (TOD) systems have achieved promising performance by leveraging sophisticated natural language understanding and natural language generation capabilities of pre-train…
Evaluating the Efficacy of Interactive Language Therapy Based on LLM for High-Functioning Autistic Adolescent Psychological Counseling
Significant emphasis was placed on the development of prompts used to guide the Large Language Model (LLM). This process was intricate and involved multiple stages to ensure that the prompts were effec…
Event-Aware Sentiment Factors from LLM-Augmented Financial Tweets: A Transparent Framework for Interpretable Quant Trading
In this study, we wish to showcase the unique utility of large language models (LLMs) in financial semantic annotation and alpha signal discovery. Leveraging a corpus of company-related tweets, we use…
Explainable Compliance Detection with Multi-Hop Natural Language Inference on Assurance Case Structure
Ensuring complex systems meet regulations typically requires checking the validity of assurance cases through a claim-argument-evidence framework. Some challenges in this process include the complicat…
Explicit Inductive Inference using Large Language Models
However, McKenna et al. (2023a) have recently pointed out that LLMs are severely affected by an attestation bias when performing inference tasks. Given the question of whether premise P entails hypothe…
Exploring LLMs Applications in Law: A Literature Review on Current Legal NLP Approaches
The integration of Natural Language Processing (NLP) and AI into legal tasks is a natural progression, given the linguistic nature of law. This combination allows for more efficient and accurate analy…
Exploring the Potential of ChatGPT on Sentence Level Relations: A Focus on Temporal, Causal, and Discourse Relations
This paper aims to quantitatively evaluate the performance of ChatGPT, an interactive large language model, on inter-sentential relations such as temporal relations, causal relations, and discourse re…
Faithful and Robust LLM-Driven Theorem Proving for NLI Explanations
Natural language explanations play a fundamental role in Natural Language Inference (NLI) by revealing how premises logically entail hypotheses. Recent work has shown that the interaction of large lan…
How Projective is Projective Content? Gradience in Projectivity and At-issueness
Projective content is utterance content that a speaker may be taken to be committed to even when the expression associated with the content occurs embedded under an entailment-canceling operator (e.g.…
Human-like Category Learning by Injecting Ecological Priors from Large Language Models into Neural Networks
Large language models can generate cognitive tasks, specifically category learning tasks, that match the statistics of real-world tasks, deriving rational agents adapted to these tasks using the frame…
Inspecting and Editing Knowledge Representations in Language Models
Neural language models (LMs) represent facts about the world described by text. Sometimes these facts derive from training data (in most LMs, a representation of the word banana encodes the fact that …
Is Sarcasm Detection A Step-by-Step Reasoning Process in Large Language Models?
However, human sarcasm understanding is often considered an intuitive and holistic cognitive process, in which various linguistic, contextual, and emotional cues are integrated to form a comprehensive…
LLM-based Rewriting of Inappropriate Argumentation using Reinforcement Learning from Machine Feedback
Ensuring that online discussions are civil and productive is a major challenge for social media platforms. Such platforms usually rely both on users and on automated detection tools to flag inappropri…
LLMs Struggle to Reject False Presuppositions when Misinformation Stakes are High
These implicit assumptions, known as presuppositions, refer to background knowledge or shared beliefs assumed to be part of the common ground between interlocutors (Stalnaker, 1973). Presuppositions a…
LLMs are Frequency Pattern Learners in Natural Language Inference
While fine-tuning LLMs on NLI corpora improves their inferential performance, the underlying mechanisms driving this improvement remain largely opaque. In this work, we conduct a series of experiments…
Large Language Models Can Infer Psychological Dispositions of Social Media Users
Large Language Models (LLMs) demonstrate increasingly human-like abilities across a wide variety of tasks. In this paper, we investigate whether LLMs like ChatGPT can accurately infer the psychologica…
Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions
Large Language Models (LLMs) have demonstrated remarkable capabilities in various NLP tasks. However, previous works have shown these models are sensitive towards prompt wording, and few-shot demonst…
Large Linguistic Models: Investigating LLMs' metalinguistic abilities
The performance of large language models (LLMs) has recently improved to the point where models can perform well on many language tasks. We show here that, for the first time, the models can al…
Large language models can segment narrative events similarly to humans
Humans perceive discrete events such as "restaurant visits" and "train rides" in their continuous experience. One important prerequisite for studying human event perception is the ability of researche…
Lexical Entrainment for Conversational Systems
Lexical entrainment (LE), a phenomenon in which speakers in human-human conversations tend to naturally and subconsciously align their lexical choices with those of their interlocutors, leading to mor…
Lost in Inference: Rediscovering the Role of Natural Language Inference for Large Language Models
In the recent past, a popular way of evaluating natural language understanding (NLU) was to consider a model’s ability to perform natural language inference (NLI) tasks. In this paper, we investigate…
MARS: A Multi-Agent Framework Incorporating Socratic Guidance for Automated Prompt Optimization
The basic question-answering format of large language models involves inputting a prompt and receiving a response, and the quality of the prompt directly impacts the effectiveness of the response. Aut…
Man vs machine – Detecting deception in online reviews
This study focused on three main research objectives: analyzing the methods used to identify deceptive online consumer reviews, evaluating insights provided by multi-method automated approaches based …
Meanings are like Onions: a Layered Approach to Metaphor Processing
Metaphorical meaning is not a flat mapping between concepts, but a complex cognitive phenomenon that integrates multiple levels of interpretation. In this paper, we propose a stratified mode…
Mechanistic Indicators of Understanding in Large Language Models
Large language models (LLMs) are often portrayed as merely imitating linguistic patterns without genuine understanding. We argue that recent findings in mechanistic interpretability (MI), th…
Minds versus Machines: Rethinking Entailment Verification with Language Models
Leveraging a comprehensively curated entailment verification benchmark, we evaluate both human and LLM performance across various reasoning categories. Our benchmark includes datasets from three categ…
Neutralizing Bias in LLM Reasoning using Entailment Graphs
However, recent works show that LLMs still suffer from hallucinations in NLI due to attestation bias, where LLMs overly rely on propositional memory to build shortcuts. To solve the issue, we design a…
On the Conversational Basis of Some Presuppositions
The current literature on presupposition focuses almost exclusively on the projection problem: the question of how and why the presuppositions of atomic clauses are projected to complex sentences whic…
Persuasive presuppositions
A recurrent claim, coming from different approaches to pragmatics, argumentation theory and related disciplines, is that informative presuppositions have a special persuasive force. My aim in this pap…
Post-training for Efficient Communication via Convention Formation
Humans communicate with increasing efficiency in multi-turn interactions, by adapting their language and forming ad-hoc conventions. In contrast, prior work shows that LLMs do not naturally show this …
Presuppositions are more persuasive than assertions if addressees accommodate them: Experimental evidence for philosophical reasoning
Best practice and descriptive research claim that presuppositions, such as the “too” in “,” increase the persuasiveness of arguments. Surprisingly, there is hardly any causal evidence for this claim. …
Pretrained Language Models as Containers of the Discursive Knowledge
Discourses can be treated as instances of knowledge. The dynamic space in which the trajectories of these discourses are described can be regarded as a model of knowledge. Such a space is ca…
Pushing the Limits of Rule Reasoning in Transformers through Natural Language Satisfiability
Investigating the reasoning abilities of transformer models, and discovering new challenging tasks for them, has been a topic of much interest. Recent studies have found these models to be surprisingl…
Real-time News Story Identification
To improve the reading experience, many news sites organize news into topical collections, called stories. In this work, we present an approach for implementing real-time story identification for a ne…
Rethinking STS and NLI in Large Language Models
Recent years have seen the rise of large language models (LLMs), where practitioners use task-specific prompts; this was shown to be effective for a variety of tasks. However, when applied to semanti…
Rhetoric, Logic, and Dialectic: Advancing Theory-based Argument Quality Assessment in Natural Language Processing
Though preceding work in computational argument quality (AQ) mostly focuses on assessing overall AQ, researchers agree that writers would benefit from feedback targeting individual dimensions of argum…
SciTopic: Enhancing Topic Discovery in Scientific Literature through Advanced LLM
Topic discovery in scientific literature provides valuable insights for researchers to identify emerging trends and explore new avenues for investigation, facilitating easier scientific infor…
Semantic Change Characterization with LLMs using Rhetorics
Languages continually evolve in response to societal events, resulting in new terms and shifts in meanings. These changes have significant implications for computer applications, including automatic t…
Semantic Parsing for Task Oriented Dialog using Hierarchical Representations
Task oriented dialog systems typically first parse user utterances to semantic frames comprised of intents and s…
Semantic Structure in Large Language Model Embeddings
Psychological research consistently finds that human ratings of words across diverse semantic scales can be reduced to a low-dimensional form with relatively little information loss. We find that the …
Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds
We evaluate LLMs’ language understanding capacities on simple inference tasks that most humans find trivial. Specifically, we target (i) grammatically-specified entailments, (ii) premises with evident…
Sources of Hallucination by Large Language Models on Inference Tasks
We establish two biases originating from pretraining which predict much of their behavior, and show that these are major sources of hallucination in generative LLMs. First, memorization at the level o…
Stance Detection on Social Media with Fine-Tuned Large Language Models
The implementation of prompting strategies represents a significant departure from traditional NLP model training methods. By employing these strategies, LLMs can generate predictions without the exte…
Task-Oriented Dialogue with In-Context Learning
We describe a system for building task oriented dialogue systems combining the in context learning abilities of large language models (LLMs) with the deterministic execution of business logic. LLMs ar…
TaskLAMA: Probing the Complex Task Understanding of Language Models
Structured Complex Task Decomposition (SCTD) is the problem of breaking down a complex real-world task (such as planning a wedding) into a directed acyclic graph over individual steps that contribute…
The Levers of Political Persuasion with Conversational AI
There are widespread fears that conversational AI could soon exert unprecedented influence over human beliefs. Here, in three large-scale experiments (N=76,977), we deployed 19 LLMs—including some pos…
The social component of the projection behavior of clausal complement contents
Some accounts of presupposition projection predict that content’s consistency with the Common Ground influences whether it projects (e.g., Heim 1983; Gazdar 1979a,b). I conducted an experime…
Transformer-based cynical expression detection in a corpus of Spanish YouTube reviews
Consumers of services and products exhibit a wide range of behaviors on social networks when they are dissatisfied. In this paper, we consider three types of cynical expressions – negative feelings, s…
Turiya at DialAM-2024: Inference Anchoring Theory Based LLM Parsers
Representing discourse as argument graphs facilitates robust analysis. Although computational frameworks for constructing graphs from monologues exist, there is a lack of frameworks for parsing dialog…
Using Computational Models to Test Syntactic Learnability
We study the learnability of English filler-gap dependencies and the “island” constraints on them by assessing the generalizations made by autoregressive (incremental) language models that use deep …
Using Natural Language for Reward Shaping in Reinforcement Learning
Using arbitrary natural language statements within reinforcement learning presents several challenges. First, a mapping between language and objects/actions must implicitly or explicitly be learned, a…
Verbal lie detection using Large Language Models
When producing deceptive narratives, liars employ verbal strategies to create false beliefs in the interacting partners and are thus involved in a specific and temporary psychological and emotional st…
Will I Sound Like Me? Improving Persona Consistency in Dialogues through Pragmatic Self-Consciousness
We explore the task of improving persona consistency of dialogue agents. Recent models tackling consistency often train with additional Natural Language Inference (NLI) labels or attach trained extra …
Word Meanings in Transformer Language Models
We investigate how word meanings are represented in the transformer language models. Specifically, we focus on whether transformer models employ something analogous to a lexical store - where each wor…