Why do embedding contexts confuse LLM entailment predictions?
Can language models distinguish between contexts that preserve versus cancel entailments? The study explores whether LLMs systematically fail to apply the semantic rules governing presupposition triggers and non-factive verbs.
"Simple Linguistic Inferences of LLMs" targets inferences humans find trivial — grammatically-specified entailments ("You've eaten all my apples" entails "Someone ate something"), evidential adverbs of uncertainty ("allegedly" cancels the entailment of the clause), and monotonicity entailments (specific→general). LLMs show moderate-to-low performance on all three.
But the more revealing finding is what happens when the premise is embedded in grammatical contexts. Two types of embedding contexts should have opposite effects:
- Presupposition triggers (factive verbs: "realized that", "regret that"; temporal clauses: "before X"): embedding under these should not change the original entailment relations — the premise's entailments are preserved because presuppositions project through these contexts.
- Non-factive verbs (believe, imagine, suspect, feel): embedding under these should cancel entailments — "I suspect a balloon hit a light post" no longer entails "something hit a light post."
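The two embedding types above can be turned into a small probe set. The following is a minimal sketch, not the paper's actual test construction: the verbs come from this note, while the sentence frames, the `NLIItem` type, the `embed_premise` helper, and the label names `entailment` / `not_entailment` are assumptions made for illustration. Temporal-clause triggers ("before X") are left out because they need a different frame.

```python
from dataclasses import dataclass

# Hypothetical sentence frames; only the verbs themselves come from the note.
PRESUPPOSITION_TRIGGERS = ["I realized that {p}", "I regret that {p}"]
NON_FACTIVES = ["I believe {p}", "I imagine {p}", "I suspect {p}", "I feel {p}"]


@dataclass(frozen=True)
class NLIItem:
    premise: str
    hypothesis: str
    context: str  # "plain", "trigger", or "non_factive"
    gold: str     # assumed label set: "entailment" or "not_entailment"


def embed_premise(premise: str, hypothesis: str) -> list[NLIItem]:
    """Build embedded variants of a premise that entails the hypothesis
    when it stands alone.

    Triggers keep the gold label; non-factives flip it to not_entailment.
    """
    # Lowercase the first letter and drop the final period so the clause
    # fits inside a frame. This is naive: it would break a premise that
    # starts with a proper noun.
    clause = premise[0].lower() + premise[1:].rstrip(".")
    items = [NLIItem(premise, hypothesis, "plain", "entailment")]
    for frame in PRESUPPOSITION_TRIGGERS:
        items.append(
            NLIItem(frame.format(p=clause) + ".", hypothesis, "trigger", "entailment")
        )
    for frame in NON_FACTIVES:
        items.append(
            NLIItem(frame.format(p=clause) + ".", hypothesis, "non_factive", "not_entailment")
        )
    return items


items = embed_premise("A balloon hit a light post.", "Something hit a light post.")
for item in items:
    print(f"{item.context:12} {item.gold:15} {item.premise}")
```

Each item pairs the same hypothesis with a differently packaged premise, so any change in a model's answer across items can be attributed to the embedding context alone.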
The LLMs tested fail to make this discrimination. ChatGPT in regular prompting mode treats both presupposition triggers and non-factives as hints toward entailment. In chain-of-thought mode, it treats both as hints against entailment. The embedding context overwhelms the semantics of the embedded content, acting as a "blind" that masks the relevant inferential relationships.
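One way to make the "blind" pattern visible is to compare how often a model answers "entailment" under each context type. The sketch below assumes predictions have already been collected as `(context, predicted_label)` pairs. The function names, the label strings, and the `margin` threshold are hypothetical, and the toy predictions are invented for illustration; they are not results from the paper.

```python
from collections import defaultdict


def entailment_rate_by_context(records):
    """records: iterable of (context, predicted_label) pairs.

    Returns {context: fraction of items predicted as 'entailment'}.
    """
    counts = defaultdict(lambda: [0, 0])  # context -> [entailment predictions, total]
    for context, predicted in records:
        counts[context][0] += predicted == "entailment"
        counts[context][1] += 1
    return {context: hits / total for context, (hits, total) in counts.items()}


def discriminates(rates, margin=0.5):
    """Hypothetical criterion: a model that tracks the semantics should say
    'entailment' much more often under triggers than under non-factives."""
    return rates["trigger"] - rates["non_factive"] >= margin


# Invented toy predictions: every embedded premise is labeled 'entailment',
# the direction this note attributes to regular prompting.
blinded = [
    ("trigger", "entailment"),
    ("trigger", "entailment"),
    ("non_factive", "entailment"),
    ("non_factive", "entailment"),
]

# Invented toy predictions for a model that does separate the two contexts.
sensitive = [
    ("trigger", "entailment"),
    ("non_factive", "not_entailment"),
]

blinded_rates = entailment_rate_by_context(blinded)
sensitive_rates = entailment_rate_by_context(sensitive)
print(blinded_rates, discriminates(blinded_rates))
print(sensitive_rates, discriminates(sensitive_rates))
```

The chain-of-thought direction described above would show up the same way with both rates near zero: the gap between the two context types, not the absolute rate, is what signals whether the embedding verb's semantics is being computed.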
This is a different kind of failure from general reasoning difficulty: it is a structural failure in which syntactic packaging overrides semantic content. The model responds to the embedding verb (factive vs. non-factive) as a surface cue rather than computing its effect on the entailment relation. This is precisely the pattern that the related note "Can models pass tests while missing the actual grammar?" predicts: surface cues substituting for structural analysis.
The persistence of this pattern across multiple prompts and LLMs indicates that the failure is systematic rather than incidental: "a systematic issue" in the paper's words.
Inquiring lines that use this note as a source (38)
This note is a source for these synthesized inquiries.
- Can LLMs propose pivots that change what counts as background context?
- How do training-data priors influence model defaults when context is ambiguous?
- Can LLMs infer situational context the way humans do pragmatically?
- Why does removing language from its context destroy what makes it work?
- How do fixed pragmatic templates prevent models from understanding context?
- Is interpretive multiplicity a bug in language or a feature?
- How do entailment checks prevent synthetic data from degrading retrieval corpora?
- What causes LLMs to ignore unstated constraints they know about?
- How does removing a spurious cue change LLM performance?
- Why does semantic decoupling specifically break LLM reasoning abilities?
- Do LLMs compute scalar implicature differently across conversational contexts?
- Can frame semantics explain why context matters more than word similarity?
- Why do explicit discourse connectives help LLMs but implicit relations cause failures?
- Why do LLMs fail to actively reject false presuppositions in conversation?
- How do embedding contexts like presupposition triggers affect LLM entailment reasoning?
- What distinguishes entity errors from relation errors in LLM output?
- Why can LLMs identify argument structure but not check warrants?
- Why do LLMs explain evidence accurately while missing its implications?
- Why do LLMs choose surface-order quantifier scope over contextually correct readings?
- Why do LLMs perform better on explicit discourse connectives than implicit relations?
- What specific linguistic features cause LLMs to fail at trivial entailment?
- How do LLMs handle false presuppositions embedded in user questions?
- Why do explicit discourse connectives work when implicit relations fail?
- Why are false presuppositions harder to spot when they sound plausible?
- How does frame selection differ from frame application in meaning-making?
- Can LLMs compute how presuppositions project through embedded clauses?
- How does the Question Under Discussion shape what counts as presupposed?
- Can presupposition projection strength vary by context in embeddings?
- Why do non-factive verbs and triggers both fool language models?
- Why do language models treat presupposition triggers as categorical patterns?
- Can the same predicate generate different projection strength in different contexts?
- What makes structural logic correlate so strongly with contextual consistency?
- How do structured prompts force LLMs to check for contradictions in evidence?
- How does bidirectional entailment distinguish semantic equivalence from token similarity?
- Why does context work differently in AI than in conventional software?
- Why does teacher forcing fail to capture long-range dependencies?
- How do training associations override context information in language models?
- What semantic information is necessary to preserve for sound LLM reasoning?
Related concepts in this collection (3)
- Can models pass tests while missing the actual grammar?
  Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.
  same mechanism: surface context cues substituting for structural computation
- Does LLM grammatical performance decline with structural complexity?
  This explores whether LLMs fail uniformly at grammar or whether their failures follow a predictable pattern tied to input complexity. Understanding the relationship matters for deciding when LLM annotations are reliable.
  embedding contexts add structural complexity; this is another specific complexity type that causes systematic failure
- Why does ChatGPT fail at implicit discourse relations?
  ChatGPT excels when discourse connectives are present but drops to 24% accuracy without them. What does this gap reveal about how LLMs actually process meaning and logical relationships?
  parallel structure: surface markers (connectives, embedding verbs) override deeper semantic computation
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds
- Sources of Hallucination by Large Language Models on Inference Tasks
- Large Language Models are In-Context Semantic Reasoners rather than Symbolic Reasoners
- Explicit Inductive Inference using Large Language Models
- Can LLMs Ground when they (Don't) Know: A Study on Direct and Loaded Political Questions
- Reasoning Beyond Chain-of-Thought: A Latent Computational Mode in Large Language Models
- Neutralizing Bias in LLM Reasoning using Entailment Graphs
- LLMs Struggle to Reject False Presuppositions when Misinformation Stakes are High
Original note title
presupposition triggers and non-factive verbs are embedding blinds that systematically miscalibrate llm entailment predictions