Why do language models struggle with questions containing false assumptions?
Do LLMs reliably detect and reject questions built on false premises? The (QA)2 benchmark tests this directly, measuring whether models can identify problematic assumptions embedded in naturally plausible questions.
The (QA)2 benchmark (Question Answering with Questionable Assumptions) evaluates models on naturally occurring search engine queries, which may or may not contain false or unverifiable assumptions. In zero-shot settings, models scored roughly half as well on questions with questionable assumptions as they did on valid questions. The best model (text-davinci-003 with in-context demonstrations) reached 56% human-judged acceptability end-to-end.
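The "roughly half" comparison is a ratio of two scores, which can be sketched as below. The function names and the example scores are hypothetical and chosen only to illustrate the arithmetic; they are not results reported by the benchmark.

```python
def relative_performance(valid_score: float, questionable_score: float) -> float:
    """Score on questionable-assumption questions as a fraction of the score on valid questions."""
    if valid_score <= 0:
        raise ValueError("valid_score must be positive")
    return questionable_score / valid_score


def relative_drop(valid_score: float, questionable_score: float) -> float:
    """Fraction of performance lost when moving from valid to questionable-assumption questions."""
    return 1.0 - relative_performance(valid_score, questionable_score)


# Placeholder scores for illustration only (assumption, not benchmark data).
valid, questionable = 0.60, 0.30
print(f"relative drop: {relative_drop(valid, questionable):.0%}")
```

Reporting the drop relative to the valid-question score, rather than as an absolute difference, keeps the gap comparable across models whose baseline accuracy differs.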
The key challenge: questions with false assumptions "in the wild often do not stand out as bad questions." A question like "When did Marie Curie discover Uranium?" requires topical expertise to detect the false assumption. In contrast, artificial examples ("Which linguist invented the lightbulb?") flag themselves immediately. Real questionable assumptions are embedded in naturally plausible-sounding questions.
Detection subtasks: accuracy on binary detection of questionable assumptions (64%) and on assumption verification (72%) was higher than end-to-end QA acceptability (56%), suggesting that even when models identify the false assumption, generating an appropriate response remains difficult. The response must simultaneously detect the false presupposition, signal its falsity, correct it if possible, and then answer the actual question or explain why it can't be answered.
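The four requirements on a response can be written down as a simple checklist. The sketch below is a hypothetical annotation schema, not the benchmark's own rubric: the class name, field names, and the `missing_components` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ResponseAnnotation:
    """Hypothetical annotation of one response to a question with a questionable assumption."""
    detects_assumption: bool    # the response notices the questionable presupposition
    signals_falsity: bool       # the response tells the user the presupposition is false
    correction: Optional[str]   # the corrected fact, or None if none was given
    addresses_question: bool    # answers the actual question or explains why it cannot be answered


def missing_components(a: ResponseAnnotation, correction_possible: bool = True) -> List[str]:
    """Return the requirements the response fails to meet, in the order listed in the text."""
    missing = []
    if not a.detects_assumption:
        missing.append("detect")
    if not a.signals_falsity:
        missing.append("signal")
    # A correction is only required when one is possible.
    if correction_possible and not a.correction:
        missing.append("correct")
    if not a.addresses_question:
        missing.append("answer")
    return missing


# A response that spots the problem internally but never tells the user or repairs the question.
detection_only = ResponseAnnotation(True, False, None, False)
print(missing_components(detection_only))
```

Under this schema, a response can pass the detection check and still be incomplete on the other three, which is one way to read the gap between subtask accuracy and end-to-end acceptability.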
This quantifies the performance gap that the note "Why do language models accept false assumptions they know are wrong?" identifies qualitatively. The ~50% performance drop is measurable, systematic, and not solved by scale: the text-davinci series improved dramatically over previous models, but the gap persists.
Inquiring lines that use this note as a source 16
This note is a source for these synthesized inquiries.
- What makes factual verification difficult in inter-model debate?
- Why do LLMs fail to actively reject false presuppositions in conversation?
- Can fact-checking systems use LLMs reliably if models abandon correct positions under pressure?
- Why do true and false LLM outputs use the same mechanism?
- Why can LLMs identify argument structure but not check warrants?
- Why do LLMs struggle with negation and exception handling?
- How do LLMs handle false presuppositions embedded in user questions?
- Can models detect false presuppositions when they actually possess the knowledge?
- Why are false presuppositions harder to spot when they sound plausible?
- What makes correcting a false assumption harder than just detecting it?
- Why do benchmark tests fail to detect LLM comprehension gaps?
- How does the Question Under Discussion shape what counts as presupposed?
- How do structured prompts force LLMs to check for contradictions in evidence?
- Why do models detect false assumptions but still fail to correct them appropriately?
- Can reasoning models reject ill-posed questions or do they overthink?
- What makes an argument fallacious according to formal linguistic criteria?
Related concepts in this collection 3
- Why do language models accept false assumptions they know are wrong?
  Explores why LLMs fail to reject false presuppositions embedded in questions even when they possess correct knowledge about the topic. This matters because it reveals a grounding failure distinct from knowledge deficits.
  same failure domain; (QA)2 provides the performance quantification
- Why do speakers need to actively calibrate shared reference?
  Explores whether using the same words guarantees speakers mean the same thing. Investigates how referential grounding differs across people and what collaborative work is needed to establish true understanding.
  questionable assumption handling requires exactly this: detecting when the questioner's presuppositions diverge from fact
- Why are presuppositions more persuasive than direct assertions?
  Explores why presenting information as shared background rather than as a claim makes it more persuasive to audiences. This matters because it reveals how language structure itself can bypass critical evaluation.
  false presuppositions embedded in plausible-sounding questions are especially difficult to detect because they carry the persuasive force of backgrounded claims
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey
- Can LLMs Ground when they (Don't) Know: A Study on Direct and Loaded Political Questions
- Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models
- AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions
- Less is More: Recursive Reasoning with Tiny Networks
- How susceptible are LLMs to Logical Fallacies?
- LLMs Struggle to Reject False Presuppositions when Misinformation Stakes are High
- Critical-Questions-of-Thought: Steering LLM reasoning with Argumentative Querying
Original note title
llms underperform by approximately 50% on questions with false assumptions compared to valid questions