SYNTHESIS NOTE

Why do language models struggle with questions containing false assumptions?

Do LLMs reliably detect and reject questions built on false premises? The (QA)2 benchmark tests this directly, measuring whether models can identify problematic assumptions embedded in naturally plausible questions.

Synthesis note · 2026-02-21 · sourced from Natural Language Inference

The (QA)2 benchmark (Question Answering with Questionable Assumptions) evaluates models on naturally occurring search engine queries, some of which contain false or unverifiable assumptions. In zero-shot settings, models scored roughly half as well on questions with questionable assumptions as on valid questions. The best model (text-davinci-003 with in-context demonstrations) reached 56% human-judged acceptability end-to-end.

The key challenge: questions with false assumptions "in the wild often do not stand out as bad questions." A question like "When did Marie Curie discover Uranium?" requires topical expertise to detect the false assumption. In contrast, artificial examples ("Which linguist invented the lightbulb?") flag themselves immediately. Real questionable assumptions are embedded in naturally plausible-sounding questions.

Detection subtasks: scores on binary detection of questionable assumptions (64% accuracy) and on assumption verification (72%) were higher than on end-to-end QA (56%), suggesting that even when models identify the false assumption, generating an appropriate response remains difficult. A good response has to do four things: detect the false presupposition, signal its falsity, correct it if possible, and then answer the actual question or explain why it can't be answered.
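
The four response requirements listed above can be sketched as a small composition step. This is an illustrative sketch only, not the benchmark's method: `AssumptionCheck` and `compose_response` are hypothetical names, and the detection and correction fields are assumed to come from some upstream component that is not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AssumptionCheck:
    """Hypothetical result of checking one assumption in a question."""
    assumption: str                    # presupposition extracted from the question
    is_questionable: bool              # step 1: detection outcome
    correction: Optional[str] = None   # step 3: corrected fact, if one is known


def compose_response(check: AssumptionCheck, answer: Optional[str]) -> str:
    """Assemble a reply covering detection, signalling, correction, and answering."""
    if not check.is_questionable:
        # Valid question: answer directly, or admit that no answer is available.
        return answer if answer is not None else "I don't have an answer to that."

    # Step 2: signal that the presupposition is false.
    parts = [f"The question assumes that {check.assumption}, which is not accurate."]
    # Step 3: correct the presupposition when a correction is available.
    if check.correction is not None:
        parts.append(check.correction)
    # Step 4: answer the intended question, or explain why it cannot be answered.
    if answer is not None:
        parts.append(answer)
    else:
        parts.append("As asked, the question cannot be answered.")
    return " ".join(parts)


# Illustrative usage with the Marie Curie example from this note.
check = AssumptionCheck(
    assumption="Marie Curie discovered uranium",
    is_questionable=True,
    correction="Uranium was identified by Martin Heinrich Klaproth in 1789.",
)
print(compose_response(check, answer="Curie is known for discovering polonium and radium."))
```

The sketch makes the note's point concrete: detection sets only one field, while an acceptable end-to-end response needs all four pieces to be present and consistent with each other.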

This quantifies the performance gap that the related note "Why do language models accept false assumptions they know are wrong?" identifies qualitatively. The roughly 50% performance drop is measurable and systematic, and scale alone does not close it: the text-davinci series improved dramatically over previous models, but the gap persists.

Original note title

llms underperform by approximately 50% on questions with false assumptions compared to valid questions