Can language models recognize when text is deliberately ambiguous?
Explores whether LLMs can identify and handle multiple valid interpretations in a single phrase—a core human language skill that appears largely absent in current models despite their fluency on standard tasks.
AMBIENT (Liu et al. 2023, "We're Afraid Language Models Aren't Modeling Ambiguity") is presented as the first evaluation of pretrained LMs specifically on ambiguity recognition and disambiguation. It contains 1,645 linguist-annotated examples covering diverse ambiguity types: lexical, structural, scope, and others.
The findings are stark:
- Disambiguations generated by GPT-4 were rated correct by crowdworkers only 32% of the time
- Human reference disambiguations were rated correct 90% of the time
- Best finetuned multilabel NLI model predicts the exact label set for ambiguous instances in only 43.6% of cases
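The exact-label-set figure is a stricter metric than ordinary accuracy: a prediction counts only if it recovers every label the ambiguous example supports and nothing else. A minimal sketch of that scoring rule follows. The function name `exact_set_match` and the toy gold and predicted labels are assumptions made for illustration; they are not taken from the AMBIENT paper or its data.

```python
from typing import FrozenSet, Sequence

LabelSet = FrozenSet[str]


def exact_set_match(gold: Sequence[LabelSet], predicted: Sequence[LabelSet]) -> float:
    """Fraction of examples whose predicted label set equals the gold set exactly."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("no examples to score")
    hits = sum(1 for g, p in zip(gold, predicted) if g == p)
    return hits / len(gold)


# Hypothetical toy data: an ambiguous premise/hypothesis pair carries more than one NLI label.
gold = [
    frozenset({"entailment", "neutral"}),
    frozenset({"contradiction"}),
    frozenset({"entailment", "contradiction"}),
    frozenset({"neutral"}),
]
predicted = [
    frozenset({"entailment"}),                   # misses one reading: no credit
    frozenset({"contradiction"}),                # exact match
    frozenset({"entailment", "contradiction"}),  # exact match
    frozenset({"neutral", "entailment"}),        # adds a spurious reading: no credit
]

score = exact_set_match(gold, predicted)
print(f"exact set match: {score:.1%}")
```

Under this rule a model that always picks the single most likely label scores zero on every genuinely ambiguous example, which is why the metric is a useful probe of ambiguity recognition rather than of label plausibility.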
Ambiguity management is central to human language understanding. As communicators, we anticipate possible misunderstandings. As listeners, we ask clarifying questions, revise interpretations based on new information, and use contextual factors to select among multiple possible readings. This capacity appears largely absent in current LLMs despite their fluency on standard benchmarks.
The evaluation tests three distinct capabilities, and models fail at all three: generating relevant disambiguations, recognizing possible interpretations, and modeling different interpretations in continuation distributions. The failure is not isolated to one capability but systematic across the full range of ambiguity management.
As the related note "Do standard NLP benchmarks hide LLM ambiguity failures?" explains, this failure is normally invisible in standard evaluation because benchmark creators filter out ambiguous examples. The 32% figure is only visible because AMBIENT was designed to include what standard benchmarks exclude.
Augmented prompting can partially mitigate the problem. A systematic approach combining Chain-of-Thought prompting with a knowledge base of sense interpretations, Part-of-Speech tagging, aspect-based filtering, and few-shot examples produces "substantial improvement" on word sense disambiguation (WSD) tasks. However, the fundamental challenge persists for highly diverse ambiguous words (10+ distinct senses across noun and verb forms), where current architectures remain "not confident enough". The improvement comes from external scaffolding (KB, POS, examples), not from genuine semantic disambiguation competence. This reinforces the finding that LLMs handle explicit structure well but fail when multiple implicit interpretations must be managed simultaneously.
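To make the scaffolding concrete, here is a rough sketch of how such a prompt might be assembled: candidate senses are looked up in a knowledge base, narrowed by part of speech and an optional aspect tag, and placed after a few-shot example with a step-by-step instruction. Everything in it is an assumption for illustration (the `Sense` record, the toy `SENSE_KB` entries, the aspect tags, and the prompt wording); it is not the prompt or pipeline used in the cited work, and the POS tag is passed in by hand rather than produced by a tagger.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Sense:
    word: str
    pos: str     # coarse part-of-speech tag, e.g. "NOUN" or "VERB"
    aspect: str  # hypothetical topical tag used for filtering
    gloss: str


# Toy knowledge base; entries are invented for illustration.
SENSE_KB = [
    Sense("bank", "NOUN", "finance", "an institution that accepts deposits and lends money"),
    Sense("bank", "NOUN", "geography", "the sloping land beside a river"),
    Sense("bank", "VERB", "finance", "to deposit money in a bank"),
    Sense("bank", "VERB", "aviation", "to tilt an aircraft while turning"),
]

# One hypothetical few-shot example: (sentence, target word, correct gloss).
FEW_SHOT = [
    ("She sat on the bank and watched the water.", "bank", "the sloping land beside a river"),
]


def candidate_senses(word: str, pos: str, aspect: Optional[str] = None) -> List[Sense]:
    """Filter the KB by word and POS, then by aspect if that leaves any candidates."""
    senses = [s for s in SENSE_KB if s.word == word and s.pos == pos]
    if aspect is not None:
        narrowed = [s for s in senses if s.aspect == aspect]
        if narrowed:  # fall back to the POS-only list when the aspect matches nothing
            senses = narrowed
    return senses


def build_wsd_prompt(sentence: str, word: str, pos: str, aspect: Optional[str] = None) -> str:
    """Assemble few-shot examples, candidate senses, and a step-by-step instruction."""
    lines = ["Choose the sense of the target word that fits the sentence.", ""]
    for ex_sentence, ex_word, ex_gloss in FEW_SHOT:
        lines += [f"Sentence: {ex_sentence}", f"Target: {ex_word}", f"Answer: {ex_gloss}", ""]
    lines += [f"Sentence: {sentence}", f"Target: {word} ({pos})", "Candidate senses:"]
    senses = candidate_senses(word, pos, aspect)
    lines += [f"{i}. {s.gloss}" for i, s in enumerate(senses, start=1)]
    lines.append("Let's think step by step, then give the number of the best sense.")
    return "\n".join(lines)


prompt = build_wsd_prompt("The pilot had to bank sharply.", "bank", "VERB")
print(prompt)
```

Note where the work happens: the sense inventory, the POS filter, and the example all arrive from outside the model, which is the point the paragraph above makes about scaffolding versus competence.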
The literary analysis framing: poetry is controlled ambiguity, a deliberate multiplicity of meaning crafted so that several readings coexist productively. A poem that resolves to a single meaning has failed as a poem. The 32% disambiguation rate suggests that LLMs cannot reliably recognize the fundamental operation that makes poetry work. They cannot hold ambiguity open; they resolve it, and in resolving it, destroy it. This reframes the AMBIENT finding from a general limitation to a domain-killing one for literary work: the ability to manage ambiguity is not peripheral to literary analysis but central to it. As the related note "Can LLMs truly understand literary meaning or just mechanics?" argues, the ambiguity failure is one of four converging mechanisms behind the mechanics-meaning gap.
Inquiring lines that use this note as a source 85
This note is a source for these synthesized inquiries.
- Can LLMs infer situational context the way humans do pragmatically?
- Can AI detect sense-of-nonsense the way human readers do?
- How do readers selectively hold frame-related words in mind?
- Why does training data saliency distort how models judge meaning?
- Can language models adapt irony detection to specific communicative contexts?
- What happens when LLMs analyze literary irony that relies on understatement?
- Why do LLMs achieve only 24 percent accuracy on implicit discourse relations?
- What makes ambiguity recognition fundamentally important for poetry analysis?
- Why do NLP benchmarks systematically exclude ambiguous test cases from evaluation?
- How do politeness strategies depend on semantic ambiguity between literal and intended meaning?
- What percentage of natural language relies on plausible deniability through ambiguous phrasing?
- Can language systems learn when to ask for clarification instead of choosing one reading?
- What measurement artifacts emerge when annotators interpret the same question differently?
- How does semantic ambiguity differ from structural ambiguity in language?
- Why do NLP benchmarks exclude ambiguous instances from evaluation?
- Is interpretive multiplicity a bug in language or a feature?
- What other semantic relations benefit from explicit surface markers in text?
- How do humans decide which level of clarification to request?
- Can language models ground clarifications without vision and kinesthetic modalities?
- How should meaning spaces be systematically modeled across different applications?
- Can smaller open-source LLMs reliably detect agreement across unfamiliar topics?
- What role does entity salience play in detecting incoherence?
- Why do LLMs produce semantically acceptable but pragmatically disengaged responses?
- Should emotion systems preserve ambiguity instead of resolving it to one label?
- What semantic classifier design avoids lexical variation without genuine conceptual distinctness?
- Can alignment training prevent the clarification work users need?
- Do LLMs compute scalar implicature differently across conversational contexts?
- Can LLMs improve at metaphor if they handle decoupled semantics better?
- How does implicit meaning processing limit LLM pragmatic reasoning?
- How should designers measure and explain semantic uncertainty to users?
- Can semantic query expansion overcome vocabulary mismatch in corrupted text?
- Is paraphrase invariance a reliable assumption when deploying language models in production?
- How do humans detect which words belong to the same frame together?
- How can vague language serve both cooperative and deceptive communication purposes?
- Why does ambiguity detection require different multi-agent mechanisms than verifiable reasoning tasks?
- How does ambiguity detection connect to models' ability to ask clarifying questions?
- Do LLMs struggle more with semantic accuracy than syntactic correctness across domains?
- Can LLMs infer implicit meaning without surface linguistic markers?
- Why do LLMs fail at implicit elements in literary and poetic text?
- Why do LLMs fail at semantic generalization despite grammatical accuracy?
- Can LLMs distinguish ethical cases that differ only in critical nouns?
- Why do language models overestimate irony likelihood in emoji use?
- Why do NLP benchmarks hide LLM failures in ambiguity handling?
- Can LLMs translate between natural language and formal logic faithfully?
- Can LLMs identify implicit metaphoric mappings that require pragmatic inference?
- How does the inability to manage ambiguity undermine literary analysis tasks?
- Can prompt engineering and external knowledge bases fix ambiguity recognition failures?
- How does the distance between natural language and formal notation affect translation accuracy?
- Why do LLMs choose surface-order quantifier scope over contextually correct readings?
- Can LLMs distinguish between surface requests and underlying mental states in dialogue?
- How does structural depth in sentences predict LLM annotation accuracy?
- How do LLMs handle false presuppositions embedded in user questions?
- How much semantic meaning survives when LLMs paraphrase poetry and literary text?
- Why do different readers extract different meanings from identical text?
- Why can language models detect author style without understanding why it matters?
- Why do NLP models fail at recognizing multiple valid interpretations?
- Can language models ask clarifying questions when sentences are ambiguous?
- How do human annotators disagree systematically on ambiguous examples?
- Why do language models struggle with context-dependent pragmatic interpretation?
- Does model confidence actually explain why paraphrases produce different outputs?
- Does adding multiple interpretations to ambiguous situations respect language more than resolving them?
- Why do benchmark tests fail to detect LLM comprehension gaps?
- Can language models distinguish between novel insight and unjustified conceptual blending?
- Can moral frameworks alone explain why readers understand sentences differently?
- Can LLMs recognize rhetorical devices they cannot actually produce themselves?
- Does chain-of-thought prompting overcome implicit meaning deficits in text analysis?
- How do LLMs compress literary language without losing essential nuance?
- Can LLMs distinguish stylistic patterns that carry meaning from mere convention?
- What training approach enables models to proactively request clarification?
- Can models distinguish between ambiguous and incomplete information inputs?
- Why do language models struggle with evaluative tasks like weighing competing viewpoints?
- Can grammar alone repair misunderstanding without ritual correction work?
- What architectural changes help AI avoid adding interpretations users didn't express?
- Why do LLMs choose incorrect edits despite understanding the task?
- How much does forcing single-choice answers damage alignment with complex intent?
- Why do language models fail at understanding ambiguous or complex requirements?
- Can detectors trained for one task reliably perform differently on unexpected text sources?
- How do LLMs translate informal prose into logically correct formal specifications?
- How can multiple conflicting values coexist in a single LLM system?
- Can LLMs express uncertainty in ways that preserve epistemic honesty?
- Does pseudo-labeling from LLMs degrade classifier performance?
- How do language models track multiple negotiating parties' commitments simultaneously?
- Can readers detect meaning through resonance patterns alone without knowing authorial intent?
- Where does the meaning actually originate in reader-detected resonance across language?
- Can instruction prompts reliably steer an LLM judge toward specific alignment targets?
Related concepts in this collection 4
- Do standard NLP benchmarks hide LLM ambiguity failures?
  When benchmark creators filter out ambiguous examples before testing, do they accidentally make it impossible to measure whether language models can actually handle ambiguity the way humans do?
  why this failure is normally invisible
- Why do speakers deliberately use ambiguous language?
  Explores whether ambiguity is a linguistic defect or a strategic tool speakers use for efficiency, politeness, and deniability. Matters because it challenges how we train language systems.
  what is being failed at
- Why do large language models fail at complex linguistic tasks?
  Explores whether LLMs have inherent limitations in detecting fine-grained syntactic structures, especially embedded clauses and recursive patterns, and whether these failures are systematic rather than random.
  this is one of the deepest blind spots
- Why does ChatGPT fail at implicit discourse relations?
  ChatGPT excels when discourse connectives are present but drops to 24% accuracy without them. What does this gap reveal about how LLMs actually process meaning and logical relationships?
  ambiguity failure is another asymmetry: explicit = manageable, multiple interpretations = failure
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- We’re Afraid Language Models Aren’t Modeling Ambiguity
- Interpretation modeling: Social grounding of sentences by reasoning over their implicit moral judgments
- Aligning Language Models to Explicitly Handle Ambiguity
- Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds
- Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey
- Word Meanings in Transformer Language Models
- Probing Structured Semantics Understanding and Generation of Language Models via Question Answering
- Beyond Single Models: Enhancing LLM Detection of Ambiguity in Requests through Debate
Original note title
llms fail at ambiguity recognition with gpt-4 achieving 32% correct disambiguations vs 90% for humans