What makes ambiguity recognition fundamentally important for poetry analysis?
This explores why the ability to hold multiple valid interpretations at once — ambiguity recognition — sits at the very center of what poetry analysis is, and why machines that can't do it can describe a poem without ever reading it.
Ambiguity recognition isn't a side skill in poetry analysis but the load-bearing one, and the corpus makes a surprisingly sharp case for it. The starting point is that poems mean more than one thing on purpose. Ambiguity turns out to be a deliberate design feature of language rather than a defect to clean up: speakers and writers exploit it to be efficient, to be politely indirect, and to leave room for plausible deniability (see "Why do speakers deliberately use ambiguous language?"). Poetry is the place where this feature is dialed to maximum. So if you can't recognize that a line is doing two things at once, you haven't simplified the poem; you've deleted the thing that makes it a poem.
The reason this matters analytically is that interpretations of the same words are *irreducibly multiple*, and that multiplicity carries real information rather than noise. Readers disagree on socially loaded sentences not because some are wrong, but because the spread of readings is itself meaningful data about the text (see "Why do readers interpret the same sentence so differently?"). Poetry analysis is largely the practice of mapping that spread: naming the tension between readings rather than collapsing it to one. Recognizing ambiguity is the entry ticket to that whole enterprise.
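One way to make "the spread of readings is data" concrete is to measure it. The sketch below is a minimal illustration, not a method taken from the cited notes: it assumes each reader's interpretation has been reduced to a single label, and the function names and sample labels are hypothetical. It scores a line by the Shannon entropy of its readings, so unanimous agreement scores zero and an even split between two readings scores one bit.

```python
from collections import Counter
from math import log2


def reading_distribution(readings):
    """Return the share of readers who chose each interpretation."""
    counts = Counter(readings)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def spread_in_bits(readings):
    """Shannon entropy of the readings; zero means every reader agreed."""
    shares = reading_distribution(readings).values()
    return -sum(p * log2(p) for p in shares)


# Hypothetical reader responses to a single line (illustrative labels only).
unanimous = ["literal", "literal", "literal", "literal"]
split = ["literal", "ironic", "literal", "ironic"]

print(reading_distribution(split))
print(spread_in_bits(unanimous), spread_in_bits(split))
```

Keeping the whole distribution, rather than only the majority label, is the design choice that matters here: the majority label is exactly the "collapse to one reading" the paragraph above warns against.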
Here's the part you might not expect: machines are catastrophically bad at exactly this step, and that failure exposes what analysis actually requires. On the AMBIENT benchmark, GPT-4 correctly disambiguates only 32% of cases against 90% for humans, which suggests it cannot hold two interpretations in mind simultaneously (see "Can language models recognize when text is deliberately ambiguous?"). A study of literary reading finds the same fault line: LLMs happily extract the *mechanics* (metaphor mappings, stylistic signatures, authorship) but collapse at ambiguity recognition, implicit relations, evaluative stance, and connotation, which is precisely where literary meaning lives (see "Can LLMs truly understand literary meaning or just mechanics?"). Style detection saturates early and easily; GPT-2 can identify authorship from surface patterns at 95% accuracy while having no framework for *why* those choices carry meaning (see "Can language models truly understand literary style?"). The lesson is that detection without interpretation is cataloguing, not criticism, and the dividing line between them is ambiguity.
There's a quieter mechanism behind the failure worth knowing about. Models tend to track statistical mass from training rather than meaning: given two phrasings of the same idea, they systematically prefer the more frequent surface form regardless of sense (see "Do language models really understand meaning or just surface frequency?"). Poetry works by doing the opposite: choosing the rare, the marked, the surprising phrasing precisely *because* it forces a second reading. A system biased toward the high-frequency path is structurally pointed away from the poetic one. And our evaluation habits hide all of this: standard NLP benchmarks routinely filter out the examples where annotators disagree, which quietly removes the very cases that would expose ambiguity failures (see "Do standard NLP benchmarks hide LLM ambiguity failures?").
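The filtering effect described above is easy to see in miniature. This is a hedged sketch rather than any benchmark's actual pipeline: the `agreement` and `filter_by_agreement` helpers, the threshold, and the two example items with their labels are all assumptions made up for illustration. It keeps only items whose annotators agree and shows that the ambiguous item is the one that disappears.

```python
def agreement(labels):
    """Fraction of annotators who chose the most common label."""
    top = max(labels.count(label) for label in set(labels))
    return top / len(labels)


def filter_by_agreement(examples, threshold=1.0):
    """Keep only examples whose annotator agreement meets the threshold."""
    return [ex for ex in examples if agreement(ex["labels"]) >= threshold]


# Hypothetical annotated items; the labels are invented for illustration.
examples = [
    {"text": "I saw her duck.", "labels": ["bird", "crouch", "bird"]},
    {"text": "The cat sat on the mat.", "labels": ["literal", "literal", "literal"]},
]

kept = filter_by_agreement(examples)
dropped = [ex for ex in examples if ex not in kept]

print("kept:", [ex["text"] for ex in kept])
print("dropped:", [ex["text"] for ex in dropped])
```

Any model evaluated only on `kept` is never asked about the sentence with two live readings, so a failure on that sentence cannot show up in the score.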
The hopeful coda is that ambiguity recognition can be *engineered* when you stop treating it as a single confident guess. A leader-follower debate protocol, in which one model proposes interpretations and two others challenge them with roles rotating, pushed Mistral-7B to 76.7% accuracy on ambiguity detection, because forcing competing readings into the open mimics what a careful reader does (see "Can structured debate roles help small models detect ambiguity?"). That's the deepest thing the corpus offers about poetry: recognizing ambiguity isn't choosing the right meaning, it's refusing to choose too early. It means holding the interpretations in productive tension, which is exactly what both the multi-reader and the multi-agent approaches formalize.
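The control flow of such a debate can be sketched in a few lines. Everything here is an assumption made for illustration: the source describes only a leader who proposes, followers who challenge, and rotating roles, so `ToyAgent`, its `propose` and `challenge` methods, and the rule that a text counts as ambiguous when more than one reading surfaces are hypothetical stand-ins, not the published protocol. Real agents would be model calls; these toy agents simply know a fixed set of readings.

```python
from dataclasses import dataclass


@dataclass
class ToyAgent:
    """Hypothetical stand-in for a model; `known` is the readings it can see."""
    name: str
    known: frozenset

    def propose(self, text):
        # As leader: put forward every reading this agent sees.
        return set(self.known)

    def challenge(self, text, proposed):
        # As follower: raise any reading the leader left out.
        return set(self.known) - proposed


def debate(text, agents):
    """Rotate the leader role through all agents and pool surfaced readings."""
    readings = set()
    for i, leader in enumerate(agents):
        followers = agents[:i] + agents[i + 1:]
        proposed = leader.propose(text)
        readings |= proposed
        for follower in followers:
            readings |= follower.challenge(text, proposed)
    return readings


def is_ambiguous(text, agents):
    """Assumed decision rule: more than one surviving reading means ambiguous."""
    return len(debate(text, agents)) > 1


agents = [
    ToyAgent("a", frozenset({"literal"})),
    ToyAgent("b", frozenset({"ironic"})),
    ToyAgent("c", frozenset({"literal"})),
]
print(debate("a hypothetical line of verse", agents))
print(is_ambiguous("a hypothetical line of verse", agents))
```

With deterministic toy agents the rotation changes nothing, since the pooled result is the same whoever leads. The point of rotating with real models would be that each turn is a fresh generation, so no single agent's framing gets to settle the reading unchallenged.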
Sources (8 notes)
Research shows speakers exploit ambiguity to balance efficiency against clarity, enable polite indirection, and permit plausible deniability. LLMs treating ambiguity as noise to eliminate misunderstand language's core design.
Interpretation Modeling research shows that disagreement on socially embedded sentences reflects valid differences in reader perspective, not annotation failure. Structured human disagreement in NLI benchmarks confirms that interpretation distributions carry meaningful information.
AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity—revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.
LLMs successfully extract explicit literary features like metaphoric mappings and stylistic signatures. However, they systematically fail at implicit relations (24% accuracy), ambiguity recognition (32% vs 90% human), evaluative stance-taking, and preserving connotation—the core dimensions where literary meaning operates.
GPT-2 achieves 95% accuracy identifying authorship through style patterns alone, but lacks the evaluative framework to explain why those stylistic choices carry meaning. Detection without interpretation remains cataloguing, not criticism.
LLMs show consistent preference for higher-frequency surface forms over semantically equivalent rare paraphrases across math, machine translation, commonsense reasoning, and tool calling. This suggests models track statistical mass from pretraining rather than meaning-recognition as their primary mechanism.
By filtering out examples where annotators disagree, benchmarks remove test cases that would reveal LLM failures at ambiguity recognition. Research using ambiguous examples shows a 32% vs. 90% accuracy gap invisible to standard evaluation.
Mistral-7B achieved 76.7% accuracy in ambiguity detection through a protocol where a leader proposes interpretations and two followers challenge them with rotating roles. Role rotation and consensus forcing prevent persuasive framing failures and create stronger verification than pairwise debate.