Does AI struggle with poetry for the same reason it misses jokes?
This explores whether the breakdown AI shows with jokes and wordplay is the same underlying failure that shows up with poetry — and the corpus says yes: both trace back to how transformers combine words rather than to any gap in knowledge.
The most striking thread in the collection traces both failures to one root cause, which sits in the machinery of how words get combined rather than in what the model knows. The clearest account is that transformers read words *additively*: they blend all the tokens in a sentence through weighted parallel aggregation and never sharply suppress the words that don't belong (see "Why do AI systems miss jokes and wordplay so consistently?"). Human minds do the opposite: they hold a few frame-coherent words in tight resonance and quietly mute the linguistically adjacent ones, tracking what *fits the frame* rather than what merely co-occurs (see "Does the mind selectively activate frames from only some words?"). A joke's punchline and a poem's image both work by making one frame suddenly win out over a competing one, which is exactly the selective operation the machine doesn't perform.
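To make the contrast between additive blending and selective muting concrete, here is a minimal toy sketch. It is not how any real transformer or any human mind is implemented: the scores, the one-dimensional "frame" values, and the `selective_weights` hard gate are all hypothetical, chosen only to show that a softmax-weighted average gives every token some say, while a hard gate lets one frame win outright.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_weights(scores):
    """Weighted parallel aggregation: every token keeps a nonzero weight."""
    return softmax(scores)


def selective_weights(scores, keep=1):
    """Hypothetical hard gate: keep the top `keep` tokens, mute the rest to zero."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = set(ranked[:keep])
    masked = [s if i in kept else float("-inf") for i, s in enumerate(scores)]
    return softmax(masked)


def blend(weights, values):
    """Combine per-token values into one reading."""
    return sum(w * v for w, v in zip(weights, values))


# Assumed toy data: +1.0 means "fully frame A", -1.0 means "fully frame B".
scores = [2.0, 1.5, 1.0, 0.5]    # how strongly each token is attended to
values = [1.0, -1.0, -0.8, 0.9]  # which frame each token pulls toward

print("additive reading: ", blend(additive_weights(scores), values))
print("selective reading:", blend(selective_weights(scores, keep=1), values))
```

In this toy setup the additive reading lands somewhere between the two frames, while the selective reading commits to a single one. That is the flattening the notes describe, shown in miniature.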
So the answer is: it's the same missing cognitive move, surfacing in two genres. A pun forces a quick frame-switch; a line of poetry sustains two frames at once and asks you to feel the tension. Both demand that some words dominate and others recede, and an architecture that averages everything will flatten both into literal paraphrase. One study reframes metaphors, idioms, and puns as a single task, recovering the intended meaning from non-literal expression, and argues models need better *semantic decoupling*, the ability to peel intended sense away from surface words, not more examples of each category (see "Can one model handle all types of figurative language?"). That unification is the deeper version of your question: poetry, jokes, irony, and metaphor may all be one problem wearing different costumes.
There's a twist worth knowing. AI doesn't always *miss* the figurative; sometimes it over-detects. Asked to score irony, GPT-4o gives significantly higher irony ratings than humans do, because ironic examples loom large in training data even though they're rare in real use (see "Do language models overestimate how often irony appears?"). So the failure isn't blindness to non-literal language but *miscalibration*: the model recognizes the pattern but can't gauge when it's actually live. Poetry would stress this the same way a joke does: not 'is figurative language present?' but 'which reading should win here, and how strongly?'
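One way to picture that miscalibration is as a base-rate problem. The sketch below is only an analogy under assumed numbers: the priors and cue likelihoods are made up for illustration and do not come from the irony study, and nothing here claims a language model literally computes Bayes' rule. It just shows how the same textual cue yields a much higher irony estimate when irony is assumed to be common than when it is assumed to be rare.

```python
def posterior_irony(prior, p_cue_given_irony, p_cue_given_literal):
    """Bayes' rule: probability a sentence is ironic, given an irony-like cue."""
    numerator = prior * p_cue_given_irony
    denominator = numerator + (1.0 - prior) * p_cue_given_literal
    return numerator / denominator


# Hypothetical values, for illustration only.
P_CUE_GIVEN_IRONY = 0.6    # how often the cue shows up in ironic text
P_CUE_GIVEN_LITERAL = 0.2  # how often it shows up in literal text

skewed_prior = 0.30   # assumed share of irony in a salience-skewed corpus
everyday_prior = 0.03 # assumed share of irony in ordinary use

print("estimate with skewed prior:  ",
      posterior_irony(skewed_prior, P_CUE_GIVEN_IRONY, P_CUE_GIVEN_LITERAL))
print("estimate with everyday prior:",
      posterior_irony(everyday_prior, P_CUE_GIVEN_IRONY, P_CUE_GIVEN_LITERAL))
```

The detector is identical in both calls; only the assumed prevalence changes. That is the sense in which a system can recognize a pattern correctly and still over-report it.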
The collection also hints at a related gap that poetry exposes even when frames aren't the issue. AI prose tends to be grammatically and organizationally fine but argumentatively inert: it masters structure while avoiding evaluative stance, leaning on neutral, descriptive language instead of words that carry a point of view (see "Why does AI writing sound generic despite being grammatically correct?"). Poetry lives almost entirely in stance, compression, and the weight a single word is asked to bear, so this 'rhetorical gap' compounds the frame problem rather than replacing it.
The thing you might not have expected: this isn't a knowledge deficit that more data fixes. A related note shows that humans and models succeed and fail along the *same* content-sensitivity axis on reasoning tasks such as Wason tests and natural language inference, so 'real understanding vs. pattern matching' is a shakier line than it looks (see "Do language models fail reasoning tests that humans pass?"). The poetry-and-jokes failure is sharper and more specific than 'AI doesn't get nuance': it's a structural absence of selective frame activation, the same missing gear in both.
Sources (6 notes)
Transformers integrate token information through weighted parallel aggregation rather than selective suppression of irrelevant words. This structural difference explains consistent failures with jokes, wordplay, and frame-dependent meaning—not knowledge gaps, but missing cognitive operations.
Human meaning-making operates through selective frame activation: the mind holds frame-related words in tight resonance while ignoring linguistically adjacent but frame-unrelated words. This selectivity tracks frame-coherence, not co-occurrence frequency, and represents a cognitive operation that standard similarity computation cannot capture.
The Diplomat dataset (4,177 dialogues) reframes metaphors, idioms, and puns as one pragmatic task: recovering the intended meaning from non-literal expression. This framing suggests LLMs need better semantic decoupling ability, not more category-specific training data.
GPT-4o assigns significantly higher irony scores than humans (p < .001), revealing that LLMs detect irony as a pattern but miscalibrate its prevalence because ironic examples are more salient in training data than in actual use.
AI text uses manner nouns and anaphoric references that are descriptively neutral, while human writers use status and evidential nouns that carry evaluative weight. This produces organizationally coherent but argumentatively inert prose.
Research shows both humans and LLMs succeed and fail along the same content-sensitivity axis in reasoning tasks like Wason tests and natural language inference. Content-independence is not a meaningful criterion for distinguishing real reasoning from pattern matching.