Does AI struggle with poetry for the same reason it misses jokes?

This explores whether the breakdown AI shows with jokes and wordplay is the same underlying failure that shows up with poetry — and the corpus says yes: both trace back to how transformers combine words rather than to any gap in knowledge.

The most striking thread in the collection says the two share one root cause, and that it sits in the machinery of how words get combined rather than in what the model knows. The clearest account is that transformers read words *additively*: they blend all the tokens in a sentence through weighted parallel aggregation, never sharply suppressing the words that don't belong (Why do AI systems miss jokes and wordplay so consistently?). Human minds do the opposite: they hold a few frame-coherent words in tight resonance and quietly mute the linguistically adjacent ones, tracking what *fits the frame* rather than what merely co-occurs (Does the mind selectively activate frames from only some words?). A joke's punchline and a poem's image both work by making one frame suddenly win out over a competing one, which is exactly the selective operation the machine doesn't perform.
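To make the additive-versus-selective contrast concrete, here is a minimal Python sketch. It is a toy illustration, not a model of a real transformer or of human cognition: the token list, the relevance scores, and the `top_k_mask` helper are all hypothetical, and hard top-k masking is only one possible stand-in for what the notes call selective suppression. It shows that softmax weighting gives every token a nonzero share of the blend, while the masked version drives the weight of the non-selected tokens to exactly zero.

```python
import math

def softmax(scores):
    """Turn raw relevance scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_mask(scores, k):
    """Hypothetical 'selective' variant: keep the k best tokens, silence the rest."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    # Masked-out tokens get a score of -inf, so their weight becomes exactly 0.
    masked = [scores[i] if i in keep else float("-inf") for i in range(len(scores))]
    return softmax(masked)

# Hypothetical tokens and scores, chosen only for illustration.
tokens = ["river", "bank", "loan", "fishing"]
scores = [2.0, 1.5, 1.2, 0.3]

blended = softmax(scores)          # additive: every token contributes something
selective = top_k_mask(scores, 2)  # selective: only the top two tokens contribute

for token, soft, hard in zip(tokens, blended, selective):
    print(f"{token:8s} blended={soft:.3f} selective={hard:.3f}")
```

Whether any hard mask of this kind resembles human frame activation is an open question. The sketch only shows why a soft average never fully silences a misfit word.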

So the answer is: it's the same missing cognitive move, surfacing in two genres. A pun forces a quick frame-switch; a line of poetry sustains two frames at once and asks you to feel the tension. Both demand that some words dominate and others recede, and an architecture that averages everything will flatten both into literal paraphrase. One study reframes metaphors, idioms, and puns as a single task, recovering the intended meaning from non-literal expression, and argues that models need better *semantic decoupling* (the ability to peel intended sense away from surface words), not more examples of each category (Can one model handle all types of figurative language?). That unification is the deeper version of your question: poetry, jokes, irony, and metaphor may all be one problem wearing different costumes.

There's a twist worth knowing. AI doesn't always *miss* the figurative; sometimes it over-detects. Asked to score irony, GPT-4o assigns significantly higher irony scores than humans do, because ironic examples loom large in training data even though they're rare in real use (Do language models overestimate how often irony appears?). So the failure isn't blindness to non-literal language; it's *miscalibration*. The model recognizes the pattern but can't gauge when it's actually live. Poetry would stress this the same way a joke does: not 'is figurative language present?' but 'which reading should win here, and how strongly?'

The collection also hints at a related gap that poetry exposes even when frames aren't the issue. AI prose tends to be grammatically and organizationally fine but argumentatively inert: it masters structure while avoiding evaluative stance, leaning on neutral, descriptive language instead of words that carry a point of view (Why does AI writing sound generic despite being grammatically correct?). Poetry lives almost entirely in stance, compression, and the weight a single word is asked to bear, so this 'rhetorical gap' compounds the frame problem rather than replacing it.

The thing you might not have expected: this isn't a knowledge deficit that more data fixes. Related work shows that humans and models succeed and fail along the same content-sensitivity axis on reasoning tasks, so 'real understanding vs. pattern matching' is a shakier line than it looks (Do language models fail reasoning tests that humans pass?). The poetry-and-jokes failure is sharper and more specific than 'AI doesn't get nuance': it's a structural absence of selective frame activation, the same missing gear in both.

Sources 6 notes

Why do AI systems miss jokes and wordplay so consistently?

Transformers integrate token information through weighted parallel aggregation rather than selective suppression of irrelevant words. This structural difference explains consistent failures with jokes, wordplay, and frame-dependent meaning—not knowledge gaps, but missing cognitive operations.

Does the mind selectively activate frames from only some words?

Human meaning-making operates through selective frame activation: the mind holds frame-related words in tight resonance while ignoring linguistically adjacent but frame-unrelated words. This selectivity tracks frame-coherence, not co-occurrence frequency, and represents a cognitive operation that standard similarity computation cannot capture.

Can one model handle all types of figurative language?

The Diplomat dataset (4,177 dialogues) reframes metaphors, idioms, and puns as one pragmatic task: recovering the intended meaning from non-literal expression. This framing suggests LLMs need better semantic decoupling ability, not more category-specific training data.

Do language models overestimate how often irony appears?

GPT-4o assigns significantly higher irony scores than humans (p < .001), revealing that LLMs detect irony as a pattern but miscalibrate its prevalence because ironic examples are more salient in training data than in actual use.

Why does AI writing sound generic despite being grammatically correct?

AI text uses manner nouns and anaphoric references that are descriptively neutral, while human writers use status and evidential nouns that carry evaluative weight. This produces organizationally coherent but argumentatively inert prose.

Do language models fail reasoning tests that humans pass?

Research shows both humans and LLMs succeed and fail along the same content-sensitivity axis in reasoning tasks like Wason tests and natural language inference. Content-independence is not a meaningful criterion for distinguishing real reasoning from pattern matching.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains open: Does AI struggle with poetry for the same reason it misses jokes—and if so, can that shared constraint be engineered away?

What a curated library found — and when (dated claims, not current truth):
Findings span 2022–2026; treat these as perishable claims needing re-test:
• Transformers blend tokens additively via parallel weighted aggregation, never sharply suppressing misfit words; humans selectively resonate frame-coherent words and mute the rest (2022–2024).
• Both jokes and poetry demand frame dominance and suppression—a selective operation absent in averaging architectures (2023–2024).
• AI over-detects irony rather than missing it: GPT-4o flags irony ~3× more often than humans due to training-data prevalence, a *calibration* not blindness problem (2025-01).
• LLM prose masters grammatical structure but lacks evaluative stance—the rhetoric-grammar gap compounds frame failure (2024–2025).
• Humans and models fail along *identical* content-sensitivity axes on reasoning, so symbolic-vs-connectionist framing is less clean than claimed (2022).

Anchor papers (verify; mind their dates):
• arXiv:2207.07051 (2022) — content effects on reasoning, humans ≈ models
• arXiv:2501.11241 (2025-01) — irony calibration bias in GPT-4o
• arXiv:2510.14665 (2025-10) — illusion of understanding in LLMs
• arXiv:2511.19757 (2025-11) — what language understanding means

Your task:
(1) RE-TEST the frame-selectivity constraint. Has any post-2024 work on attention masking, architectural sparsity (e.g., mixture-of-experts), or inference-time steering (e.g., SAE-based pruning, selective activation) *relaxed* the additive blending problem? Or does it still hold across decoder-only and encoder-decoder models? Separate the durable claim (frame-switching is hard) from the perishable one (transformers cannot do it).
(2) Surface the strongest *contradicting* work from the last 6 months: any findings that show AI *does* resolve poetry/joke comprehension without architectural change—e.g., via prompting, in-context examples, or post-hoc calibration—or that show the frame problem is *not* the bottleneck.
(3) Propose 2 research questions assuming the regime has moved: (a) If selective frame activation can be induced via inference tricks, does it generalize to novel poetic / ironic contexts, or does it remain brittle? (b) Does the over-detection of irony (the third finding above) apply equally to poetry, or do the two genres differ in how frequency biases distort judgment?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
