Why do AI systems miss jokes and wordplay so consistently?

This note explores whether AI's literal reading of language stems from the way transformers integrate tokens in parallel, rather than through the selective frame-activation humans use. Understanding this gap could reveal which cognitive operations current architectures lack.

Synthesis note · 2026-04-14

The transformer architecture processes a sequence of tokens through attention layers that compute relations between token pairs. Information about the words is integrated, but the integration is parallel and additive: each token's representation is updated with a sum over the others, in proportion to the attention weights. Softmax attention can down-weight a token, but there is no explicit operation that suppresses some attention paths in order to surface the frame that holds a subset of tokens together. The mechanism does not do selective-resonance; it does weighted-aggregation.
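The weighted-aggregation step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a model of any real transformer: the 2-d vectors are made up, a single query stands in for a whole attention layer, and `hard_select` is a hypothetical contrast added here to show what suppression would look like if something supplied the subset.

```python
import math

def softmax(scores):
    """Numerically stable softmax: exponentiate, then normalise to sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product scores for one query against every key, then softmax."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

def aggregate(weights, values):
    """Weighted sum of value vectors: the weighted-aggregation step."""
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def hard_select(weights, keep):
    """Hypothetical contrast: zero every weight outside `keep`, then renormalise."""
    kept = [w if i in keep else 0.0 for i, w in enumerate(weights)]
    total = sum(kept)
    return [w / total for w in kept]

# Made-up 2-d vectors, for illustration only (not real embeddings).
tokens = ["bullseye", "dot", "cover", "arrow"]
vectors = [[1.0, 0.2], [0.8, 0.1], [0.0, 1.0], [0.9, 0.3]]

# "bullseye" attends to all four tokens, itself included.
weights = attention_weights(vectors[0], vectors)
soft_output = aggregate(weights, vectors)

# The index set {0, 1, 3} is supplied by hand; choosing it is the open problem.
selected = hard_select(weights, keep={0, 1, 3})
hard_output = aggregate(selected, vectors)

for token, w, s in zip(tokens, weights, selected):
    print(f"{token:9s} soft={w:.3f} hard={s:.3f}")
```

With these toy vectors, the soft weight on "cover" is the smallest of the four but still positive, so "cover" still contributes to the aggregate. Only the hand-supplied mask removes it, and nothing in the sketch decides which tokens belong in that mask.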

This explains a recurring AI failure pattern. Given material in which some words activate a frame and others do not, AI tends to take each word at its own compositional value rather than catching the frame the subset activates. The bullseye example illustrates this: given "bullseye" applied to a design with a dot, a cover, and an arrow through it, AI reads "bullseye" as a complimentary metaphor and misses the archery frame that three of the four words activate. The miss is structural, not a knowledge gap. AI knows what "bullseye" is, knows what "arrow" is, knows what "dot" is. What it does not do is select these three for frame-activation while suppressing "cover."

This generalizes beyond wordplay. The same mechanism underlies AI difficulties with jokes (the punchline activates a frame that recontextualizes the setup), with poetry (image-clusters activate frames the literal words do not), and with rhetoric (a frame is built from selective material across a passage). Each of these depends on selective-resonance, the operation transformers do not perform. The miss is not "AI lacks world knowledge"; it is "AI lacks the selective-suppression operation that frame-activation requires."

The human-side companion to this note is "Does the mind selectively activate frames from only some words?" Together the two claims locate the difference precisely: not that AI lacks data or context, but that the cognitive operation human meaning-making relies on is not the operation transformers perform.

The strongest counterargument is that better attention mechanisms, finer-grained attention heads, and explicit frame-extraction layers could close the gap. That is possible but not yet evident. The gap appears even in the largest models with the most sophisticated attention, which suggests the operation needed is not just better attention but a different operation. Selective frame-activation may require something architecturally distinct from attention-as-weighted-aggregation.
