INQUIRING LINE

How much semantic meaning survives when LLMs paraphrase poetry and literary text?

This explores what happens to the actual meaning of a poem or literary passage when an LLM rewrites it in different words — and the corpus suggests the surface mechanics survive while the deeper meaning leaks out.


The short answer the corpus converges on: the machinery survives, the meaning doesn't. Several notes draw the same line in different places. LLMs are good at the dissectible, explicit layer of literature (metaphoric mappings, stylistic signatures, authorship fingerprints) but fail at the implicit, evaluative, ambiguous layer where literary meaning actually lives [Can LLMs truly understand literary meaning or just mechanics?]. Style detection saturates early (GPT-2 hits 95% on authorship from pattern alone), yet the model has no framework to say *why* those choices carry weight: detection without interpretation is cataloguing, not criticism [Can language models truly understand literary style?].

The most concrete answer to 'how much survives' comes from the frequency work, which reveals a directional bias, not just random loss. LLMs systematically prefer high-frequency phrasings over rarer but equivalent ones, because they're tracking statistical mass from pretraining rather than recognizing meaning [Do language models really understand meaning or just surface frequency?]. This matters for poetry because of the second half of the mechanism: frequent words tend to be more abstract (in WordNet, general hypernyms occur more often than specific hyponyms), so a frequency-biased paraphrase drifts steadily toward abstraction and erases expert-level, fine-grained specificity [Does word frequency correlate with semantic abstraction?]. Poetry is precisely the genre that lives in the rare, specific, connotation-loaded word, so paraphrase pushes it toward the bland and general. 'Same meaning' prompts already produce different outputs for this reason; semantic equivalence is, in the corpus's blunt phrasing, a fiction [Why do semantically identical prompts produce different LLM outputs?].
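One way to make the frequency-drift claim testable is to compare the average word frequency of a passage with that of its paraphrase. The sketch below is illustrative only: the frequency table, the unknown-word floor, the example sentences, and the function names are all hypothetical and do not come from the cited notes. A real measurement would use a corpus-derived frequency list and would probably restrict the average to content words.

```python
import math

# Hypothetical per-million word frequencies, invented for illustration.
# A real analysis would load a corpus-derived frequency list instead.
TOY_FREQ_PER_MILLION = {
    "a": 23000.0, "the": 50000.0, "over": 1500.0,
    "bird": 90.0, "flew": 30.0, "field": 80.0,
    "kestrel": 0.4, "hovered": 3.0, "stubble": 0.6,
}
UNKNOWN_FREQ = 0.1  # assumed floor for words missing from the toy table


def mean_log_frequency(text, freq=TOY_FREQ_PER_MILLION):
    """Average log10 frequency of the words in `text`."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        raise ValueError("text contains no words")
    return sum(math.log10(freq.get(w, UNKNOWN_FREQ)) for w in words) / len(words)


def frequency_drift(original, paraphrase):
    """Positive values mean the paraphrase uses more frequent words on average."""
    return mean_log_frequency(paraphrase) - mean_log_frequency(original)


original = "A kestrel hovered over the stubble"
paraphrase = "A bird flew over the field"
drift = frequency_drift(original, paraphrase)
print(f"frequency drift (log10 units per word): {drift:+.2f}")
```

A positive drift on its own only shows that the paraphrase moved toward commoner words. Whether that also means a move toward abstraction depends on the hypernym/hyponym relationship the notes describe, which this sketch does not check.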

Two capacities poetry depends on are exactly where the models break. Ambiguity, the ability to hold several readings of a line at once, collapses: GPT-4 correctly disambiguates only 32% of deliberately ambiguous cases versus 90% for humans, which the corpus reads as an inability to hold multiple interpretations simultaneously [Can language models recognize when text is deliberately ambiguous?]. And figurative language degrades along a spectrum: conventional, lexicalized metaphors paraphrase fine, but novel literary metaphors, the kind a poet invents, require genuine conceptual domain-mapping that pattern recognition can't do [Where does LLM metaphor comprehension actually break down?]. So the loss isn't uniform; it's heaviest exactly where the writing is most original.

There's a sharper framing worth pulling in from adjacent territory. One line of work reframes all figurative language (metaphor, idiom, pun) as a single pragmatic task: recovering literal meaning from non-literal expression [Can one model handle all types of figurative language?]. Note what that framing concedes: the success metric is *flattening* the non-literal into the literal. For poetry, the non-literal often *is* the meaning, so even a 'successful' paraphrase under this framing has discarded the thing you cared about. This connects to the deeper diagnosis that LLM understanding can be a 'potemkin': correct explanation running on a pathway disconnected from correct application [Can LLMs understand concepts they cannot apply?]. A model can explain a poem's theme fluently and still produce a paraphrase that has quietly drained it.

The less obvious takeaway: the loss is patterned and predictable, not noise. It flows in a specific direction (toward the frequent, the abstract, the literal, the single-reading), which means a paraphrase doesn't lose meaning randomly. It loses meaning the way a photocopy of a photocopy loses contrast: specifics fade first, ambiguity gets resolved into one safe reading, and novel images get translated into conventional ones. If you want the grounding question underneath all this, namely whether a text-trained system can reach meaning anchored in lived human experience at all, the corpus's most generous answer is 'indirect causal grounding': regularities extracted secondhand from causally grounded humans, with gaps [Can large language models develop genuine world models without direct environmental contact?]. For poetry, those gaps are the whole point.


Sources (10 notes)

Can LLMs truly understand literary meaning or just mechanics?

LLMs successfully extract explicit literary features like metaphoric mappings and stylistic signatures. However, they systematically fail at implicit relations (24% accuracy), ambiguity recognition (32% vs 90% human), evaluative stance-taking, and preserving connotation—the core dimensions where literary meaning operates.

Can language models truly understand literary style?

GPT-2 achieves 95% accuracy identifying authorship through style patterns alone, but lacks the evaluative framework to explain why those stylistic choices carry meaning. Detection without interpretation remains cataloguing, not criticism.

Do language models really understand meaning or just surface frequency?

LLMs show consistent preference for higher-frequency surface forms over semantically equivalent rare paraphrases across math, machine translation, commonsense reasoning, and tool calling. This suggests models track statistical mass from pretraining rather than meaning-recognition as their primary mechanism.

Does word frequency correlate with semantic abstraction?

WordNet analysis shows hypernyms (general concepts) occur more frequently than hyponyms (specific ones). Combined with LLMs' frequency bias, this means preferring common paraphrases systematically drifts toward abstraction, erasing expert-level specificity.

Why do semantically identical prompts produce different LLM outputs?

Cao et al. and Adam's Law show that semantically identical prompts with different sentence-level frequencies produce systematically different output quality. Higher-frequency phrasings win because models register statistical mass from pre-training, not meaning.

Can language models recognize when text is deliberately ambiguous?

AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity—revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.

Where does LLM metaphor comprehension actually break down?

LLMs handle conventional, lexicalized metaphors but fail on novel literary metaphors requiring conceptual domain mapping. This degradation reveals a fundamental gap between pattern recognition and genuine semantic mapping.

Can one model handle all types of figurative language?

The Diplomat dataset (4,177 dialogues) reframes metaphors, idioms, and puns as one pragmatic task: recovering literal meaning from non-literal expression. This framing suggests LLMs need better semantic decoupling ability, not more category-specific training data.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

Can large language models develop genuine world models without direct environmental contact?

LLMs form structured world representations by extracting regularities from training data produced by causally grounded humans. This constitutes indirect causal grounding mediated through text, though the chain has gaps that limit real-time verification and model updating.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about whether LLMs preserve semantic meaning when paraphrasing poetry and literary text. The question remains open: *what survives*?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026. The library converges on these constraints:
• LLMs detect authorial style at high accuracy (GPT-2: 95% on pattern alone) but cannot explain *why* choices carry weight — detection without interpretation (~2024).
• High-frequency words systematically replace rare equivalents in paraphrase; since frequent words are more abstract, paraphrases drift toward abstraction, erasing fine-grained specificity (~2025).
• GPT-4 disambiguates only 32% of deliberately ambiguous cases vs. 90% for humans; multiple simultaneous readings collapse (~2023).
• Novel literary metaphors degrade because they require genuine conceptual domain-mapping, whereas conventional lexicalized metaphors paraphrase reliably (~2025).
• Figurative language framing treats non-literal as a problem to flatten into literal; for poetry, the non-literal *is* the meaning (~2024).

Anchor papers (verify; mind their dates):
• arXiv:2304.14399 (2023) — ambiguity failure in LLMs
• arXiv:2505.21011 (2025) — frequency pattern learning in inference
• arXiv:2507.10354 (2025) — layered metaphor processing
• arXiv:2604.02176 (2026) — frequency law in LLMs

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models, instruction-tuning, retrieval-augmented generation, or multi-pass reasoning (e.g., explicit ambiguity-preservation prompts, structured metaphor parsing) have since RELAXED or OVERTURNED the limitation. Separate the durable question (*can a next-token predictor access non-literal meaning?*) from the perishable claim about current model capacity. Cite what resolved each, and state plainly where constraints still hold.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last 6 months. Look for papers claiming LLMs *do* preserve semantic nuance in paraphrase, or that figurative language is more tractable than the library suggests.
(3) Propose 2 research questions that ASSUME the regime may have shifted: e.g., whether fine-tuning on high-ambiguity corpora or explicit multi-interpretation training breaks the 32% ceiling; whether retrieval of contextual / historical usage patterns restores specificity loss.

Cite arXiv IDs; flag anything you cannot ground in a real paper.
