Can LLMs improve at metaphor if they handle decoupled semantics better?

This explores whether LLMs' weakness at metaphor is really a symptom of a deeper problem — their difficulty separating a word's literal meaning from its statistical, frequency-driven surface form ("decoupled semantics").

In the narrower sense researchers use, decoupling semantics means separating a word's literal meaning from its non-literal use. On whether that difficulty underlies LLMs' weakness at metaphor, the corpus suggests a qualified yes, with a twist: metaphor failure isn't a standalone skill gap but the visible tip of how these models reason at all. One line of work reframes metaphor, idioms, and puns not as separate categories to be memorized but as a single pragmatic task, recovering literal meaning from non-literal expression, which implies that what LLMs lack is general semantic-decoupling ability rather than more metaphor-specific training data (Can one model handle all types of figurative language?). If that's true, improving decoupling would lift metaphor along with everything else in the family.

But here's the catch the corpus keeps circling: LLMs may not have a clean "meaning" representation to decouple in the first place. When researchers strip the familiar semantic content out of reasoning tasks and leave the logical rules intact, model performance collapses, which is evidence that LLMs reason through semantic association, not symbolic manipulation (Do large language models reason symbolically or semantically?). They also systematically prefer high-frequency phrasings over rarer but equivalent ones, suggesting they track statistical mass from pretraining more than meaning itself (Do language models really understand meaning or just surface frequency?). Metaphor is precisely where this bites: novel literary metaphors are low-frequency and demand mapping one conceptual domain onto another, and that's exactly where comprehension degrades, while conventional, lexicalized metaphors (already baked into the training distribution) work fine (Where does LLM metaphor comprehension actually break down?).
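To make the "strip the semantic content, keep the logical rules" manipulation concrete, here is a minimal sketch. It is not the cited study's actual procedure: the `decouple` helper, the `X1`, `X2` symbol scheme, and the example task are all assumptions chosen for illustration. The idea is only that each content word is consistently swapped for an opaque symbol, so the inference pattern survives while familiar associations do not.

```python
import re


def decouple(text: str, content_words: list[str]) -> tuple[str, dict[str, str]]:
    """Replace each content word with an opaque symbol, consistently.

    Hypothetical helper: the symbol names (X1, X2, ...) are arbitrary.
    """
    mapping = {word: f"X{i}" for i, word in enumerate(content_words, start=1)}
    # Try longer words first so one word that contains another is not clobbered.
    alternatives = "|".join(
        re.escape(word) for word in sorted(mapping, key=len, reverse=True)
    )
    pattern = re.compile(rf"\b({alternatives})\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text), mapping


# Illustrative task (an assumption, not taken from any benchmark).
task = "If a bird is a sparrow, then the bird can fly. Tweety is a sparrow. Can Tweety fly?"
abstract, mapping = decouple(task, ["sparrow", "bird", "fly", "Tweety"])
print(abstract)
# If a X2 is a X1, then the X2 can X3. X4 is a X1. Can X4 X3?
```

Both versions call for the same deduction. Comparing a model's accuracy on the original and the symbol-substituted form is one rough way to see how much of its answer rests on familiar semantics rather than on the rule itself.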

There's an even sharper diagnosis. Models can explain a concept correctly and then fail to apply it, a "potemkin" pattern in which the explanation pathway and the execution pathway are functionally disconnected (Can LLMs understand concepts they cannot apply?). So a model might define metaphor flawlessly and still mishandle a fresh one, because knowing-about and doing are wired separately. Relatedly, metaphor often requires holding the literal and figurative readings at once, and LLMs are strikingly bad at sustaining multiple interpretations: GPT-4 correctly disambiguates only about a third of deliberately ambiguous cases, versus ninety percent for humans (Can language models recognize when text is deliberately ambiguous?). Decoupling semantics isn't just separating literal from figurative; it's keeping both live simultaneously, which these models struggle to do.

Where the corpus gets generative is on what "better decoupling" might concretely look like. Pure relational compression of text is enough to learn fluent, culturally situated language without any external grounding (Can language models learn meaning without engaging the world?), but fluency clearly isn't the same as the conceptual mapping novel metaphor needs. A more promising hint comes from work showing that partial symbolic augmentation beats both raw language and full formalization: selectively adding structure while preserving semantic richness yields the gains, because full formalization throws away the very nuance metaphor depends on (Why does partial formalization outperform full symbolic logic?). And metaphor may not even live in the "reasoning" bucket current methods optimize: it leans on transformational and exploratory creative reasoning that existing LLM reasoning techniques simply don't target (Can LLMs reason creatively beyond conventional problem-solving?).
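As a rough picture of what "selectively adding structure while preserving semantic richness" could mean in practice, the sketch below keeps the original sentence verbatim and attaches a few symbolic relations alongside it. This is an illustration only and does not reproduce QuaSAR or Logic-of-Thought: the `Augmented` class, the relation names, and the example sentence are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Augmented:
    """A sentence kept verbatim, plus optional symbolic notes (hypothetical format)."""

    text: str
    # Each relation is (relation_name, first_argument, second_argument).
    relations: list[tuple[str, str, str]] = field(default_factory=list)

    def render(self) -> str:
        """Return the original wording followed by any symbolic notes."""
        if not self.relations:
            return self.text
        notes = "; ".join(f"{rel}({a}, {b})" for rel, a, b in self.relations)
        return f"{self.text} [{notes}]"


# Illustrative metaphor and relations (assumptions, not from the cited work).
line = Augmented(
    "Time is a thief that steals our youth.",
    relations=[("maps_to", "time", "thief"), ("takes", "time", "youth")],
)
print(line.render())
# Time is a thief that steals our youth. [maps_to(time, thief); takes(time, youth)]
```

The design choice worth noticing is that nothing is replaced. A fully formalized version would keep only the bracketed relations and lose the wording, while the raw sentence carries no explicit structure. The augmented form carries both.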

The thing you might not have expected: chasing metaphor directly is probably the wrong move. The corpus points to metaphor as a stress test for a general capacity — separating meaning from statistical surface form while holding competing readings open — and the most credible levers (selective symbolic scaffolding, creative-reasoning paradigms, closing the explain-versus-apply gap) all aim at that underlying capacity rather than at metaphor itself. Improve decoupled semantics and metaphor improves as a side effect; train on metaphor alone and you mostly teach the model more conventional metaphors to pattern-match.

Sources (9 notes)

Can one model handle all types of figurative language?

The Diplomat dataset (4,177 dialogues) reframes metaphors, idioms, and puns as one pragmatic task: recovering literal meaning from non-literal expression. This framing suggests LLMs need better semantic decoupling ability, not more category-specific training data.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Do language models really understand meaning or just surface frequency?

LLMs show consistent preference for higher-frequency surface forms over semantically equivalent rare paraphrases across math, machine translation, commonsense reasoning, and tool calling. This suggests models track statistical mass from pretraining rather than meaning-recognition as their primary mechanism.

Where does LLM metaphor comprehension actually break down?

LLMs handle conventional, lexicalized metaphors but fail on novel literary metaphors requiring conceptual domain mapping. This degradation reveals a fundamental gap between pattern recognition and genuine semantic mapping.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

Can language models recognize when text is deliberately ambiguous?

AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity—revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.

Can language models learn meaning without engaging the world?

Research shows LLMs learn culturally situated discourse patterns by compressing relational structure from text, demonstrating that fluent language generation requires no external referents or embodied grounding.

Why does partial formalization outperform full symbolic logic?

QuaSAR and Logic-of-Thought both achieve 4-8% accuracy gains by enriching natural language with selective symbolic elements rather than replacing it. Full formalization loses semantic information; pure language lacks structure. Augmentation preserves both.

Can LLMs reason creatively beyond conventional problem-solving?

Research identifies combinational, exploratory, and transformational reasoning as distinct creative modes grounded in cognitive science. Existing LLM reasoning methods address only conventional problem-solving, leaving creative paradigms unaddressed and potentially explaining diversity collapse in ideation.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether LLM metaphor comprehension improves via better decoupled semantics. The question remains open; treat these findings as dated claims (2023–2026) needing re-validation.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026:
• LLMs fail to sustain multiple interpretations simultaneously; GPT-4 disambiguates only ~32% of deliberately ambiguous cases vs. ~90% for humans (2023).
• Models reason through semantic association, not symbolic manipulation; stripping semantic content while preserving logical rules causes performance collapse (2023).
• LLMs systematically prefer high-frequency phrasings over rarer equivalents, tracking statistical mass from pretraining rather than meaning itself (2025).
• "Potemkin" understanding: models explain concepts correctly but fail to apply them; explanation and execution pathways are functionally disconnected (2024).
• Partial symbolic abstraction (selective structure + semantic richness) outperforms both raw language and full formalization (2025).

Anchor papers (verify; mind their dates):
• 2304.14399 (Apr 2023): ambiguity failure in LLMs
• 2305.14825 (May 2023): semantic vs. symbolic reasoning
• 2502.12616 (Feb 2025): quasi-symbolic abstractions
• 2604.02176 (Apr 2026): frequency law in LLMs

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding, ask: have newer models (o1, Claude 4, Llama 3.3+), training methods (constitutional AI, structured RL), or tooling (semantic scaffolding, multi-turn decoupling) since relaxed or overturned it? Especially probe whether multi-step reasoning or agentic loops now sustain competing interpretations better. Separate the durable question (can LLMs truly decouple literal from figurative?) from the perishable limitation (which architecture/training resolved it?).

(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months—especially if newer systems achieve >70% on ambiguity tasks, or if in-context symbolic grounding has become standard.

(3) Propose 2 research questions that ASSUME the regime may have shifted: (a) Can agentic metaphor reasoning (iterative decoupling + symbol grounding loops) unlock novel literary metaphors? (b) Does scaling alone (larger models, longer context) now solve the explain-versus-apply gap for figurative language?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
