Why do LLMs fail at semantic generalization despite grammatical accuracy?

This note explores a paradox the corpus keeps circling: models can be fluent and grammatical on the surface while failing to carry meaning across to new cases. The collection suggests the cause is that 'understanding' and 'producing well-formed text' run on different machinery.

The corpus's answer is fairly blunt: grammatical fluency and semantic competence aren't the same skill, and LLMs largely have the first without the second. Several notes converge on the idea that what looks like grammar is actually surface pattern-matching that holds up until the structure gets hard. LLMs handle simple sentences cleanly but misidentify embedded clauses, recursion, and complex nominals, with performance degrading *predictably* as syntactic depth increases (Why do large language models fail at complex linguistic tasks?; Does LLM grammatical performance decline with structural complexity?). The tell is that the failures track structural complexity, not topic difficulty, a sign the model learned surface heuristics rather than the underlying rules (Where exactly do language models fail at structural language tasks?).
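One way to check the "degrades with depth" claim on your own data is to bucket model judgments by how deeply nested each sentence is. The sketch below is only an illustration: it assumes sentences come with bracketed parses, uses maximum bracket nesting as a rough stand-in for syntactic depth, and the helper names (`bracket_depth`, `accuracy_by_depth`) and the toy records are hypothetical, not taken from any of the cited notes.

```python
from collections import defaultdict


def bracket_depth(parse: str) -> int:
    """Return the maximum nesting depth of a bracketed parse string."""
    depth = max_depth = 0
    for ch in parse:
        if ch == "(":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == ")":
            depth -= 1
    return max_depth


def accuracy_by_depth(records):
    """Group (parse, is_correct) pairs by depth; return {depth: accuracy}."""
    totals = defaultdict(lambda: [0, 0])  # depth -> [correct, total]
    for parse, is_correct in records:
        bucket = totals[bracket_depth(parse)]
        bucket[0] += int(is_correct)
        bucket[1] += 1
    return {d: c / n for d, (c, n) in sorted(totals.items())}


# Made-up placeholder records, NOT results from any benchmark.
toy_records = [
    ("(S (NP she) (VP left))", True),
    ("(S (NP he) (VP slept))", True),
    ("(S (NP I) (VP think (SBAR (S (NP she) (VP left)))))", True),
    ("(S (NP I) (VP know (SBAR (S (NP he) (VP slept)))))", False),
]
print(accuracy_by_depth(toy_records))
```

If the pattern described in the notes holds, accuracy in the deeper buckets should fall off steadily rather than scatter at random. Bracket nesting is a crude proxy; a count of clausal embeddings would be closer to what the notes describe.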

The deeper mechanism shows up when you decouple form from meaning. When semantic content is stripped out of a reasoning task (correct rules supplied in context, but the familiar associations removed), LLM performance collapses (Do large language models reason symbolically or semantically?). Models reason *through* semantic association, not symbolic logic, so they can't manipulate a rule abstractly the way generalization requires. A companion finding sharpens it: models systematically prefer higher-frequency surface forms over rare-but-equivalent paraphrases, across math, translation, and commonsense, meaning the engine is tracking statistical mass from pretraining, not recognizing meaning (Do language models really understand meaning or just surface frequency?). That's exactly the profile that produces grammatical accuracy without semantic transfer: it's good at what's frequent and well-formed, weak at what's novel.
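To make "stripping the semantic content" concrete, here is a minimal sketch of one way to build such a probe: consistently rename the content words of a task to opaque symbols, so the logical form is unchanged but the familiar associations are gone. The function name `abstract_terms` and the example task are hypothetical; the cited note does not specify how its decoupled tasks were constructed.

```python
import re


def abstract_terms(text: str, terms: list[str]) -> tuple[str, dict[str, str]]:
    """Replace each listed content word with an opaque symbol (X1, X2, ...).

    The same word always maps to the same symbol, so the logical
    structure of the task is preserved while its meaning is removed.
    """
    mapping = {term: f"X{i}" for i, term in enumerate(terms, start=1)}
    # Match longer terms first so a short term never clips a longer one.
    alternatives = "|".join(
        re.escape(t) for t in sorted(terms, key=len, reverse=True)
    )
    pattern = re.compile(r"\b(" + alternatives + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text), mapping


task = "Every sparrow is a bird. Every bird is an animal. Is every sparrow an animal?"
rewritten, mapping = abstract_terms(task, ["sparrow", "bird", "animal"])
print(rewritten)  # Every X1 is a X2. Every X2 is an X3. Is every X1 an X3?
```

Comparing a model's answers on `task` and `rewritten` gives a rough measure of how much of its performance depends on the familiar words rather than on the rule itself.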

The most striking note reframes the whole thing as a structural split rather than a knowledge gap. 'Potemkin understanding' describes a triple pattern incompatible with human cognition: a model explains a concept correctly, fails to apply it, then correctly recognizes its own failure (Can LLMs understand concepts they cannot apply?). The explanation pathway and the execution pathway are functionally disconnected. So 'grammatical accuracy despite semantic failure' isn't a paradox at all once you accept that articulating a concept and operationalizing it draw on separate circuits.

There's a useful predictive frame underneath all of this: treat the model as an autoregressive probability machine, and you can forecast *where* it breaks. Tasks with low-probability target outputs are systematically harder even when they're logically trivial (Can we predict where language models will fail?). Semantic generalization is precisely the case where the right answer may be low-probability under the training distribution, so it's predicted to fail. Two adjacent failures round out the picture: models can't hold multiple interpretations of an ambiguous sentence at once (GPT-4 disambiguates 32% of cases vs. 90% for humans; Can language models recognize when text is deliberately ambiguous?), and they accept false presuppositions even when they demonstrably know the correct fact (Why do language models accept false assumptions they know are wrong?). In both cases surface processing overrides the deeper semantic check.
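The "low-probability target" idea can be illustrated with a toy autoregressive model. The sketch below trains a character bigram model on repetitions of the forward alphabet and scores a sequence as the sum of its conditional log-probabilities. It is a stand-in for intuition only: a bigram counter is not an LLM, the smoothing constant `alpha` and the function names are assumptions, and this is not the method used in the experiments the note refers to.

```python
import math
from collections import Counter


def train_bigram(corpus: str):
    """Count character bigrams and their left contexts in the corpus."""
    pair_counts = Counter(zip(corpus, corpus[1:]))
    context_counts = Counter(corpus[:-1])
    vocab = sorted(set(corpus))
    return pair_counts, context_counts, vocab


def sequence_logprob(seq: str, model, alpha: float = 1.0) -> float:
    """Sum of log p(next | prev) over the sequence, with add-alpha smoothing."""
    pair_counts, context_counts, vocab = model
    total = 0.0
    for prev, nxt in zip(seq, seq[1:]):
        num = pair_counts[(prev, nxt)] + alpha
        den = context_counts[prev] + alpha * len(vocab)
        total += math.log(num / den)
    return total


alphabet = "abcdefghijklmnopqrstuvwxyz"
# Toy "pretraining data": the forward alphabet, repeated.
model = train_bigram((alphabet + " ") * 50)

forward = sequence_logprob(alphabet, model)
backward = sequence_logprob(alphabet[::-1], model)
print(forward > backward)  # True
```

Reciting the alphabet backwards is logically as simple as reciting it forwards, but every backward transition is unseen in the toy training data, so the target sequence gets a far lower score. That is the shape of the prediction: difficulty follows output probability, not logical complexity.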

The takeaway across these notes: the failures aren't random; they're *predictable* from the architecture. Grammar is the cheap, high-frequency signal the model masters; meaning is the expensive, distribution-spanning skill it only imitates. The two come apart exactly where generalization requires them to coincide.

Sources 9 notes

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Does LLM grammatical performance decline with structural complexity?

LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.

Where exactly do language models fail at structural language tasks?

Language models excel with explicit discourse markers and simple grammar but fail predictably on implicit relations, embedded structures, and forward-planning discourse. These breakdowns map to failures in discourse intentionality and attention layers, not just linguistic surface structure.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Do language models really understand meaning or just surface frequency?

LLMs show consistent preference for higher-frequency surface forms over semantically equivalent rare paraphrases across math, machine translation, commonsense reasoning, and tool calling. This suggests models track statistical mass from pretraining rather than meaning-recognition as their primary mechanism.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.

Can language models recognize when text is deliberately ambiguous?

AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity—revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.

Why do language models accept false assumptions they know are wrong?

The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.
