Does LLM grammatical performance decline with structural complexity?

This note explores whether LLMs fail uniformly at grammar or whether their failures follow a predictable pattern tied to input complexity. Understanding that relationship matters for deciding when LLM annotations are reliable.

Synthesis note · 2026-02-21 · sourced from Discourses

The finding from the LLM linguistic blind spots study is not simply "LLMs are bad at grammar." It is more precise: performance degrades as a function of structural complexity. Simple cases (single-clause sentences, surface noun identification) may be handled well. Complex cases (embedded clauses, recursive structures, complex nominals that look like clauses) fail systematically.

This is a useful calibration for practitioners because it makes failures predictable. You can audit task complexity before deciding whether to trust LLM annotation output. If the task involves syntactically simple inputs with explicit structural markers, LLM performance may be acceptable. If inputs contain embedded clauses, recursive modification, or other depth-increasing structures, expect systematic errors.
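One way such an audit could look, as a minimal sketch: it assumes the inputs have already been parsed into Penn Treebank-style bracketed strings by some trusted parser, and it uses the deepest nesting of clause-level nodes as a rough stand-in for structural complexity. The names `clause_depth` and `depth_profile`, and the choice of clause labels, are illustrative assumptions, not anything taken from the study.

```python
import re
from collections import Counter

# Clause-level labels in Penn Treebank-style bracketing (an assumed proxy
# for "structural depth"; other label sets are equally defensible).
CLAUSE_LABELS = {"S", "SBAR", "SBARQ", "SINV", "SQ"}

# Matches either an opening bracket plus its node label, or a closing bracket.
NODE = re.compile(r"\(\s*([^\s()]*)|\)")


def clause_depth(parse):
    """Return the deepest nesting of clause-level nodes in a bracketed parse."""
    stack = []  # one bool per open node: is it clause-level?
    depth = deepest = 0
    for m in NODE.finditer(parse):
        if m.group(0) == ")":
            if stack and stack.pop():
                depth -= 1
            continue
        label = m.group(1).split("-")[0]  # drop function tags such as SBAR-TMP
        is_clause = label in CLAUSE_LABELS
        stack.append(is_clause)
        if is_clause:
            depth += 1
            deepest = max(deepest, depth)
    return deepest


def depth_profile(parses):
    """Count how many inputs fall at each clause depth."""
    return Counter(clause_depth(p) for p in parses)


SIMPLE = "(S (NP (DT The) (NN dog)) (VP (VBD barked)))"
EMBEDDED = (
    "(S (NP (PRP She)) (VP (VBD said) "
    "(SBAR (IN that) (S (NP (PRP he)) (VP (VBD left))))))"
)

if __name__ == "__main__":
    print(depth_profile([SIMPLE, EMBEDDED, SIMPLE]))
```

A profile dominated by depth 1 suggests the simple regime described above; a long tail of deeper inputs is where systematic errors should be expected. The parses themselves need to come from a source more reliable than the model being audited.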

The inverse correlation between structural complexity and performance also has theoretical significance: it suggests that what LLMs learned from training data is more like a frequency-weighted surface heuristic than a recursive structural grammar. Complex structures are rare in training corpora, so the heuristics generalize poorly to them. The model can get the easy cases right without having internalized the underlying rule.

The practical design implication: for any application where structural correctness matters, build complexity-stratified evaluation sets. Testing only on typical (simple) inputs overestimates competence. The failure mode is in the structural tail.
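The scoring side of a complexity-stratified evaluation can be sketched as follows. The stratum names, the `stratified_accuracy` helper, and the toy records are all made up to show the arithmetic; they are not results from any study.

```python
from collections import defaultdict


def stratified_accuracy(records):
    """Compute accuracy per complexity stratum.

    records: iterable of (stratum, is_correct) pairs.
    """
    totals = defaultdict(lambda: [0, 0])  # stratum -> [correct, seen]
    for stratum, is_correct in records:
        totals[stratum][0] += int(is_correct)
        totals[stratum][1] += 1
    return {s: c / n for s, (c, n) in sorted(totals.items())}


def overall_accuracy(records):
    """Compute the single aggregate number that hides the tail."""
    records = list(records)
    return sum(int(ok) for _, ok in records) / len(records)


# Toy, hand-written records (illustrative only): a test set dominated by
# simple inputs, with a small structurally complex tail.
TOY = (
    [("simple", True)] * 9
    + [("simple", False)]
    + [("embedded", True)]
    + [("embedded", False)]
)

if __name__ == "__main__":
    print("overall:", overall_accuracy(TOY))
    for stratum, acc in stratified_accuracy(TOY).items():
        print(stratum, acc)
```

Because the aggregate is weighted by how often each stratum appears, a test set made mostly of simple inputs can report a high overall score while the complex stratum sits far lower. Reporting the per-stratum numbers side by side makes that gap visible.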

Entailment reasoning extends this pattern to a new domain. The related note "Why do embedding contexts confuse LLM entailment predictions?" identifies a specific type of structural complexity: when premises are embedded under presupposition triggers (factive verbs, temporal clauses) or under non-factive verbs, LLMs cannot discriminate the opposite effects these contexts should produce. The structural packaging overwhelms the semantic content. This is a direct instantiation of the complexity-degradation pattern: embedding contexts add structural depth, and LLMs respond to the embedding verb as a surface cue rather than computing its effect on the entailment relations of the embedded content.
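The factive versus non-factive contrast lends itself to minimal-pair probes. The sketch below is one hypothetical way to generate them: the verb lists are small illustrative samples, `build_probes` is an invented helper, and the expected labels follow the standard analysis in which "X knows that P" commits the speaker to P while "X believes that P" does not.

```python
# Small illustrative verb samples (assumptions, not an exhaustive inventory).
FACTIVE = ("knows", "realizes", "regrets")
NON_FACTIVE = ("believes", "thinks", "suspects")


def _sentence(text):
    """Capitalize the first letter and add a final period."""
    return text[0].upper() + text[1:] + "."


def build_probes(subject, proposition):
    """Build premise/hypothesis pairs differing only in the embedding verb."""
    probes = []
    for verbs, expected in ((FACTIVE, "entailment"), (NON_FACTIVE, "neutral")):
        for verb in verbs:
            probes.append(
                {
                    "premise": _sentence(f"{subject} {verb} that {proposition}"),
                    "hypothesis": _sentence(proposition),
                    "verb": verb,
                    "expected": expected,
                }
            )
    return probes


if __name__ == "__main__":
    for probe in build_probes("Maria", "the bridge collapsed"):
        print(probe["expected"], "|", probe["premise"], "=>", probe["hypothesis"])
```

Since each pair shares the same embedded clause, a model that gives the same label to the factive and non-factive variants is responding to the clause content or the surface frame rather than to the effect of the embedding verb.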

Related concepts in this collection 3

Why do large language models fail at complex linguistic tasks?
Explores whether LLMs have inherent limitations in detecting fine-grained syntactic structures, especially embedded clauses and recursive patterns, and whether these failures are systematic rather than random.
the broader finding this belongs to

Can models pass tests while missing the actual grammar?
Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.
the BabyLM parallel: surface heuristics pass easy tests while deeper rules are absent

Why do embedding contexts confuse LLM entailment predictions?
Can language models distinguish between contexts that preserve versus cancel entailments? The study explores whether LLMs systematically fail to apply the semantic rules governing presupposition triggers and non-factive verbs.
embedding contexts as a specific structural complexity type in entailment; surface cue response substitutes for semantic computation
