Why do large language models fail at complex linguistic tasks?

Explores whether LLMs have inherent limitations in detecting fine-grained syntactic structures, especially embedded clauses and recursive patterns, and whether these failures are systematic rather than random.

Synthesis note · 2026-02-21 · sourced from Discourses

LLMs demonstrate "limited efficacy" on fine-grained linguistic annotation tasks, and the failures are not random: they are systematic, and they worsen as the structural complexity of the input increases.

The specific errors documented in Llama3-70b (one of the most capable models tested):

Misidentifying embedded clauses
Failing to recognize verb phrases
Confusing complex nominals with clauses

The research examined three questions: (1) accuracy on complex linguistic structure detection, (2) which structures are LLM blind spots, (3) how performance varies with linguistic complexity. The answers: accuracy is notably limited, complex syntactic structures (especially embedded/recursive ones) are the consistent blind spots, and performance degrades predictably with structural depth.
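
The note does not say how structural depth was measured, so the following is only a minimal sketch of one plausible proxy: the maximum nesting of subordinate clauses in a bracketed constituency parse. The function name `embedding_depth`, the Penn Treebank-style bracketing, and the choice of `SBAR` as the clause label are all assumptions made for illustration, not details taken from the research.

```python
def embedding_depth(bracketed, clause_labels=("SBAR",)):
    """Return the maximum nesting depth of clause nodes in a bracketed parse.

    Assumes a Penn Treebank-style string such as "(S (NP the dog) (VP barked))".
    Both the input format and the default label set are illustrative assumptions.
    """
    # Pad parentheses so they become separate tokens when splitting.
    tokens = bracketed.replace("(", " ( ").replace(")", " ) ").split()
    open_nodes = []   # one bool per open node: True if it is a clause node
    depth = 0         # clause nodes currently open
    max_depth = 0
    for i, tok in enumerate(tokens):
        if tok == "(":
            # The node label is the token right after the opening parenthesis.
            label = tokens[i + 1] if i + 1 < len(tokens) else ""
            is_clause = label in clause_labels
            open_nodes.append(is_clause)
            if is_clause:
                depth += 1
                max_depth = max(max_depth, depth)
        elif tok == ")":
            # Closing a clause node reduces the current clause depth.
            if open_nodes and open_nodes.pop():
                depth -= 1
    return max_depth
```

Under these assumptions, a simple sentence with no subordinate clause scores 0, and each further level of embedding ("I think that she knows that he left") adds one. Any other depth measure, such as dependency-tree height, could be substituted without changing the analysis.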

This matters because it reveals where statistical language learning diverges from grammatical competence. LLMs trained on vast corpora learn strong surface-level patterns, but the patterns do not reliably encode the deep structural rules that govern syntax. The model knows that a sentence has a verb, but cannot reliably identify the verb phrase when the structural context is complex.

The implication for LLM deployment in NLP pipelines: any application relying on fine-grained linguistic annotation — parsing, dependency analysis, argument structure detection — cannot treat LLMs as structurally reliable without auditing their performance on complex inputs. The failures are not edge cases; they are structurally determined by input complexity.
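
One way to run such an audit is to bucket the model's annotations by a structural-depth score and compare accuracy across buckets. The sketch below assumes each record is a hypothetical `(depth, gold_label, predicted_label)` triple; the function names and the 0.9 threshold are illustrative choices, not values from the research.

```python
from collections import defaultdict


def accuracy_by_depth(records):
    """Compute annotation accuracy per structural-depth bucket.

    `records` is an iterable of (depth, gold_label, predicted_label) triples.
    This record layout is an assumption made for illustration.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for depth, gold, predicted in records:
        totals[depth] += 1
        if gold == predicted:
            correct[depth] += 1
    # Report buckets in increasing depth so a downward trend is easy to read.
    return {d: correct[d] / totals[d] for d in sorted(totals)}


def flag_unreliable(accuracy, threshold=0.9):
    """Return the depth buckets whose accuracy falls below `threshold`.

    The default threshold is arbitrary; choose one that fits the pipeline.
    """
    return [d for d, acc in accuracy.items() if acc < threshold]
```

If the failures are systematic in the way described above, accuracy should fall as depth rises, and `flag_unreliable` gives the depths at which LLM annotations would need human review or a dedicated parser.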

Related concepts in this collection

Does LLM grammatical performance decline with structural complexity?
This explores whether LLMs fail uniformly at grammar or whether their failures follow a predictable pattern tied to input complexity. Understanding the relationship matters for deciding when LLM annotations are reliable.
the specific inverse relationship

What three layers must discourse systems actually track?
Grosz and Sidner's 1986 framework proposes that discourse requires simultaneously tracking linguistic segments, speaker purposes, and salient objects. Understanding why all three are necessary helps explain where current AI systems structurally fail.
the structural competence that LLMs' annotation failures suggest is missing

Why does ChatGPT fail at implicit discourse relations?
ChatGPT excels when discourse connectives are present but drops to 24% accuracy without them. What does this gap reveal about how LLMs actually process meaning and logical relationships?
parallel finding: LLMs rely on surface cues rather than structural understanding