Why do LLMs achieve only 24 percent accuracy on implicit discourse relations?
This explores why LLMs collapse to roughly chance-level accuracy when discourse relations (cause, contrast, elaboration) must be inferred rather than read off explicit connective words like 'because' or 'but.'
The short answer the corpus offers is that LLMs hit only ~24% accuracy on *implicit* discourse relations because they were never doing discourse reasoning in the first place. The headline finding is that LLM discourse competence is asymmetric: ChatGPT handles explicit relations well but drops to 24.54% accuracy when the connective is removed (see: Why does ChatGPT fail at implicit discourse relations?). The connective word ('because,' 'although,' 'so') was doing the work. Strip it away and the model has to infer the relationship from the semantic content of the two clauses, which is the thing it cannot actually do. The accuracy doesn't drop because the task got slightly harder; it drops because the surface signal the model was relying on disappeared.
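The connective-removal manipulation described above can be sketched in a few lines. This is a minimal illustration, not the procedure used by any cited benchmark: the connective list, the `make_variants` and `has_surface_cue` helpers, and the example sentence are all hypothetical.

```python
import re

# Hypothetical, deliberately tiny connective inventory (an assumption for
# illustration; real inventories are far larger).
CONNECTIVES = ("because", "although", "but", "so")


def make_variants(arg1: str, connective: str, arg2: str) -> dict:
    """Build the explicit and implicit versions of one clause pair."""
    explicit = f"{arg1}, {connective} {arg2}."
    # Implicit version: same two clauses, connective removed, and the second
    # clause promoted to its own sentence.
    implicit = f"{arg1}. {arg2[0].upper()}{arg2[1:]}."
    return {"explicit": explicit, "implicit": implicit}


def has_surface_cue(text: str) -> bool:
    """Return True if any listed connective appears as a whole word."""
    pattern = r"\b(" + "|".join(map(re.escape, CONNECTIVES)) + r")\b"
    return re.search(pattern, text.lower()) is not None


pair = make_variants("The match was cancelled", "because", "the pitch was flooded")
print(pair["explicit"])   # clause pair with the connective
print(pair["implicit"])   # same clauses, cue removed
print(has_surface_cue(pair["explicit"]), has_surface_cue(pair["implicit"]))
```

Both variants carry the same causal relation; only the explicit one exposes a lexical cue that a classifier could latch onto without reasoning about the clauses themselves.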
The same pattern shows up across the whole linguistic-competence cluster, which is what makes this more than a one-benchmark quirk. LLMs make systematic errors that worsen predictably as structure deepens: they misidentify embedded clauses and complex nominals, and performance degrades as syntactic depth increases (see: Why do large language models fail at complex linguistic tasks?; Does LLM grammatical performance decline with structural complexity?). The unifying diagnosis is that statistical learning captures surface patterns, not deep grammatical or relational rules. A broader map of these failures places implicit discourse alongside embedded structures and forward-planning discourse, locating the breakdown in discourse intentionality and attention layers rather than mere vocabulary (see: Where exactly do language models fail at structural language tasks?). Implicit relations are simply the case where the surface crutch is most cleanly removed, so the gap is most visible.
Here's the thing you might not expect: this same explicit-vs-implicit split predicts which *reasoning* tasks LLMs are good at. Causal reasoning beats temporal reasoning in LLMs for exactly the same reason: causal connectives are explicit and frequent in training text, while temporal order is usually left implicit and must be inferred from context (see: Why do LLMs handle causal reasoning better than temporal reasoning?). So the 24% number isn't really about discourse parsing as a niche linguistics task. It's a clean instance of a general pattern: wherever the answer is lexically marked on the surface, LLMs look competent; wherever it has to be inferred, they fall toward chance.
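To make "fall toward chance" concrete, accuracy can be reported per condition next to a uniform-guess baseline. The sketch below is illustrative only: the records are invented toy data, not results from any benchmark, and the function names are hypothetical. Whether 24.54% counts as near chance depends on the number of relation classes, which the text here does not specify; with four equally likely classes the uniform baseline would be 25%.

```python
from collections import defaultdict


def accuracy_by_condition(records):
    """records: iterable of (condition, gold_label, predicted_label) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for condition, gold, pred in records:
        total[condition] += 1
        correct[condition] += int(gold == pred)
    return {c: correct[c] / total[c] for c in total}


def chance_baseline(num_classes: int) -> float:
    """Expected accuracy of uniform random guessing over num_classes labels."""
    return 1.0 / num_classes


# Invented toy records (assumption: purely illustrative, not measured data).
records = [
    ("explicit", "cause", "cause"),
    ("explicit", "contrast", "contrast"),
    ("explicit", "elaboration", "elaboration"),
    ("explicit", "cause", "contrast"),
    ("implicit", "cause", "elaboration"),
    ("implicit", "contrast", "elaboration"),
    ("implicit", "elaboration", "elaboration"),
    ("implicit", "cause", "elaboration"),
]

scores = accuracy_by_condition(records)
baseline = chance_baseline(4)  # assumed four-way label set
for condition, acc in scores.items():
    print(f"{condition}: accuracy={acc:.2f}, gap over chance={acc - baseline:+.2f}")
```

Reporting the gap over chance, rather than raw accuracy alone, keeps the explicit and implicit conditions comparable when label sets differ in size.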
If you want to go further into where this 'surface signal substitutes for understanding' story leads, the corpus has adjacent failure modes worth pulling on. LLMs can't hold multiple interpretations of genuinely ambiguous text at once: GPT-4 disambiguates only 32% of cases versus 90% for humans (see: Can language models recognize when text is deliberately ambiguous?), which is the same inability to do inference that isn't anchored to an explicit cue. And the limits aren't always about missing knowledge: in conversation, models fail to update shared common ground or reject false presuppositions even when they *know* the right answer (see: Can LLMs truly update shared conversational common ground?; Why do language models accept false assumptions they know are wrong?). Taken together, the lesson is that fluency over explicit signals is a poor proxy for the structural and inferential competence we assume it implies, and benchmarks built on explicit cues will keep hiding that gap.
Sources (8 notes)
ChatGPT performs well on explicit discourse relations with connectives but achieves only 24.54% accuracy on implicit relations without them. This asymmetry reveals that LLMs rely on surface signals rather than inferring meaning from semantic content.
Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.
Language models excel with explicit discourse markers and simple grammar but fail predictably on implicit relations, embedded structures, and forward-planning discourse. These breakdowns map to failures in discourse intentionality and attention layers, not just linguistic surface structure.
ChatGPT excels at causal relations but struggles with temporal ordering because causal connectives are explicit and frequent in training data, while temporal order is often implicit and must be inferred contextually.
The AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity, revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.
LLMs interpret all subsequent conversational turns within a fixed initial prompt frame, preventing them from symmetrically proposing updates to shared assumptions. Even when users pivot topics or contradict earlier framings, the model cannot absorb revisions into jointly held background, making the user the sole maintainer of the conversational scoreboard.
The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.