Why do LLMs achieve only 24 percent accuracy on implicit discourse relations?

This explores why LLMs collapse to roughly chance-level accuracy when discourse relations (cause, contrast, elaboration) must be inferred rather than read off explicit connective words like 'because' or 'but.'

The short answer the corpus offers for the ~24% accuracy on *implicit* discourse relations is that the models were never doing discourse reasoning in the first place. The headline finding is that LLM discourse competence is asymmetric: ChatGPT handles explicit relations well but craters to 24.54% when the connective is removed (see: Why does ChatGPT fail at implicit discourse relations?). The connective word ('because,' 'although,' 'so') was doing the work. Strip it away and the model has to infer the relationship from the semantic content of the two clauses, and that is the thing it can't actually do. The accuracy doesn't drop because the task got slightly harder; it drops because the surface signal the model was relying on disappeared.
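The explicit-versus-implicit comparison described above can be sketched as a small ablation: score each item once with its connective and once with the connective stripped. This is only an illustrative sketch, not the evaluation protocol behind the 24.54% figure. The connective list, the four example sentences, and the `keyword_baseline` stand-in (a toy classifier that reads the relation off the connective alone) are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny connective inventory (assumption for this sketch).
CONNECTIVES = ("because", "although", "but", "so")

# Hypothetical hand-written items; not drawn from any benchmark.
ITEMS = [
    {"text": "The match was cancelled because it rained.", "label": "cause"},
    {"text": "She studied hard, so she passed.", "label": "cause"},
    {"text": "He is rich, but he is unhappy.", "label": "contrast"},
    {"text": "Although it rained, the match went ahead.", "label": "contrast"},
]


def strip_connective(text):
    """Remove the first listed connective (whole word) and tidy whitespace."""
    pattern = r"\b(" + "|".join(CONNECTIVES) + r")\b,?\s*"
    stripped = re.sub(pattern, "", text, count=1, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", stripped).strip()


def keyword_baseline(text):
    """Toy stand-in for a model: uses only the surface connective."""
    lowered = text.lower()
    if re.search(r"\b(because|so)\b", lowered):
        return "cause"
    if re.search(r"\b(although|but)\b", lowered):
        return "contrast"
    return "elaboration"  # fallback guess when no cue is present


def accuracy_by_condition(items, predict):
    """Score every item with its connective (explicit) and without it (implicit)."""
    correct = defaultdict(int)
    for item in items:
        variants = {
            "explicit": item["text"],
            "implicit": strip_connective(item["text"]),
        }
        for condition, text in variants.items():
            correct[condition] += int(predict(text) == item["label"])
    return {c: correct[c] / len(items) for c in ("explicit", "implicit")}


print(accuracy_by_condition(ITEMS, keyword_baseline))
# {'explicit': 1.0, 'implicit': 0.0}
```

A classifier that depends entirely on the connective is perfect in the explicit condition and fails once the cue is gone. Swapping `keyword_baseline` for a call to a real model would show how much of that model's explicit-condition accuracy survives the same ablation.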

The same pattern shows up across the whole linguistic-competence cluster, which is what makes this more than a one-benchmark quirk. LLMs make systematic errors that worsen predictably as structure deepens: they misidentify embedded clauses and complex nominals, and performance degrades as syntactic depth increases (see: Why do large language models fail at complex linguistic tasks?; Does LLM grammatical performance decline with structural complexity?). The unifying diagnosis is that statistical learning captures surface patterns, not deep grammatical or relational rules. A broader map of these failures places implicit discourse alongside embedded structures and forward-planning discourse, locating the breakdown in discourse intentionality and attention layers rather than mere vocabulary (see: Where exactly do language models fail at structural language tasks?). Implicit relations are simply the case where the surface crutch is most cleanly removed, so the gap is most visible.

Here's the thing you might not expect: this same explicit-vs-implicit split predicts which *reasoning* tasks LLMs are good at. Causal reasoning beats temporal reasoning in LLMs for the same reason: causal connectives are explicit and frequent in training text, while temporal order is usually left implicit and must be inferred from context (see: Why do LLMs handle causal reasoning better than temporal reasoning?). So the 24% number isn't really about discourse parsing as a niche linguistics task. It's a clean instance of a general pattern: wherever the answer is lexically marked on the surface, LLMs look competent; wherever it has to be inferred, they fall toward chance.

If you want to go further into where this 'surface signal substitutes for understanding' story leads, the corpus has adjacent failure modes worth pulling on. LLMs can't hold multiple interpretations of genuinely ambiguous text at once: GPT-4 disambiguates only 32% of cases versus 90% for humans (see: Can language models recognize when text is deliberately ambiguous?). That is the same inability to do inference that isn't anchored to an explicit cue. And the limits aren't always about missing knowledge: in conversation, models fail to update shared common ground or reject false presuppositions even when they *know* the right answer (see: Can LLMs truly update shared conversational common ground?; Why do language models accept false assumptions they know are wrong?). Taken together, the lesson is that fluency over explicit signals is a poor proxy for the structural and inferential competence we assume it implies, and benchmarks built on explicit cues will keep hiding that gap.

Sources (8 notes)

Why does ChatGPT fail at implicit discourse relations?

ChatGPT performs well on explicit discourse relations with connectives but achieves only 24.54% accuracy on implicit relations without them. This asymmetry reveals that LLMs rely on surface signals rather than inferring meaning from semantic content.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Does LLM grammatical performance decline with structural complexity?

LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.

Where exactly do language models fail at structural language tasks?

Language models excel with explicit discourse markers and simple grammar but fail predictably on implicit relations, embedded structures, and forward-planning discourse. These breakdowns map to failures in discourse intentionality and attention layers, not just linguistic surface structure.

Why do LLMs handle causal reasoning better than temporal reasoning?

ChatGPT excels at causal relations but struggles with temporal ordering because causal connectives are explicit and frequent in training data, while temporal order is often implicit and must be inferred contextually.

Can language models recognize when text is deliberately ambiguous?

AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity—revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.

Can LLMs truly update shared conversational common ground?

LLMs interpret all subsequent conversational turns within a fixed initial prompt frame, preventing them from symmetrically proposing updates to shared assumptions. Even when users pivot topics or contradict earlier framings, the model cannot absorb revisions into jointly held background—making the user the sole maintainer of conversational scoreboard.

Why do language models accept false assumptions they know are wrong?

The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst tracking LLM discourse competence. The question: why do LLMs achieve only ~24% accuracy on implicit discourse relations, and does that constraint still hold?

What a curated library found — and when (findings span 2023–2025; these are dated claims, not current truth):
• LLM discourse competence is asymmetric: explicit connectives ('because,' 'although') enable near-human performance; removing them drops accuracy to ~24.54% because models rely on surface lexical signals, not semantic inference (2024).
• Performance degrades predictably as syntactic structural complexity increases; LLMs misidentify embedded clauses and fail at forward-planning discourse intentionality (2025).
• The same explicit-vs-implicit split predicts reasoning tasks: causal reasoning (explicit connectives in training) beats temporal reasoning (implicit, context-dependent) in LLMs (2025).
• LLMs disambiguate only 32% of genuinely ambiguous text versus 90% for humans; they cannot hold multiple interpretations or update shared common ground in conversation (2024–2025).
• Models fail to reject false presuppositions even when the correct knowledge is present (2025).

Anchor papers (verify; mind their dates):
• arXiv:2310.17591 (Oct 2023) — Lil-Bevo: humanlike training strategies
• arXiv:2503.19260 (Mar 2025) — Linguistic Blind Spots of Large Language Models
• arXiv:2502.10215 (Feb 2025) — Do Large Language Models Reason Causally Like Us?
• arXiv:2505.22354 (May 2025) — LLMs Struggle to Reject False Presuppositions

Your task:
(1) RE-TEST THE SURFACE-SIGNAL HYPOTHESIS. For each finding above, determine whether newer training regimens (instruction-tuning variants, chain-of-thought scaffolding, multi-token reasoning), architectural changes (sparse attention, memory augmentation, explicit discourse layers), or eval harnesses have since relaxed the 24% floor on implicit relations or the asymmetry between explicit and implicit. Separate: Is the durable question "do LLMs reason about discourse structure without surface cues?" still open? Has the perishable limitation (24% accuracy) been overcome?
(2) Surface the strongest CONTRADICTING work from the last 6 months — papers claiming LLMs *do* infer implicit relations, or showing the explicit-vs-implicit gap is narrowing or already closed in specialized models.
(3) Propose 2 research questions that assume the regime may have shifted: e.g., "If newer models close the implicit-relation gap via architectural or training innovation, does that capability transfer to ambiguity and false-presupposition rejection?" and "Does the explicit-vs-implicit split still predict causal vs. temporal reasoning in models trained with discourse-aware objectives?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
