Why do language models fail at implicit discourse relations while handling explicit connectives?

This explores why LLMs handle discourse relations marked by words like 'because' or 'however' but collapse when the same relationship is left unstated — and what that gap reveals about how they actually process meaning.

The corpus points to a single root cause: these models lean on surface signals in the text rather than inferring meaning from semantic content. The sharpest evidence is direct. ChatGPT performs well on explicit discourse relations but drops to 24.54% accuracy on implicit ones, where no connective is present to lean on (see: Why does ChatGPT fail at implicit discourse relations?). When the cue word is there, the model looks competent; remove it and ask the model to infer the relationship, and the competence largely evaporates. The asymmetry isn't about discourse being 'hard'; it's about whether the answer is sitting in the surface form.
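To make the asymmetry concrete, here is a minimal sketch of a classifier that reads only the surface cue, scored separately on explicit and implicit pairs. Everything in it is an illustrative assumption: the cue lexicon `CUES`, the toy sentence pairs, the coarse labels, and the function names are hypothetical, and none of it reproduces the evaluation behind the 24.54% figure. It shows only how a system that depends on connectives can look perfect when the connective is present and have nothing to go on when it is absent.

```python
from collections import defaultdict

# Hypothetical cue lexicon (an assumption for illustration):
# maps a connective to a coarse discourse relation label.
CUES = {
    "because": "cause",
    "therefore": "cause",
    "however": "contrast",
    "but": "contrast",
}

def cue_only_classifier(arg1: str, arg2: str) -> str:
    """Predict a relation purely from a connective in the second argument.

    arg1 is ignored on purpose: this baseline never looks at meaning.
    """
    for token in arg2.lower().replace(",", " ").split():
        if token in CUES:
            return CUES[token]
    return "unknown"  # no surface cue, so nothing to match against

# Toy data (invented for this sketch): (arg1, arg2, gold label, relation type).
EXAMPLES = [
    ("The match was cancelled", "because it rained all day", "cause", "explicit"),
    ("She trained for months", "however she lost the race", "contrast", "explicit"),
    ("The match was cancelled", "it rained all day", "cause", "implicit"),
    ("She trained for months", "she lost the race", "contrast", "implicit"),
    ("He likes tea", "his sister prefers coffee", "contrast", "implicit"),
]

def accuracy_by_type(examples, classify):
    """Return accuracy computed separately for each relation type."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for arg1, arg2, gold, kind in examples:
        totals[kind] += 1
        hits[kind] += int(classify(arg1, arg2) == gold)
    return {kind: hits[kind] / totals[kind] for kind in totals}

if __name__ == "__main__":
    print(accuracy_by_type(EXAMPLES, cue_only_classifier))
```

The implicit pairs carry the same relations as the explicit ones; only the connective has been removed. A real LLM is far less brittle than this lookup table, so the sketch illustrates the failure mode in its purest form and is not a model of how ChatGPT works.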

The same pattern shows up in a neighboring domain, which is what makes it feel like a structural fact rather than a quirk. LLMs reason about cause better than time, and for the same reason: causal connectives ('because', 'therefore') are explicit and frequent in training text, while temporal ordering is usually left implicit and must be reconstructed from context (see: Why do LLMs handle causal reasoning better than temporal reasoning?). Across both cases the rule is the same: wherever a relationship is signposted by a frequent surface marker, models excel; wherever it has to be inferred, they stumble. This generalizes into a broader map of where LLMs break: they handle explicit discourse markers and simple grammar but fail predictably on implicit relations, embedded structures, and forward-planning discourse, with the breakdowns tracing back to discourse intentionality and attention layers, not just surface vocabulary (see: Where exactly do language models fail at structural language tasks?). And those failures worsen in a predictable way as syntactic depth increases: statistical learning captures surface patterns but not deep grammatical rules (see: Why do large language models fail at complex linguistic tasks?).

There's a deeper framing worth pulling in. One line of work argues LLMs operationalize Saussure's notion of *langue*: they learn meaning purely by compressing relational structure from text, with no external referent or grounding (see: Can language models learn meaning without engaging the world?). That's exactly why explicit connectives are a crutch and implicit relations are a cliff: a connective is itself a piece of surface structure the model can compress and predict, whereas an implicit relation requires inferring something that was never written down. A purely relational, text-internal system is well-equipped for the former and structurally underpowered for the latter.

What's surprising, and where this stops being a story about deficits, is that the failure may be about how the model is *prompted to compute*, not what it can compute. When o1 is pushed to reason step by step, it can build syntactic trees and produce genuine metalinguistic analysis, going well beyond surface behavioral tasks (see: Can language models actually analyze language structure?). The implicit-relation failure looks like a failure of fast, single-pass pattern-matching; explicit reasoning can recover some of the structural inference that the default mode skips. That dovetails with the finding that reasoning failures are often driven by instance-level unfamiliarity rather than true complexity: models fit patterns from similar instances rather than running a general algorithm (see: Do language models fail at reasoning due to complexity or novelty?). An implicit discourse relation is precisely the case with no surface pattern to match against.

So the answer the corpus leaves you with is more interesting than 'models are bad at hard tasks.' Explicit connectives let a relational, surface-compressing system shortcut the work; implicit relations force the inference the system normally avoids — and that gap can be partly narrowed by forcing the model to reason explicitly rather than predict in one pass. The thing you didn't know you wanted to know: the same mechanism that makes 'because' easy is what makes silence hard.

Sources (7 notes)

Why does ChatGPT fail at implicit discourse relations?

ChatGPT performs well on explicit discourse relations with connectives but achieves only 24.54% accuracy on implicit relations without them. This asymmetry reveals that LLMs rely on surface signals rather than inferring meaning from semantic content.

Why do LLMs handle causal reasoning better than temporal reasoning?

ChatGPT excels at causal relations but struggles with temporal ordering because causal connectives are explicit and frequent in training data, while temporal order is often implicit and must be inferred contextually.

Where exactly do language models fail at structural language tasks?

Language models excel with explicit discourse markers and simple grammar but fail predictably on implicit relations, embedded structures, and forward-planning discourse. These breakdowns map to failures in discourse intentionality and attention layers, not just linguistic surface structure.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Can language models learn meaning without engaging the world?

Research shows LLMs learn culturally situated discourse patterns by compressing relational structure from text, demonstrating that fluent language generation requires no external referents or embodied grounding.

Can language models actually analyze language structure?

OpenAI's o1 model successfully constructs syntactic trees and phonological generalizations through explicit step-by-step reasoning, revealing that LLM linguistic capability extends far beyond behavioral language tasks to genuine language analysis.

Do language models fail at reasoning due to complexity or novelty?

Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length can succeed when the model was trained on similar instances.
