Can language models distinguish explicit from implicit discourse relations?
This explores whether language models genuinely understand how sentences connect to each other, or whether they only spot the connection when an explicit word like 'because' or 'however' is sitting there to tip them off.
Can LLMs tell apart discourse relations that are marked by an explicit connective from those left implicit? The corpus has a sharp, almost uncomfortable answer: the model is not so much recognizing the relation as reading the connective, and it largely cannot do the work without one. The most direct evidence is striking: ChatGPT handles explicit relations well but collapses to about 24.5% accuracy on implicit ones, where no connective spells out the link (see: Why does ChatGPT fail at implicit discourse relations?). The takeaway isn't that the model is bad at one task; it's that its apparent competence on the explicit case was riding on a surface signal, not on inferring meaning from what the sentences actually say.
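To make the explicit/implicit comparison concrete, here is a minimal sketch of how such a split accuracy could be computed. It is not the evaluation code behind the 24.5% figure. The record fields (`connective`, `gold`, `predicted`), the relation labels, and the four toy records are all hypothetical, invented purely for illustration.

```python
from collections import defaultdict


def accuracy_by_marking(examples):
    """Return accuracy separately for explicit and implicit examples.

    An example counts as explicit when its (hypothetical) "connective"
    field is non-empty, and as implicit otherwise.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        key = "explicit" if ex["connective"] else "implicit"
        total[key] += 1
        correct[key] += int(ex["predicted"] == ex["gold"])
    return {key: correct[key] / total[key] for key in total}


# Toy records, made up for illustration only (not real model output).
toy = [
    {"connective": "because", "gold": "Cause", "predicted": "Cause"},
    {"connective": "however", "gold": "Contrast", "predicted": "Contrast"},
    {"connective": None, "gold": "Cause", "predicted": "Conjunction"},
    {"connective": None, "gold": "Contrast", "predicted": "Contrast"},
]

scores = accuracy_by_marking(toy)
print(scores)
```

Reporting the two numbers side by side, rather than one pooled accuracy, is what exposes the asymmetry: a pooled score can look respectable while the implicit subset is close to failure.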
The same pattern shows up from a different angle in how models handle causal versus temporal reasoning. Causal links come with frequent, explicit connectives ('because', 'so', 'therefore') that appear constantly in training data, so models do well; temporal order is usually left implicit and must be reconstructed from context, and that is exactly where they stumble (see: Why do LLMs handle causal reasoning better than temporal reasoning?). So the explicit/implicit split isn't unique to discourse relations; it looks like a general fault line. Wherever meaning is signposted by a surface token, models look competent; wherever meaning has to be inferred, their performance drops off.
Why would that be? A clue lies in what training actually rewards. Models are optimized to predict information-bearing text, and the relational, implicit machinery of language (the unstated moves that hold a conversation or an argument together) isn't the kind of thing prediction loss captures well (see: Why don't language models develop conversation maintenance skills?). The same gap appears in pragmatics: ChatGPT fails to adjust scalar implicatures ('some' implying 'not all') to communicative context, because that inference depends on tracking unspoken stakes rather than reading an explicit cue (see: Can language models adapt implicature to conversational context?). Implicit relations, implicit conversational repair, implicit implicature: the recurring weakness is whatever goes unmarked.
There's a hopeful twist worth knowing, though. When models are made to reason explicitly, step by step, some of this changes. OpenAI's o1 can construct syntactic trees and articulate linguistic generalizations through chain-of-thought, which is genuine metalinguistic analysis and not just behavioral performance (see: Can language models actually analyze language structure?). That suggests the implicit-relation failure may be partly a failure to deliberate: forced to externalize the inference, the model can sometimes reconstruct what it skips over when answering off the cuff. The unresolved question is whether that's real understanding or just a more elaborate way of surfacing patterns. Given that models reliably misread embedded clauses and complex structure as syntactic depth grows (see: Why do large language models fail at complex linguistic tasks?), the deeper structural competence may still be missing even when the explicit reasoning looks convincing.
Sources (6 notes)
ChatGPT performs well on explicit discourse relations with connectives but achieves only 24.54% accuracy on implicit relations without them. This asymmetry reveals that LLMs rely on surface signals rather than inferring meaning from semantic content.
ChatGPT excels at causal relations but struggles with temporal ordering because causal connectives are explicit and frequent in training data, while temporal order is often implicit and must be inferred contextually.
Humans keep conversations smooth through implicit techniques like reference repair and topic hand-off, which sustain the relational side of an interaction rather than convey information. Language models don't develop these because training signals reward information prediction, not relational work.
ChatGPT shows no context-sensitivity in computing scalar implicatures across three dimensions: explicit literal-mode instructions, information structure focus, and face-threatening contexts. Humans flexibly modulate these inferences; the model does not, suggesting pragmatic competence requires tracking communicative stakes that LLMs systematically miss.
OpenAI's o1 model successfully constructs syntactic trees and phonological generalizations through explicit step-by-step reasoning, revealing that LLM linguistic capability extends far beyond behavioral language tasks to genuine language analysis.
Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.