Why does ChatGPT fail at implicit discourse relations?
ChatGPT excels when discourse connectives are present but drops to 24% accuracy without them. What does this gap reveal about how LLMs actually process meaning and logical relationships?
The discourse relations paper ("Exploring the Potential of ChatGPT on Sentence Level Relations: A Focus on Temporal, Causal, and Discourse Relations") found a dramatic asymmetry in ChatGPT's discourse understanding:
- Explicit discourse relations (with connectives like "so," "because," "however"): ChatGPT performs well, can recognize most relation types, and in-context learning with label dependence structure helps further
- Implicit discourse relations (no connectives): 24.54% accuracy, 16.20% F1 on 11 second-level relation classes
This is not a small gap. Uniform guessing over 11 classes would score about 9%, so 24.54% accuracy is above chance, yet it still means roughly three out of four implicit relations are misclassified. ChatGPT "cannot understand the abstract sense of each discourse relation and the features from the text" when the surface connectives are absent.
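To put the 24.54% figure in context, it may help to compute a chance baseline. The sketch below assumes labels are guessed uniformly at random over the 11 classes; that is an illustrative assumption, since the note reports neither the class distribution nor a majority-class baseline.

```python
# Illustrative arithmetic only: uniform guessing is an assumption,
# not a baseline reported in the source.
NUM_CLASSES = 11             # second-level relation classes (from the note)
REPORTED_ACCURACY = 0.2454   # implicit-relation accuracy (from the note)


def uniform_chance(num_classes: int) -> float:
    """Expected accuracy when guessing uniformly among num_classes labels."""
    return 1.0 / num_classes


chance = uniform_chance(NUM_CLASSES)
ratio = REPORTED_ACCURACY / chance      # how many times better than chance
error_rate = 1.0 - REPORTED_ACCURACY    # share of relations misclassified

print(f"uniform chance:    {chance:.2%}")
print(f"reported / chance: {ratio:.2f}x")
print(f"error rate:        {error_rate:.2%}")
```

A majority-class baseline could sit well above uniform chance if the label distribution is skewed, so this comparison should be read as a lower bound on how hard the task is to beat trivially.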
The explanation is transparent: LLMs have access to massive training data where connectives are pervasive and reliable signals. When you see "therefore" or "because," the discourse relation is explicit in the surface form. Learning to respond to these signals is straightforward statistical learning. Inferring the same relations without surface signals requires understanding what the two clauses actually mean and what logical relationship holds between them.
This asymmetry shows that what LLMs have learned for discourse relation detection is largely cue-based — they respond to surface signals, not to structural meaning. When the surface cue is removed, the competence collapses.
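A toy illustration of purely cue-based behavior may make the point concrete. The sketch below is a hypothetical connective-lookup classifier, not ChatGPT's mechanism and not the paper's method; the `CONNECTIVE_TO_RELATION` table, its relation labels, and the example sentences are all assumptions chosen for illustration.

```python
import re
from typing import Optional

# Hypothetical cue table (an assumption for illustration):
# surface connective -> relation label.
CONNECTIVE_TO_RELATION = {
    "because": "Contingency.Cause",
    "so": "Contingency.Cause",
    "therefore": "Contingency.Cause",
    "however": "Comparison.Contrast",
    "but": "Comparison.Contrast",
}


def classify_by_cue(text: str) -> Optional[str]:
    """Return the label of the first known connective in text, else None."""
    tokens = re.findall(r"[a-z']+", text.lower())
    for token in tokens:
        if token in CONNECTIVE_TO_RELATION:
            return CONNECTIVE_TO_RELATION[token]
    return None  # no surface cue, so this classifier has nothing to go on


# Same two clauses, with and without the connective.
explicit = "The road was icy, so the bus arrived late."
implicit = "The road was icy. The bus arrived late."

print(classify_by_cue(explicit))
print(classify_by_cue(implicit))
```

The lookup never inspects what the clauses mean, so it labels the explicit pair and returns nothing for the implicit one even though a human reader infers the same causal relation in both. An LLM is far more capable than this lookup, but the failure shape is the same one the note describes.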
This connects directly to "What three layers must discourse systems actually track?": implicit discourse relation detection requires exactly the intentional structure that the linguistic structure alone doesn't carry.
A concrete instance beyond discourse relations: The same explicit/implicit asymmetry surfaces in metaphor extraction. LLMs can identify explicit source-target domain mappings (where the analogy's terms are stated) but fail on the implicit elements human readers routinely infer — e.g., the unstated target concept that completes a proportional analogy where only three of four terms are given. The failure is not specific to discourse-connective tasks; it is the general pattern wherever meaning depends on what is not said.
The literary analysis implication: Poetry and literary prose operate primarily through implicit relations. The connections between images in a poem, the causal logic of a narrative, and the thematic resonance between scenes are rarely marked by explicit connectives. A poet does not write "the rose symbolizes mortality because..." The reader must infer the relation. This means the roughly 24% accuracy on implicit relations is not a peripheral limitation for literary analysis; it is a central one. As "Can LLMs truly understand literary meaning or just mechanics?" argues, the discourse competence asymmetry is one of four converging mechanisms that explain why LLMs can parse literary texts mechanically but cannot interpret them meaningfully.
Inquiring lines that use this note as a source (8)
- Why do LLMs achieve only 24 percent accuracy on implicit discourse relations?
- What makes relational structure sufficient for generating contextually appropriate discourse?
- How do the four discourse relations differ in their connection to anxiety?
- Why do language models fail at implicit discourse relations while handling explicit connectives?
- Why do explicit discourse connectives help LLMs but implicit relations cause failures?
- Can language models distinguish explicit from implicit discourse relations?
- Why do LLMs perform better on explicit discourse connectives than implicit relations?
- Why do explicit discourse connectives work when implicit relations fail?
Related concepts in this collection (4)
- Why do LLMs handle causal reasoning better than temporal reasoning?
  Exploring whether language models perform asymmetrically on different discourse relations and what training data patterns might explain the gap between causal and temporal reasoning abilities.
  Connection: asymmetric competence from training data distribution; parallel finding
- What three layers must discourse systems actually track?
  Grosz and Sidner's 1986 framework proposes that discourse requires simultaneously tracking linguistic segments, speaker purposes, and salient objects. Understanding why all three are necessary helps explain where current AI systems structurally fail.
  Connection: what implicit relations require that surface cues don't provide
- Can models pass tests while missing the actual grammar?
  Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.
  Connection: structural parallel: correct behavior on easy cases from surface heuristics
- Can language models adapt implicature to conversational context?
  Do large language models flexibly modulate scalar implicatures based on information structure, face-threatening situations, and explicit instructions, as humans do? This tests whether pragmatic computation is truly context-sensitive or merely literal.
  Connection: the pragmatic parallel: just as implicit discourse requires inferring unstated relations, scalar implicature requires context-sensitive pragmatic modulation; both fail for the same reason (surface cue dependence)
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Exploring the Potential of ChatGPT on Sentence Level Relations: A Focus on Temporal, Causal, and Discourse Relations
- Do Large Language Models Understand Conversational Implicature -- A case study with a Chinese sitcom
- Turiya at DialAM-2024: Inference Anchoring Theory Based LLM Parsers
- What Makes a Good Natural Language Prompt?
- Discursive Socratic Questioning: Evaluating the Faithfulness of Language Models’ Understanding of Discourse Relations
- Can Large Language Models Understand Context?
- A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity
- Pragmatic Implicature Processing in ChatGPT
Original note title
llm discourse competence is asymmetric: explicit connectives enable performance but implicit relations cause systematic failure