Why do LLMs perform better on explicit discourse connectives than implicit relations?
This explores why LLMs handle discourse relations marked by words like 'because' or 'however' but stumble when the same logical relationship is left unsaid — and what that gap reveals about how these models actually process meaning.
The short answer the corpus keeps circling back to: LLMs read the signpost, not the road. When a connective is present, the model has a surface token to latch onto; when the relationship has to be inferred from the meaning of two clauses, performance drops sharply. One study finds that ChatGPT does well on explicit relations but reaches only 24.54% accuracy on implicit ones, which suggests that its discourse 'competence' is largely pattern-matching on visible cues rather than structural understanding of how ideas connect (see: Why does ChatGPT fail at implicit discourse relations?).
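One way to make the explicit/implicit gap concrete is to strip the connective from an explicit pair and score the same classifier on both versions. The following is a minimal sketch, not the cited study's method: the `CONNECTIVES` mapping, the two example pairs, and `surface_cue_baseline` (a hypothetical stand-in for a model that only reads the connective) are all assumptions made for illustration.

```python
# Illustrative connective -> relation mapping (an assumption, not a standard inventory).
CONNECTIVES = {"because": "cause", "however": "contrast", "then": "temporal"}


def first_word(text: str) -> str:
    """Return the first word, lowercased and without a trailing comma."""
    words = text.split()
    return words[0].lower().strip(",") if words else ""


def strip_connective(arg2: str) -> str:
    """Drop a leading connective so the relation has to be inferred."""
    words = arg2.split()
    if first_word(arg2) in CONNECTIVES:
        words = words[1:]
    return " ".join(words)


def surface_cue_baseline(arg1: str, arg2: str) -> str:
    """Hypothetical classifier that looks only at the connective token."""
    return CONNECTIVES.get(first_word(arg2), "unknown")


def accuracy(examples, classify) -> float:
    """Fraction of (arg1, arg2, gold) triples the classifier labels correctly."""
    examples = list(examples)
    correct = sum(classify(a1, a2) == gold for a1, a2, gold in examples)
    return correct / len(examples)


# Hand-written pairs for illustration only.
explicit = [
    ("The road flooded", "because it rained all night", "cause"),
    ("She trained hard", "however, she lost the race", "contrast"),
]
# Same pairs with the connective removed: the implicit condition.
implicit = [(a1, strip_connective(a2), gold) for a1, a2, gold in explicit]

explicit_acc = accuracy(explicit, surface_cue_baseline)
implicit_acc = accuracy(implicit, surface_cue_baseline)
```

A classifier that depends entirely on the surface token scores perfectly on the explicit set and fails on the implicit one. To evaluate a real model, `surface_cue_baseline` would be swapped for a function that calls it; the size of the drop between the two accuracies is the quantity of interest.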
The same shape shows up when you slice discourse a different way. LLMs are noticeably better at causal reasoning than temporal reasoning, and the proposed explanation is the same mechanism: causal links tend to come with explicit, frequent connectives in training text, while temporal order is usually left implicit and must be reconstructed from context (see: Why do LLMs handle causal reasoning better than temporal reasoning?). So 'explicit vs. implicit' isn't a quirk of one task; it's a fault line that runs through the model's whole relationship with language. Where the signal is on the surface, the model thrives; where it has to be computed, it guesses.
This isn't only a discourse problem; the same failure appears in other forms across the corpus. LLMs treat presupposition triggers and non-factive verbs as surface cues instead of computing their actual effect on what's entailed (see: Why do embedding contexts confuse LLM entailment predictions?), and they'll accept false presuppositions even when they demonstrably know the correct fact (see: Why do language models accept false assumptions they know are wrong?). Grammatical competence degrades predictably as sentences get more structurally complex: embedded clauses and recursion break the model where simple sentences don't (see: Does LLM grammatical performance decline with structural complexity?). In each case the diagnosis is the same: statistical learning captures the visible marker but not the underlying structure it points to.
The deeper framing is that LLMs reason through semantic association, not symbolic manipulation. When you strip the familiar semantic content out of a reasoning task and leave only the logical rules, performance collapses, even with the correct rules in context. The implication is that the model was not running the inference but relying on token associations (see: Do large language models reason symbolically or semantically?). An explicit connective is exactly the kind of high-frequency association the model has memorized; an implicit relation demands the symbolic inference it doesn't actually do. One synthesis maps these breakdowns specifically to discourse intentionality and attention layers, suggesting the gap isn't just about surface vocabulary but about how the architecture allocates attention across a passage (see: Where exactly do language models fail at structural language tasks?).
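To illustrate what "stripping the semantic content" can mean, the sketch below renames every predicate in a small rule set to an opaque token and checks that a symbolic reasoner derives the same conclusions either way. It is an assumption-laden toy, not the procedure from the cited note: `symbolize`, `forward_chain`, and the bird example are hypothetical names and data chosen for illustration.

```python
def symbolize(facts, rules):
    """Rename every predicate to an opaque token (p0, p1, ...), keeping the structure."""
    names = {}

    def rename(pred):
        if pred not in names:
            names[pred] = f"p{len(names)}"
        return names[pred]

    # Sort before renaming so the mapping is deterministic across runs.
    new_facts = {rename(f) for f in sorted(facts)}
    new_rules = [
        (frozenset(rename(p) for p in sorted(body)), rename(head))
        for body, head in rules
    ]
    return new_facts, new_rules, names


def forward_chain(facts, rules):
    """Apply rules (body -> head) repeatedly until no new fact is derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if body <= derived and head not in derived:
                derived.add(head)
                changed = True
    return derived


# Toy task with familiar semantics (illustrative data only).
facts = {"is_bird"}
rules = [
    (frozenset({"is_bird"}), "has_wings"),
    (frozenset({"has_wings"}), "can_fly"),
]

sym_facts, sym_rules, names = symbolize(facts, rules)
semantic_result = forward_chain(facts, rules)
symbolic_result = forward_chain(sym_facts, sym_rules)
```

For a rule-following reasoner the renaming changes nothing: the derived set is the same up to the name mapping. The claim in the paragraph above is that LLM accuracy does not survive this kind of renaming, which is what points to association rather than rule application.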
The less expected finding is that the limitation may be more about training signal than raw capacity. With explicit step-by-step reasoning, a model such as OpenAI's o1 can construct syntactic trees and metalinguistic analyses of the kind models fail at in normal use (see: Can language models actually analyze language structure?), and a related conversational gap, engaging with distractors, narrows significantly after fine-tuning on just 1,080 synthetic dialogues (see: Why do language models engage with conversational distractors?). So the explicit/implicit asymmetry isn't necessarily a hard ceiling. It may be that implicit relations are simply underrepresented as a learnable signal: the model never got enough reason to compute what it could instead just read off the surface.
Sources (9 notes)
ChatGPT performs well on explicit discourse relations with connectives but achieves only 24.54% accuracy on implicit relations without them. This asymmetry reveals that LLMs rely on surface signals rather than inferring meaning from semantic content.
ChatGPT excels at causal relations but struggles with temporal ordering because causal connectives are explicit and frequent in training data, while temporal order is often implicit and must be inferred contextually.
LLMs treat presupposition triggers and non-factive verbs as surface cues rather than computing their opposite semantic effects on entailments. This structural failure persists across prompts and models, suggesting models rely on surface patterns instead of structural analysis.
The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.
LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
Language models excel with explicit discourse markers and simple grammar but fail predictably on implicit relations, embedded structures, and forward-planning discourse. These breakdowns map to failures in discourse intentionality and attention layers, not just linguistic surface structure.
OpenAI's o1 model successfully constructs syntactic trees and phonological generalizations through explicit step-by-step reasoning, revealing that LLM linguistic capability extends far beyond behavioral language tasks to genuine language analysis.
Fine-tuning on just 1,080 synthetic dialogues with distractor turns significantly improves topic resilience, revealing that the gap is not model capacity but absent training signal. Models learn to follow what-to-do instructions but not what-to-ignore instructions.