Why do explicit discourse connectives help LLMs but implicit relations cause failures?

This explores why LLMs do well when relationships between ideas are spelled out by connective words ('because', 'however', 'so') but stumble when those relationships have to be inferred — and what that gap reveals about how the models actually process language.

The short version the corpus keeps arriving at: the models are reading the surface, not the structure. When ChatGPT handles explicit discourse relations, it is leaning on the connective as a visible token. Strip the connective and accuracy collapses to 24.54%, which suggests the competence was never in understanding the relationship, only in recognizing its label (see "Why does ChatGPT fail at implicit discourse relations?").
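The explicit-versus-implicit comparison can be sketched as a small evaluation harness. This is a minimal illustration, not the setup used in the cited work: the connective list, `strip_connective`, and `accuracy_by_condition` are all hypothetical names, and the predictions are supplied by hand rather than produced by a model.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny connective list; real inventories are much larger.
CONNECTIVES = ("because", "although", "however", "then", "so")

# Matches one connective at the start of a clause, plus any trailing comma or spaces.
_LEADING = re.compile(
    r"^\s*(?:" + "|".join(CONNECTIVES) + r")\b[\s,]*", flags=re.IGNORECASE
)


def strip_connective(arg2: str) -> str:
    """Turn an explicit item into an implicit one by dropping a leading connective."""
    stripped = _LEADING.sub("", arg2, count=1)
    # Re-capitalize only when something was actually removed.
    if stripped != arg2 and stripped:
        stripped = stripped[0].upper() + stripped[1:]
    return stripped


def accuracy_by_condition(records):
    """records: iterable of (condition, gold_label, predicted_label) tuples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for condition, gold, pred in records:
        totals[condition] += 1
        hits[condition] += int(gold == pred)
    return {condition: hits[condition] / totals[condition] for condition in totals}
```

Running the same classifier over each item twice, once as written and once after `strip_connective`, and comparing the two entries returned by `accuracy_by_condition` gives the kind of gap described above. The whole-word match matters: a clause starting with "Sophie" is left alone even though it begins with the letters of 'so'.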

The same pattern shows up wherever a cue is present versus absent. LLMs handle causal reasoning better than temporal reasoning for exactly this reason: causal connectives are frequent and explicit in training text, while temporal ordering is usually left for the reader to reconstruct, so the model has no surface signal to grab (see "Why do LLMs handle causal reasoning better than temporal reasoning?"). It's not that 'cause' is conceptually easier than 'before'; it's that one is written down and the other is implied. The broader linguistic-competence work generalizes this into a rule: models excel with explicit markers and simple grammar but break down predictably on implicit relations, embedded clauses, and anything requiring forward planning across a discourse (see "Where exactly do language models fail at structural language tasks?"), with failures becoming more frequent as structural depth increases (see "Why do large language models fail at complex linguistic tasks?" and "Does LLM grammatical performance decline with structural complexity?").

What connects discourse failures to seemingly unrelated bugs is the mechanism underneath. The presupposition research is the clearest tell: models treat presupposition triggers and non-factive verbs as surface cues rather than computing the opposite effects the two have on entailment, a structural blind spot that persists across prompts and models (see "Why do embedding contexts confuse LLM entailment predictions?"). And when you deliberately strip semantic content away from a reasoning task, performance collapses even with the correct rules sitting right there in context, because the models reason through learned token associations, not symbolic manipulation (see "Do large language models reason symbolically or semantically?"). Implicit relations are precisely the case where there's no associative cue to ride: the relationship lives in the structure, and structure is what these models don't represent.
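One way to picture "stripping semantic content away" is to replace every content word in a reasoning item with an opaque symbol while keeping the logical form intact. The sketch below is an assumption about how such a probe could be built, not the procedure from the cited note; `decouple` and its `X1`, `X2`, ... symbols are hypothetical.

```python
import re


def decouple(text: str, content_words: list[str]) -> tuple[str, dict[str, str]]:
    """Replace each listed content word with an opaque symbol, consistently.

    Matching is case-sensitive and whole-word only.
    """
    if not content_words:
        return text, {}
    # The same word always maps to the same symbol, so the logical form survives.
    mapping = {word: f"X{i}" for i, word in enumerate(content_words, start=1)}
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in mapping) + r")\b"
    )
    return pattern.sub(lambda m: mapping[m.group(1)], text), mapping


item = "Every sparrow is a bird. Every bird is an animal. So every sparrow is an animal."
abstract_item, symbols = decouple(item, ["sparrow", "bird", "animal"])
print(abstract_item)
# Every X1 is a X2. Every X2 is an X3. So every X1 is an X3.
```

The two versions are logically identical, so any drop in accuracy on the abstract one would be attributable to the missing word associations rather than to the inference itself.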

Here's the turn worth sitting with: this may not be a defect to be patched but simply what the architecture is. One line of work argues LLMs operationalize Saussure's *langue*: they compress purely relational structure from text with no external referent, learning meaning as patterns of co-occurrence (see "Can language models learn meaning without engaging the world?"). From that angle an explicit connective isn't a hint; it's the actual unit of meaning the model trades in, and the implicit relation was never encoded anywhere the model could find it. So the asymmetry is diagnostic: it shows you the boundary between pattern-matching and genuine inference.

If there's a lever, it's making the implicit explicit. Forcing models to externalize the steps they'd otherwise skip, turning hidden warrants and premises into surface prompting moves as the argumentation-scheme work does, recovers reasoning that ordinary chain-of-thought lets slide past (see "Can structured argument prompts make LLM reasoning more rigorous?"). Which is the same insight read backward: if the model can only work with what's on the surface, the fix is to put more of the relationship on the surface.
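As a rough sketch of what "putting the relationship on the surface" might look like in a prompt, the template below asks for each element of Toulmin's argument model as its own numbered step. The step names follow Toulmin; the instruction wording, `TOULMIN_STEPS`, and `build_prompt` are illustrative assumptions and are not the prompts used by the CQoT work.

```python
# Toulmin's elements as explicit steps; the instruction wording is illustrative only.
TOULMIN_STEPS = [
    ("Claim", "State the conclusion you are arguing for."),
    ("Grounds", "List the facts from the passage that support the claim."),
    ("Warrant", "State the unstated rule that links the grounds to the claim."),
    ("Backing", "Say why that rule should be trusted here."),
    ("Rebuttal", "Name a condition under which the claim would not hold."),
]


def build_prompt(question: str, passage: str) -> str:
    """Build a prompt that forces each normally implicit step onto the surface."""
    lines = [
        f"Passage: {passage}",
        f"Question: {question}",
        "",
        "Answer by filling in every step in order:",
    ]
    for number, (name, instruction) in enumerate(TOULMIN_STEPS, start=1):
        lines.append(f"{number}. {name}: {instruction}")
    lines.append("Final answer:")
    return "\n".join(lines)
```

The warrant step is the one doing the work here: it asks the model to write down exactly the kind of link that an implicit discourse relation leaves unstated.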

Sources (9 notes)

Why does ChatGPT fail at implicit discourse relations?

ChatGPT performs well on explicit discourse relations with connectives but achieves only 24.54% accuracy on implicit relations without them. This asymmetry reveals that LLMs rely on surface signals rather than inferring meaning from semantic content.

Why do LLMs handle causal reasoning better than temporal reasoning?

ChatGPT excels at causal relations but struggles with temporal ordering because causal connectives are explicit and frequent in training data, while temporal order is often implicit and must be inferred contextually.

Where exactly do language models fail at structural language tasks?

Language models excel with explicit discourse markers and simple grammar but fail predictably on implicit relations, embedded structures, and forward-planning discourse. These breakdowns map to failures in discourse intentionality and attention layers, not just linguistic surface structure.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Does LLM grammatical performance decline with structural complexity?

LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.

Why do embedding contexts confuse LLM entailment predictions?

LLMs treat presupposition triggers and non-factive verbs as surface cues rather than computing their opposite semantic effects on entailments. This structural failure persists across prompts and models, suggesting models rely on surface patterns instead of structural analysis.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Can language models learn meaning without engaging the world?

Research shows LLMs learn culturally situated discourse patterns by compressing relational structure from text, demonstrating that fluent language generation requires no external referents or embodied grounding.

Can structured argument prompts make LLM reasoning more rigorous?

Applying Toulmin's argument model as explicit prompting steps (CQoT) improves LLM reasoning by forcing models to identify warrants and backing rather than skipping implicit premises. The method catches failures that standard chain-of-thought prompting allows.
