How do discourse relation types improve dialogue beyond sentence-level semantic matching?


This explores what dialogue systems gain by modeling the *relationships between utterances* — causal links, temporal order, repair moves, topic hand-offs — rather than just matching the meaning of one sentence against another. The corpus doesn't have a single paper that uses the phrase "discourse relation types," but several notes circle the same territory under different names, and together they make a sharp case: most of what holds a conversation together lives *between* sentences, not inside them.

Start with the raw building blocks. LLMs are noticeably better at causal relations than temporal ones, and the reason is telling: causal connectives ("because," "so," "therefore") are explicit and frequent in training text, while temporal order is usually left implicit and has to be inferred (see "Why do LLMs handle causal reasoning better than temporal reasoning?"). That's the discourse-relation lesson in miniature: when the relation between two utterances is marked on the surface, models handle it well; when it's only carried by the structure of the exchange, they stumble. Sentence-level semantic matching never sees that structure at all.
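The contrast between surface-marked and implicit relations can be made concrete with a minimal sketch. It assumes a tiny hand-written connective lexicon (`CONNECTIVES`) and a hypothetical helper `tag_relations`; neither comes from the cited notes, and real connective inventories are much larger and full of ambiguous cues ("so" is not always causal).

```python
import re

# Hypothetical, deliberately tiny connective lexicon (an assumption for
# illustration only; many real cues are ambiguous between relation types).
CONNECTIVES = {
    "causal": {"because", "so", "therefore"},
    "temporal": {"then", "afterwards", "before", "meanwhile"},
}

def tag_relations(turns):
    """Label each adjacent pair of turns by any connective found in the later turn.

    Returns (index_of_earlier_turn, index_of_later_turn, relation) tuples,
    where relation is "implicit" when no connective marks the link.
    """
    tagged = []
    for i in range(1, len(turns)):
        tokens = set(re.findall(r"[a-z]+", turns[i].lower()))
        relation = "implicit"
        for name, cues in CONNECTIVES.items():
            if tokens & cues:  # at least one cue word appears in the later turn
                relation = name
                break
        tagged.append((i - 1, i, relation))
    return tagged

turns = [
    "The server crashed overnight.",
    "So we rolled back the deploy.",
    "The dashboard is green again.",
]
print(tag_relations(turns))
```

The third turn plainly follows from the rollback, yet nothing on its surface says so, and this tagger can only report it as "implicit". That unmarked case is the one the note says models have to infer.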

The deeper payoff is pragmatic rather than semantic. One line of work reframes dialogue understanding as *command generation* instead of intent classification: a turn is treated by what it is trying to do in context, not by what it literally says, which removes the need for intent annotation and handles context naturally (see "Can command generation replace intent classification in dialogue systems?"). A complementary note argues that the glue of conversation (reference repair, topic hand-off) is *social action*, not information encoding, and that models never learn it because training rewards predicting content, not doing relational work (see "Why don't language models develop conversation maintenance skills?"). Both point past semantics: the relation a turn bears to what came before is the thing that matters.
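As a rough illustration of the command-generation idea, the sketch below represents each turn as a list of commands applied to a running dialogue state. The command names (`StartFlow`, `SetSlot`, `CancelFlow`), the `DialogueState` class, and `apply_commands` are all hypothetical and are not Rasa's actual API. In a real system the command lists would come from a model reading the turn in context; here they are written by hand.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical command types, invented for this sketch.
@dataclass(frozen=True)
class StartFlow:
    flow: str

@dataclass(frozen=True)
class SetSlot:
    name: str
    value: str

@dataclass(frozen=True)
class CancelFlow:
    pass

@dataclass
class DialogueState:
    active_flow: Optional[str] = None
    slots: Dict[str, str] = field(default_factory=dict)

def apply_commands(state, commands):
    """Update the dialogue state in place from a list of commands."""
    for cmd in commands:
        if isinstance(cmd, StartFlow):
            state.active_flow = cmd.flow
        elif isinstance(cmd, SetSlot):
            state.slots[cmd.name] = cmd.value  # a later value overwrites an earlier one
        elif isinstance(cmd, CancelFlow):
            state.active_flow = None
            state.slots.clear()
    return state

state = DialogueState()
# "Book me a table for Thursday."
apply_commands(state, [StartFlow("book_table"), SetSlot("day", "Thursday")])
# "Actually, make it Friday." Only meaningful relative to the earlier turn.
apply_commands(state, [SetSlot("day", "Friday")])
print(state)
```

The second utterance has no stable intent label of its own; it is a repair of the previous turn, and expressing it as a command against existing state captures that relation directly.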

This is also where today's models visibly break. LLMs treat the opening prompt as a fixed frame and can't jointly update common ground: when a user pivots or contradicts an earlier framing, the model can't absorb the revision, so the human ends up as the sole scorekeeper (see "Can LLMs truly update shared conversational common ground?"). The proposed fixes are explicitly relational. Collaborative Rational Speech Acts (CRSA) add an information-theoretic layer for tracking *both* speakers' beliefs as understanding moves from partial to shared (see "Can dialogue systems track both speakers' beliefs across turns?"), and multi-turn-aware reward shaping trains models to ask clarifying questions and discover intent over a whole exchange rather than maximizing the quality of the next single reply (see "Why do language models respond passively instead of asking clarifying questions?").
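The difference between scoring the next reply and scoring the whole exchange can be shown with a toy calculation. This is a minimal sketch that uses a plain discounted sum over per-turn rewards. It is not CollabLLM's actual estimator, and the reward numbers, the discount `gamma`, and the function names are invented for illustration.

```python
def single_turn_score(rewards):
    """Score a candidate reply by its immediate reward only."""
    return rewards[0]

def multi_turn_score(rewards, gamma=0.9):
    """Score a candidate reply by the discounted sum of rewards over later turns."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

# Hypothetical per-turn rewards for two candidate replies to an ambiguous request.
# Index 0 is the immediate turn; later entries are assumed follow-up turns.
candidates = {
    "answer_now": [0.6, 0.1, 0.1],
    "ask_clarifying_question": [0.2, 0.9, 0.9],
}

best_single = max(candidates, key=lambda k: single_turn_score(candidates[k]))
best_multi = max(candidates, key=lambda k: multi_turn_score(candidates[k]))
print(best_single, best_multi)
```

Under the single-turn score the immediate answer wins; under the discounted multi-turn score the clarifying question wins, because its payoff arrives in later turns that the single-turn objective never sees.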

The thing you might not have expected: the corpus suggests discourse relations aren't a feature you bolt onto a semantic matcher — they're a *different training objective entirely*. Sentence matching optimizes for "what does this turn mean," while everything above optimizes for "what does this turn do to the shared state between us." That second question is what makes a conversation feel coherent, and it's exactly the one next-token prediction is built to ignore.

Sources (6 notes)

Why do LLMs handle causal reasoning better than temporal reasoning?

ChatGPT excels at causal relations but struggles with temporal ordering because causal connectives are explicit and frequent in training data, while temporal order is often implicit and must be inferred contextually.

Can command generation replace intent classification in dialogue systems?

Rasa's dialogue understanding architecture generates domain-specific commands instead of classifying intents, eliminating annotation requirements, handling context naturally, and scaling without degradation—treating understanding as pragmatics rather than semantics.

Why don't language models develop conversation maintenance skills?

Humans keep conversations smooth through implicit techniques like reference repair and topic hand-off that sustain relational interaction, not convey information. Language models don't develop these because training signals reward information prediction, not relational work.

Can LLMs truly update shared conversational common ground?

LLMs interpret all subsequent conversational turns within a fixed initial prompt frame, preventing them from symmetrically proposing updates to shared assumptions. Even when users pivot topics or contradict earlier framings, the model cannot absorb revisions into jointly held background—making the user the sole maintainer of conversational scoreboard.

Can dialogue systems track both speakers' beliefs across turns?

CRSA integrates rate-distortion theory with RSA to enable bidirectional belief tracking across dialogue turns. Demonstrated on referential games and doctor-patient dialogues, it captures progression from partial to shared understanding, providing the information-theoretic framework that token-level LLM systems lack.

Why do language models respond passively instead of asking clarifying questions?

CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a dialogue systems researcher. The question: Do discourse relation types (causal, temporal, repair, topic hand-off) improve LLM dialogue beyond sentence-level semantic matching — and if so, how?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026. The library identifies these constraints:
• LLMs handle explicit causal connectives well but struggle with implicit temporal order, suggesting that whether a relation is marked on the surface matters more than semantic content alone (~2025).
• Dialogue coherence depends on *social action* (reference repair, topic shifts) that models never learn because training rewards content prediction, not relational work (~2024).
• Models cannot jointly update common ground with users; the human remains sole scorekeeper of revisions and contradictions (~2025).
• Single-turn reward optimization (standard RLHF tuned for immediate helpfulness) systematically neglects multi-turn relational reasoning; proposed fixes include collaborative rational speech acts and multi-turn-aware reward shaping (~2025).
• Intent mismatch across multi-turn sequences causes models to "get lost," losing track of discourse coherence (~2026).

Anchor papers (verify; mind their dates):
• arXiv:2307.16689 (2023) — repair in conversational QA.
• arXiv:2507.14063 (2025) — Collaborative Rational Speech Acts for multi-turn dialogue.
• arXiv:2602.07338 (2026) — intent mismatch in multi-turn conversation.
• arXiv:2508.18167 (2025) — when LLMs should speak (turn-taking via discourse structure).

Your task:
(1) RE-TEST EACH CONSTRAINT. For causal vs. temporal reasoning: do newer models (o1, reasoning-focused LLMs) close the gap, or does it persist? Does explicit discourse relation markup (via fine-tuning, prompting, or memory systems) now measurably improve multi-turn coherence? Test whether collaborative or multi-agent orchestration (e.g., debate, hierarchical prompting) *replaces* the need for relational training, or whether it merely masks the underlying deficit. Name what resolved or still holds.
(2) Surface the strongest *contradicting* or *superseding* work from the last ~6 months — papers claiming discourse relations don't matter, or that semantic matching alone now suffices, or that new architectures (state tracking, memory augmentation) bypass the relational problem entirely.
(3) Propose 2 research questions that assume the regime has moved: (a) Can emergent multi-agent dialogue (where LLMs negotiate discourse structure with each other) bootstrap relational understanding without explicit supervision? (b) Do vision-language or multimodal models, which must tie relations to perceptual referents, handle discourse relations better than text-only LLMs?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
