What semantic failures break dialogue coherence most realistically?
Can we tell apart different types of incoherence by manipulating semantic structure rather than surface text? This matters because text-level evaluations miss the semantic failures that actually occur in dialogue systems.
Dialogue coherence evaluation has relied on text-level manipulations such as shuffling turn order or replacing utterances with ones drawn from external conversations. DEAM demonstrates these are insufficient: classifiers trained on text-level negatives cannot detect AMR-based semantic negatives, whereas classifiers trained on AMR-based negatives can detect text-level ones. Semantic-level incoherence is both harder to detect and more realistic.
DEAM targets four failure modes, each corresponding to a distinct AI dialogue failure:
1. Contradiction: directly or indirectly contradicting previous utterances. Generated by adding negative polarity to a concept or by replacing concepts with antonyms from ConceptNet. A common issue in deployed dialogue systems.
2. Coreference inconsistency: incorrect references to previously mentioned entities. Pronouns play an essential role, because coherence is preserved through correct reference chains. Generated by manipulating argument nodes in AMR graphs.
3. Irrelevancy: utterances unrelated to the dialogue context. The simplest form (random substitution) was already captured by prior work, but AMR-based irrelevancy creates more subtle, natural-sounding deviations.
4. Decreased engagement: a speaker evading questions or failing to provide detail. Prior work ignored this failure mode entirely. In coherent conversations, speakers exchange detailed opinions and ask and answer questions. When one interlocutor becomes evasive or vague, coherence degrades even if individual utterances are grammatically and semantically acceptable.
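The first two manipulations in the list can be sketched on a toy graph. This is a minimal illustration, assuming an AMR is stored as a list of (source, role, target) triples; the hand-written graph, the function names, and the antonym passed in by the caller are all hypothetical and are not taken from DEAM's code. In practice the antonym would come from a ConceptNet lookup, which is not shown here.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (source variable, role, target)

# Hand-written toy AMR for "The boy likes the girl" (illustrative, not parser output).
AMR: List[Triple] = [
    ("l", ":instance", "like-01"),
    ("b", ":instance", "boy"),
    ("g", ":instance", "girl"),
    ("l", ":ARG0", "b"),
    ("l", ":ARG1", "g"),
]


def add_negative_polarity(graph: List[Triple], var: str) -> List[Triple]:
    """Contradiction: attach `:polarity -` to the node `var` (no-op if present)."""
    negated = (var, ":polarity", "-")
    return list(graph) if negated in graph else list(graph) + [negated]


def replace_concept(graph: List[Triple], var: str, antonym: str) -> List[Triple]:
    """Contradiction: swap the concept of `var` for a caller-supplied antonym."""
    return [
        (s, r, antonym) if (s == var and r == ":instance") else (s, r, t)
        for s, r, t in graph
    ]


def redirect_argument(
    graph: List[Triple], source: str, role: str, new_target: str
) -> List[Triple]:
    """Coreference inconsistency: point an argument edge at a different entity."""
    return [
        (s, r, new_target) if (s == source and r == role) else (s, r, t)
        for s, r, t in graph
    ]


negated = add_negative_polarity(AMR, "l")           # roughly "does not like"
swapped = replace_concept(AMR, "l", "dislike-01")   # antonym chosen by hand here
misreferred = redirect_argument(AMR, "l", ":ARG1", "b")  # object now refers to the boy
```

Each function returns a new list and leaves the input graph untouched, so the same source AMR can yield several different negatives.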
The fourth failure mode is the most novel: decreased engagement is not a semantic error but a pragmatic one. The content is acceptable; the communicative effort is insufficient. This connects directly to the grounding problem. As the note "Why do language models sound fluent without grounding?" argues, LLMs may produce responses that are semantically appropriate but pragmatically disengaged, answering without engaging.
The AMR approach works because Abstract Meaning Representation captures semantic structure (named entities, negations, coreferences, modalities) at a level deeper than surface syntax, allowing manipulations that produce natural-sounding but semantically incoherent text. The AMR-to-Text step ensures the negative examples sound realistic rather than obviously broken.
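The data flow described above (parse a turn, manipulate its graph, generate text back) can be sketched as a composition of three pluggable steps. This is only a sketch under stated assumptions: `make_negative` and its parameters are hypothetical names, and the three stand-in functions are deliberately trivial token-level stubs used to show the flow. They are not an AMR parser, an AMR manipulation, or an AMR-to-Text generator.

```python
from typing import Callable, List, TypeVar

Graph = TypeVar("Graph")


def make_negative(
    turns: List[str],
    target: int,
    parse: Callable[[str], Graph],
    manipulate: Callable[[Graph], Graph],
    generate: Callable[[Graph], str],
) -> List[str]:
    """Return a copy of `turns` with turn `target` replaced by its manipulated rewrite."""
    corrupted = list(turns)
    corrupted[target] = generate(manipulate(parse(turns[target])))
    return corrupted


# Trivial stand-ins (assumptions for illustration only).
def toy_parse(text: str) -> List[str]:
    return text.rstrip(".").split()


def toy_negate(tokens: List[str]) -> List[str]:
    return tokens[:1] + ["do", "not"] + tokens[1:]


def toy_generate(tokens: List[str]) -> str:
    return " ".join(tokens) + "."


dialogue = ["I like tea.", "What kind do you drink?"]
negative = make_negative(dialogue, 0, toy_parse, toy_negate, toy_generate)
```

Keeping the three steps as parameters means a real parser and generator could be swapped in without changing how negatives are assembled.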
These four failure modes map onto the three components described in "What three layers must discourse systems actually track?": contradiction and coreference inconsistency involve the attentional component (tracking which entities are currently salient), irrelevancy involves the intentional component (whether an utterance serves the discourse purpose), and decreased engagement spans all three, since a speaker who stops engaging is withdrawing from the linguistic, intentional, and attentional structure simultaneously.
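The mapping in the paragraph above can be written down as a small lookup table. This is a sketch only: the enum and function names are hypothetical, and the table encodes this note's reading of the mapping rather than anything defined by DEAM or by Grosz and Sidner.

```python
from enum import Enum
from typing import Dict, Set


class Component(Enum):
    LINGUISTIC = "linguistic"
    INTENTIONAL = "intentional"
    ATTENTIONAL = "attentional"


class FailureMode(Enum):
    CONTRADICTION = "contradiction"
    COREFERENCE_INCONSISTENCY = "coreference inconsistency"
    IRRELEVANCY = "irrelevancy"
    DECREASED_ENGAGEMENT = "decreased engagement"


# Which discourse components each failure mode involves, per this note.
COMPONENTS_BY_FAILURE: Dict[FailureMode, Set[Component]] = {
    FailureMode.CONTRADICTION: {Component.ATTENTIONAL},
    FailureMode.COREFERENCE_INCONSISTENCY: {Component.ATTENTIONAL},
    FailureMode.IRRELEVANCY: {Component.INTENTIONAL},
    FailureMode.DECREASED_ENGAGEMENT: set(Component),  # spans all three
}


def failures_touching(component: Component) -> Set[FailureMode]:
    """Return every failure mode that involves the given component."""
    return {
        mode
        for mode, components in COMPONENTS_BY_FAILURE.items()
        if component in components
    }
```

Querying by component makes the asymmetry visible: under this mapping, only decreased engagement reaches the linguistic component.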
Inquiring lines that use this note as a source (21)
This note is a source for these synthesized inquiries.
- Why does dialogue-shaped text fail to produce dialogue-like operations in practice?
- Why do dialogue systems fail to detect declarative clarification requests?
- What are the specific geometric signatures of failed conversations?
- What role does entity salience play in detecting incoherence?
- How do discourse structure and dialogue state management relate to each other?
- How do coreference chains preserve coherence across dialogue turns?
- Can AMR manipulation reveal where discourse coherence actually breaks down?
- How do semantic failure modes map to attentional and intentional layers?
- How do dialogue coherence failures map onto the three discourse components?
- Can discourse communities collectively detect disruptions individual readers miss?
- What is event-residue and how does it differ from utterances?
- What happens to dialogue coherence when topic models use rigid stacks instead of flexible revisitation?
- Why do discourse failures cluster in attention and intentional layers rather than linguistics?
- How does temporal event structure scaffold coherence in dialogue?
- What distinguishes local coherence from global coherence in dialogue?
- What role does accommodation play in making discourse coherent?
- What role does discourse structure play in determining at-issueness?
- What dialogue content gaps remain after review augmentation?
- Can discourse-level structure and conversational-level organization work together?
- Why do conversations with good openings but abrupt pivots fail most visibly?
- How do turn-level retrieval failures differ from dialogue-level accumulation failures?
Related concepts in this collection (8)
- How do readers track segments, purposes, and salience together?
  Can discourse processing actually happen in parallel rather than sequentially? This matters because understanding how readers coordinate multiple layers of meaning at once reveals where AI systems break down in comprehension.
  DEAM provides four specific failure modes within the coherence tracking framework
- Why do language models sound fluent without grounding?
  Explores whether LLM fluency masks the absence of communicative work: the clarifying questions, acknowledgments, and understanding checks that humans perform. Why does skipping these acts make models sound more confident?
  decreased engagement is a specific form of the grounding gap: technically responding but not communicatively working
- Why does ChatGPT fail at implicit discourse relations?
  ChatGPT excels when discourse connectives are present but drops to 24% accuracy without them. What does this gap reveal about how LLMs actually process meaning and logical relationships?
  AMR-based incoherence operates at the implicit level where LLMs fail
- What three layers must discourse systems actually track?
  Grosz and Sidner's 1986 framework proposes that discourse requires simultaneously tracking linguistic segments, speaker purposes, and salient objects. Understanding why all three are necessary helps explain where current AI systems structurally fail.
  DEAM's four failure modes map onto Grosz & Sidner's three components: contradiction and coreference inconsistency involve the attentional component (tracking salient entities), irrelevancy involves the intentional component (purpose alignment), and decreased engagement spans all three
- Can tracking dialogue dimensions simultaneously reveal hidden conversation patterns?
  Does encoding linguistic complexity, emotion, topics, and relevance as parallel temporal streams expose emergent patterns that traditional statistical analysis misses? This matters because conversation success may depend on interactions between dimensions, not individual features alone.
  DEAM's four failure modes would produce distinct signatures in Conversational DNA's multi-dimensional tracking: contradiction as semantic volatility, coreference as referential discontinuity, engagement as temporal trajectory decline
- Do language models segment events like human consensus does?
  Can GPT-3 identify event boundaries in narrative text the way humans do? This matters because it could reveal whether language models and human cognition share similar predictive mechanisms for understanding continuous experience.
  event segmentation provides temporal scaffolding for coherence: correctly segmented events make contradictions and coreference inconsistencies detectable within and across event boundaries
- What six problems must every conversation solve?
  Schegloff's Conversation Analysis identifies six universal organizational challenges that speakers navigate in all talk-in-interaction. Understanding these helps explain why current AI dialogue systems fall short of human fluency.
  DEAM's failure modes map to specific Schegloff orders: contradiction and coreference signal trouble-handling failures (understanding problems not repaired), decreased engagement is action-formation failure (speaker stops performing appropriate actions), irrelevancy is sequence-organization failure (turn doesn't cohere with prior)
- Can conversation structure predict dialogue success better than content?
  Does the geometric shape of how dialogue unfolds (timing, repetition, topic drift) matter as much as what people actually say? This explores whether interactive patterns hold signals hidden in word choice alone.
  DEAM's failure modes would produce distinct TRACE geometric signatures: contradiction as distance spikes, coreference as referential drift, engagement as flattened dynamics
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- DEAM: Dialogue Coherence Evaluation using AMR-based Semantic Manipulations
- From Persona to Person: Enhancing the Naturalness with Multiple Discourse Relations Graph Learning in Personalized Dialogue Generation
- Discursive Socratic Questioning: Evaluating the Faithfulness of Language Models’ Understanding of Discourse Relations
- Abg-CoQA: Clarifying Ambiguity in Conversational Question Answering
- Large Language Model Reasoning Failures
- A recipe for annotating grounded clarifications
- Conversational Alignment with Artificial Intelligence in Context
- MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs
Original note title
dialogue coherence has four semantic-level failure modes distinguishable through AMR manipulation: contradiction, coreference inconsistency, irrelevancy, and decreased engagement