What distinguishes local coherence from global coherence in dialogue?
This explores the difference between dialogue that hangs together turn-by-turn (each reply makes sense given the last) versus dialogue that hangs together as a whole arc (the conversation tracks a shared purpose, accumulating common ground across many turns) — and where the corpus locates each kind of failure.
This explores the split between *local* coherence — whether each turn connects sensibly to the one before it — and *global* coherence — whether the whole conversation holds together as a structured arc with a shared goal. The cleanest map of this split comes from work on discourse processing, which argues coherence isn't one thing but three layers tracked at once: the linguistic segments (what was just said), the intentional structure (what the conversation is *for*), and attentional salience (what's in focus right now) How do readers track segments, purposes, and salience together?. Local coherence lives mostly in the first and third layers; global coherence lives in the second. A reply can be locally fine — grammatical, on-topic, responsive — while the conversation as a whole drifts off its purpose.
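To make the three layers concrete, here is a minimal sketch of a discourse state that tracks them side by side. The class and field names (`DiscourseState`, `purpose_stack`, `focus`) and the entity-overlap check are illustrative assumptions, not something taken from the discourse-processing work itself. The example shows a reply that passes a crude local check while failing a crude global one.

```python
from dataclasses import dataclass, field


@dataclass
class DiscourseState:
    """Toy container for the three layers (all names are hypothetical)."""
    segments: list[str] = field(default_factory=list)       # linguistic layer: what was said
    purpose_stack: list[str] = field(default_factory=list)  # intentional layer: what the talk is for
    focus: set[str] = field(default_factory=set)            # attentional layer: what is salient now

    def add_turn(self, text: str, entities: set[str]) -> None:
        """Record a turn and move attention to the entities it mentions."""
        self.segments.append(text)
        self.focus = set(entities)

    def locally_connected(self, entities: set[str]) -> bool:
        """Crude local check: does a new turn mention anything currently in focus?"""
        return bool(self.focus & entities)

    def serves_purpose(self, purpose: str) -> bool:
        """Crude global check: does a turn's purpose match the active conversation purpose?"""
        return bool(self.purpose_stack) and self.purpose_stack[-1] == purpose


state = DiscourseState(purpose_stack=["book a flight"])
state.add_turn("I need a flight to Oslo on Friday.", {"flight", "Oslo", "Friday"})

# A reply about Oslo's weather shares an entity with the previous turn,
# so it is locally connected, but it does nothing for the booking purpose.
reply_entities = {"Oslo", "weather"}
local_ok = state.locally_connected(reply_entities)
global_ok = state.serves_purpose("small talk")
```

Here `local_ok` is true and `global_ok` is false, which is the locally fine, globally drifting case described above. A real system would need far richer checks than entity overlap and string equality.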
The corpus shows local failures are the easier ones to catch. Research using Abstract Meaning Representation found that turn-level incoherence comes in four detectable flavors — contradiction, coreference inconsistency, irrelevancy, and dropping engagement — and that these semantic breaks are visible to trained classifiers even when surface text manipulations are not What semantic failures break dialogue coherence most realistically?. These are largely *local* signals: a pronoun with no referent, a claim that clashes with the previous line. But global coherence shows up in the *shape* of the conversation, not any single turn. The TRACE work found that structural trajectory alone predicts whether a dialogue succeeds about as well as reading all the content — and combining structure with content beats either Can conversation structure predict dialogue success better than content?. The 'Conversational DNA' framing pushes the same idea: coherence is a temporal stream you track across the whole dialogue, not a property you check turn by turn Can tracking dialogue dimensions simultaneously reveal hidden conversation patterns?.
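The idea of reading the shape of a conversation rather than its content can be sketched with a toy feature extractor. This is not TRACE's actual feature set, which the summary here does not specify. The dialogue-act labels (`question`, `answer`, `confirm`) and the four features are assumptions chosen only to show what content-free, structure-only features might look like.

```python
from collections import Counter


def structural_features(acts: list[str]) -> dict[str, float]:
    """Summarize a dialogue's shape from its sequence of act labels, ignoring all wording."""
    n = len(acts)
    # Count adjacent label pairs, e.g. ("question", "answer").
    transitions = Counter(zip(acts, acts[1:]))
    return {
        "num_turns": float(n),
        "question_ratio": acts.count("question") / n if n else 0.0,
        "distinct_transitions": float(len(transitions)),
        "ends_with_confirm": float(bool(acts) and acts[-1] == "confirm"),
    }


# Hypothetical labeled dialogue: two question/answer pairs, then a confirmation.
dialogue_acts = ["question", "answer", "question", "answer", "confirm"]
features = structural_features(dialogue_acts)
```

A vector like `features` could be fed to any ordinary classifier, alone or concatenated with content features, which is the kind of structure-only versus hybrid comparison the paragraph above describes.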
What the corpus adds here is a less obvious point: large language models are fairly good at *local* coherence and structurally weak at *global* coherence, and it attributes both to the same mechanism. Because an LLM reads every later turn inside its fixed opening frame, it can't jointly update the shared 'scoreboard' of assumptions the way two humans do; when you pivot or contradict yourself, the model can't absorb that revision into mutually held background, leaving the user as the sole keeper of common ground Can LLMs truly update shared conversational common ground?. Preference optimization makes this worse: RLHF rewards confident single-turn helpfulness, which strips out the grounding acts (clarifying questions, understanding checks) that humans use to maintain coherence across a long exchange, cutting them to roughly 77% below human rates Does preference optimization harm conversational understanding?. The result is a model that nails every turn and silently loses the thread of the whole.
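A figure like "X% below human rates" is a relative shortfall: one minus the ratio of the model's grounding-act rate to the human rate. The sketch below works through that arithmetic. The label names and the two tiny turn lists are made up for illustration and are not data from the cited study.

```python
# Hypothetical labels for turns that count as grounding acts.
GROUNDING_ACTS = {"clarifying_question", "understanding_check"}


def grounding_rate(turn_labels: list[str]) -> float:
    """Fraction of turns that are grounding acts."""
    if not turn_labels:
        return 0.0
    return sum(label in GROUNDING_ACTS for label in turn_labels) / len(turn_labels)


def relative_shortfall(model_rate: float, human_rate: float) -> float:
    """How far the model rate falls below the human rate, as a fraction of the human rate."""
    if human_rate <= 0:
        raise ValueError("human_rate must be positive")
    return 1.0 - model_rate / human_rate


# Made-up example: humans ground in 2 of 5 turns, the model in 1 of 10.
human_turns = ["clarifying_question", "statement", "understanding_check",
               "statement", "statement"]
model_turns = ["statement"] * 9 + ["clarifying_question"]

human_rate = grounding_rate(human_turns)   # 2 / 5
model_rate = grounding_rate(model_turns)   # 1 / 10
shortfall = relative_shortfall(model_rate, human_rate)  # 1 - 0.1 / 0.4
```

In this invented case the model's grounding rate is 75% below the human rate. The same formula applied to real annotated dialogues is one plausible way a figure of this kind could be computed, though the study's exact method is not described here.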
The more formal accounts frame global coherence as *bidirectional belief tracking* — keeping a running model of what both speakers now jointly understand, progressing from partial to shared knowledge. Collaborative Rational Speech Acts builds exactly this across multi-turn dialogue, supplying the cross-turn belief accounting that token-level LLM generation lacks Can dialogue systems track both speakers' beliefs across turns?. There's a deeper reason commitment is hard for these models: the 20-questions regeneration test shows an LLM holds a *superposition* of possible characters and samples one at generation time rather than committing — so local consistency is cheap (any sample fits prior context) but a stable global stance is not guaranteed Do large language models actually commit to a single character?.
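The regeneration argument can be illustrated with a toy 20-questions answerer that never picks an object up front. It only keeps the set of objects consistent with the answers given so far and samples one when asked to reveal. The object table, property names, and seeded sampling are assumptions for illustration. This is a simulation of the idea, not a test run against an actual LLM.

```python
import random

# Hypothetical object table: each object has yes/no properties.
OBJECTS = {
    "cat": {"alive": True, "bigger_than_breadbox": False},
    "dog": {"alive": True, "bigger_than_breadbox": False},
    "horse": {"alive": True, "bigger_than_breadbox": True},
    "car": {"alive": False, "bigger_than_breadbox": True},
    "spoon": {"alive": False, "bigger_than_breadbox": False},
}


def consistent(history: dict[str, bool]) -> list[str]:
    """Objects that agree with every answer given so far."""
    return sorted(
        name for name, props in OBJECTS.items()
        if all(props[question] == answer for question, answer in history.items())
    )


def reveal(history: dict[str, bool], rng: random.Random) -> str:
    """Sample one object at reveal time instead of committing at the start."""
    return rng.choice(consistent(history))


history = {"alive": True}           # the only answer given so far
candidates = consistent(history)    # everything still compatible with the dialogue

# "Regenerate" the reveal many times with different random seeds.
reveals = {reveal(history, random.Random(seed)) for seed in range(50)}
```

Every element of `reveals` is locally consistent with the dialogue so far, yet different regenerations can name different objects. A committed answerer would instead fix one object before the first question and return it on every reveal.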
So the distinction isn't just academic. Local coherence is turn-adjacency you can audit with semantic classifiers; global coherence is sustained purpose, accumulated common ground, and a committed stance that only reveals itself across the whole conversation — and it's precisely the dimension current systems track worst.
Sources (8 notes)
Discourse processing demands parallel recognition of linguistic segments, intentional structure, and attentional salience—not sequential processing. These three layers constrain each other during comprehension, and failures in any single layer disrupt overall understanding.
Research using Abstract Meaning Representation identified four distinct incoherence types: contradiction, coreference inconsistency, irrelevancy, and decreased engagement. AMR-trained classifiers detect these semantic failures while text-level manipulations alone cannot.
TRACE achieved 68% accuracy predicting dialogue success from structural features alone, close to a 70% content-based baseline. A hybrid combining both reached 80%, suggesting that how agents communicate matters about as much as what they say.
Conversational DNA encodes four simultaneous dimensions—linguistic complexity, emotional trajectories, topic coherence, and conversational relevance—as temporal streams. The reverse Turing test finding showed expert assessments of AI diverged sharply, suggesting conversational structure shapes interpretation as much as content.
LLMs interpret all subsequent conversational turns within a fixed initial prompt frame, preventing them from symmetrically proposing updates to shared assumptions. Even when users pivot topics or contradict earlier framings, the model cannot absorb revisions into jointly held background, making the user the sole maintainer of the conversational scoreboard.
RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment systematically reduces grounding acts to 77.5% below human levels, creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.
CRSA integrates rate-distortion theory with RSA to enable bidirectional belief tracking across dialogue turns. Demonstrated on referential games and doctor-patient dialogues, it captures progression from partial to shared understanding, providing the information-theoretic framework that token-level LLM systems lack.
Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, which indicates that no fixed commitment exists.