INQUIRING LINE

What distinguishes local coherence from global coherence in dialogue?

This explores the difference between dialogue that hangs together turn by turn (each reply makes sense given the last) and dialogue that hangs together as a whole arc (the conversation tracks a shared purpose, accumulating common ground across many turns), and where the corpus locates each kind of failure.


The split is between *local* coherence, whether each turn connects sensibly to the one before it, and *global* coherence, whether the whole conversation holds together as a structured arc with a shared goal. The cleanest map of this split comes from work on discourse processing, which argues that coherence isn't one thing but three layers tracked at once: the linguistic segments (what was just said), the intentional structure (what the conversation is *for*), and attentional salience (what's in focus right now) [How do readers track segments, purposes, and salience together?]. Local coherence lives mostly in the first and third layers; global coherence lives in the second. A reply can be locally fine (grammatical, on-topic, responsive) while the conversation as a whole drifts off its purpose.
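A toy sketch of that last point, assuming plain word overlap as a crude stand-in for semantic connection. The corpus does not propose this measure; `local_scores`, `global_scores`, and the sample dialogue are hypothetical and exist only to show a conversation that chains well turn by turn while losing its purpose.

```python
import re

STOP = {"a", "the", "to", "in", "for", "i"}

def words(text):
    """Lowercased words minus a few stopwords (an assumed, crude proxy for meaning)."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP}

def overlap(a, b):
    """Jaccard overlap between the word sets of two strings."""
    wa, wb = words(a), words(b)
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0

def local_scores(turns):
    """Local coherence: score each turn against the turn just before it."""
    return [overlap(prev, cur) for prev, cur in zip(turns, turns[1:])]

def global_scores(goal, turns):
    """Global coherence: score each turn against the conversation's purpose."""
    return [overlap(goal, t) for t in turns]

# Hypothetical dialogue: every turn picks up words from the previous one,
# but the stated goal disappears after the second turn.
goal = "book a flight to Lisbon"
turns = [
    "I need to book a flight to Lisbon",
    "Lisbon flight prices drop in autumn",
    "Autumn prices drop for hotels too",
    "Hotels near beaches too crowded",
    "Crowded beaches ruin surfing",
]

loc = local_scores(turns)        # no adjacent pair scores zero
glo = global_scores(goal, turns) # falls to 0.0 from the third turn onward
print([round(s, 2) for s in loc])
print([round(s, 2) for s in glo])
```

Every adjacent pair shares vocabulary, so a turn-by-turn audit passes, while the goal-relative scores collapse. A real system would need semantic rather than lexical comparison, but the two measurements stay distinct in the same way.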

The corpus shows local failures are the easier ones to catch. Research using Abstract Meaning Representation found that turn-level incoherence comes in four detectable flavors (contradiction, coreference inconsistency, irrelevancy, and decreased engagement) and that AMR-trained classifiers detect these semantic breaks where text-level manipulations alone fall short [What semantic failures break dialogue coherence most realistically?]. These are largely *local* signals: a pronoun with no referent, a claim that clashes with the previous line. Global coherence, by contrast, shows up in the *shape* of the conversation, not in any single turn. The TRACE work found that structural trajectory alone predicts whether a dialogue succeeds about as well as content does (68% accuracy against a 70% content-based baseline), and that combining structure with content beats either (80%) [Can conversation structure predict dialogue success better than content?]. The 'Conversational DNA' framing pushes the same idea: coherence is a temporal stream you track across the whole dialogue, not a property you check turn by turn [Can tracking dialogue dimensions simultaneously reveal hidden conversation patterns?].

The corpus's least obvious finding is that large language models are fairly good at *local* coherence and structurally bad at *global* coherence, and that the same mechanism causes both. Because an LLM reads every later turn inside its fixed opening frame, it can't jointly update the shared 'scoreboard' of assumptions the way two humans do; when you pivot or contradict yourself, the model can't absorb that revision into mutually held background, leaving the user as the sole keeper of common ground [Can LLMs truly update shared conversational common ground?]. Preference optimization makes this worse: RLHF rewards confident single-turn helpfulness, which strips out the grounding acts (clarifying questions, understanding checks) that humans use to maintain coherence across a long exchange, cutting them 77.5% below human rates [Does preference optimization harm conversational understanding?]. The result is a model that nails every turn and silently loses the thread of the whole.
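The roughly 77% figure is a relative reduction in how often grounding acts occur. A minimal sketch of that arithmetic, assuming turns have already been labeled for grounding acts; the labels below are invented so that the result matches the 77.5% given in the source note, and they are not data from the cited work. The function names are hypothetical.

```python
def grounding_rate(labels):
    """Fraction of turns labeled as containing a grounding act."""
    return sum(labels) / len(labels)

def relative_reduction(human_rate, model_rate):
    """How far the model's rate falls below the human rate, as a share of the human rate."""
    return (human_rate - model_rate) / human_rate

# Hypothetical labels: 1 = turn contains a clarifying question or
# understanding check, 0 = it does not.
human_turns = [1] * 40 + [0] * 60   # 40 of 100 turns
model_turns = [1] * 9 + [0] * 91    # 9 of 100 turns

reduction = relative_reduction(grounding_rate(human_turns),
                               grounding_rate(model_turns))
print(f"{reduction:.1%}")  # prints 77.5%
```

Read this way, the model still produces some grounding acts; it simply produces them at under a quarter of the human rate.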

The more formal accounts frame global coherence as *bidirectional belief tracking*: keeping a running model of what both speakers now jointly understand as it progresses from partial to shared knowledge. Collaborative Rational Speech Acts (CRSA) builds exactly this across multi-turn dialogue, supplying the cross-turn belief accounting that token-level LLM generation lacks [Can dialogue systems track both speakers' beliefs across turns?]. There's a deeper reason commitment is hard for these models: the 20-questions regeneration test shows that an LLM holds a *superposition* of possible characters and samples one at generation time rather than committing, so local consistency is cheap (any sample fits prior context) but a stable global stance is not guaranteed [Do large language models actually commit to a single character?].
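The regeneration test can be mimicked with a toy simulation, assuming a tiny hand-written object table in place of a language model. The `OBJECTS` table, `consistent`, and `reveal` are hypothetical names for illustration, not part of the cited test. The answerer never fixes a secret object; at reveal time it samples from whatever is still consistent with the yes/no answers already given.

```python
import random

# Hypothetical candidate objects and their properties.
OBJECTS = {
    "cat":    {"alive": True,  "bigger_than_breadbox": False},
    "oak":    {"alive": True,  "bigger_than_breadbox": True},
    "whale":  {"alive": True,  "bigger_than_breadbox": True},
    "teacup": {"alive": False, "bigger_than_breadbox": False},
    "piano":  {"alive": False, "bigger_than_breadbox": True},
}

def consistent(history):
    """Objects that agree with every (property, answer) pair given so far."""
    return [name for name, props in OBJECTS.items()
            if all(props[prop] == answer for prop, answer in history)]

def reveal(history, rng):
    """Sample one still-consistent object; nothing was committed to earlier."""
    return rng.choice(consistent(history))

# The answers given so far in the game.
history = [("alive", True), ("bigger_than_breadbox", True)]

# "Regenerate" the final reveal many times from the same context.
rng = random.Random(0)
reveals = {reveal(history, rng) for _ in range(50)}
print(sorted(reveals))
```

Each regenerated reveal fits the prior answers, which is the cheap local consistency described above, yet different regenerations name different objects. A committed answerer would instead pick one object before the first question and return it every time.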

So the distinction isn't just academic. Local coherence is turn-adjacency you can audit with semantic classifiers; global coherence is sustained purpose, accumulated common ground, and a committed stance that only reveals itself across the whole conversation — and it's precisely the dimension current systems track worst.


Sources (8 notes)

How do readers track segments, purposes, and salience together?

Discourse processing demands parallel recognition of linguistic segments, intentional structure, and attentional salience—not sequential processing. These three layers constrain each other during comprehension, and failures in any single layer disrupt overall understanding.

What semantic failures break dialogue coherence most realistically?

Research using Abstract Meaning Representation identified four distinct incoherence types: contradiction, coreference inconsistency, irrelevancy, and decreased engagement. AMR-trained classifiers detect these semantic failures while text-level manipulations alone cannot.

Can conversation structure predict dialogue success better than content?

TRACE achieved 68% accuracy predicting dialogue success from structural features alone, matching a 70% content-based baseline. A hybrid combining both reached 80%, suggesting how agents communicate rivals what they say.

Can tracking dialogue dimensions simultaneously reveal hidden conversation patterns?

Conversational DNA encodes four simultaneous dimensions—linguistic complexity, emotional trajectories, topic coherence, and conversational relevance—as temporal streams. The reverse Turing test finding showed expert assessments of AI diverged sharply, suggesting conversational structure shapes interpretation as much as content.

Can LLMs truly update shared conversational common ground?

LLMs interpret all subsequent conversational turns within a fixed initial prompt frame, preventing them from symmetrically proposing updates to shared assumptions. Even when users pivot topics or contradict earlier framings, the model cannot absorb revisions into jointly held background, making the user the sole maintainer of the conversational scoreboard.

Does preference optimization harm conversational understanding?

RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment systematically cuts grounding acts to 77.5% below human levels, creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.

Can dialogue systems track both speakers' beliefs across turns?

CRSA integrates rate-distortion theory with RSA to enable bidirectional belief tracking across dialogue turns. Demonstrated on referential games and doctor-patient dialogues, it captures progression from partial to shared understanding, providing the information-theoretic framework that token-level LLM systems lack.

Do large language models actually commit to a single character?

Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, indicating that no fixed commitment exists.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a dialogue systems researcher. The question remains open: what distinguishes LOCAL coherence (turn-to-turn continuity) from GLOBAL coherence (sustained purpose, shared goals, accumulated common ground across a full exchange)?

What a curated library found — and when (findings span 2021–2025, treat as dated claims):
• Local coherence failures cluster in four semantic types (contradiction, coreference inconsistency, irrelevancy, engagement drop) detectable by classifiers; global coherence lives in conversation *shape*, not individual turns (2022).
• LLM systems excel at local turn-level coherence but fail at global coherence because they cannot jointly update common ground with users — the model reads every turn inside a fixed opening frame, leaving the user as sole keeper of background assumptions (2024–2025).
• RLHF preference optimization erases grounding acts (clarifications, understanding checks) that humans use to maintain long-exchange coherence, cutting them ~77% below human rates, trading global tracking for confident single-turn helpfulness (2025).
• Structural trajectory alone predicts dialogue success ~as well as content; combining structure + content beats either; conversation geometry encodes temporal coherence independent of semantic content (2023–2025).
• LLMs hold a superposition of possible stances rather than committing to a single global viewpoint; local consistency is cheap (any sample fits prior context), but stable global identity across turns is not guaranteed (2024–2025).

Anchor papers (verify; mind their dates):
• arXiv:2203.09711 (DEAM, 2022) — semantic-level local failure modes via AMR
• arXiv:2311.09144 (Grounding Gaps, 2023) — common-ground asymmetry in LLM dialogue
• arXiv:2507.14063 (Collaborative Rational Speech Acts, 2025) — bidirectional belief tracking for multi-turn coherence
• arXiv:2508.07520 (Conversational DNA, 2025) — temporal structure as coherence signal

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, determine whether recent advances in long-context models (e.g., 200k+ tokens), conversation memory architectures, multi-agent orchestration, explicit goal-tracking mechanisms, belief-tracking frameworks (e.g., 2507.14063), or new training paradigms (e.g., process reward modeling, dialogue-specific RL) have since *relaxed* the claimed deficit in global coherence tracking. Separate the durable question (how do models maintain commitment across turns?) from the perishable limitation (models cannot update common ground). Cite what resolved it or confirm the constraint still holds.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Look for papers claiming LLMs DO sustain global coherence, or that the local–global distinction dissolves under certain architectures or training regimes.
(3) Propose 2 research questions that ASSUME the regime may have shifted: one assuming long-context or memory-augmented models now handle global coherence better, one assuming the local–global split itself is an artifact of generation-time sampling rather than a real architectural boundary.

Cite arXiv IDs; flag anything you cannot ground in a real paper.
