Why do language models lose performance in longer conversations?
Does multi-turn degradation stem from fundamental model limitations, or from misalignment between what users mean and what models assume? Understanding the root cause could guide better solutions.
The Intent Mismatch paper offers a fundamentally different explanation for why LLMs get lost in multi-turn conversation. Where Laban et al. attribute the roughly 39% average degradation to model unreliability, this paper argues the root cause is pragmatic mismatch between user expression and model interpretation: an intent alignment gap, not a capability deficit.
Two reframing moves are critical:
First, making premature assumptions is not erroneous behavior but a rational strategy induced by RLHF training. The dominant training objective rewards helpfulness and penalizes evasive responses. Under incomplete information, the model constructs a plausible task formulation for a "typical" user and produces a provisional answer — because that is what the training signal demands. The model is doing exactly what we trained it to do; the problem is that we trained it for the wrong thing in multi-turn contexts.
Second, the bottleneck is not model capacity or reasoning depth but pragmatic mismatch. Users exhibit systematic individual variation — the same utterance may map to disparate underlying intentions. General-purpose LLMs, aligned to the "average" user, cannot adapt to idiosyncratic behaviors. Models frequently misinterpret fragmentary continuations as confirmations rather than corrections, reinforcing incorrect context.
The proposed fix is architectural: a Mediator-Assistant pipeline that decouples intent understanding from task execution. The Mediator explicates user inputs — articulating latent requirements before they reach the execution Assistant. An LLM-based Refiner distills explicit guidelines from discrepancies between failed and successful interaction trajectories. This enables adaptation to individual user behaviors without weight updates.
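The decoupling described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the names `Mediator`, `refine`, and `run_turn` are hypothetical, the prompts are invented for the example, and each "LLM" is assumed to be any function from a prompt string to a completion string (stubbed here so the sketch runs without a model).

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Assumption: an "LLM" is any callable mapping a prompt to a completion.
LLM = Callable[[str], str]


@dataclass
class Mediator:
    """Rewrites a raw user turn into an explicit instruction (hypothetical API)."""
    llm: LLM
    guidelines: List[str] = field(default_factory=list)

    def explicate(self, history: List[str], user_turn: str) -> str:
        lines = ["Guidelines for interpreting this user:"]
        lines += [f"- {g}" for g in self.guidelines]
        lines += ["Conversation so far:", *history]
        lines += [
            f"Latest user turn: {user_turn}",
            "Rewrite the latest turn as a fully specified instruction.",
        ]
        return self.llm("\n".join(lines))


def refine(llm: LLM, failed: List[str], succeeded: List[str]) -> str:
    """Distill one explicit guideline from contrasting trajectories."""
    prompt = "\n".join([
        "Failed trajectory:", *failed,
        "Successful trajectory:", *succeeded,
        "State one guideline that explains the difference.",
    ])
    return llm(prompt).strip()


def run_turn(mediator: Mediator, assistant: LLM,
             history: List[str], user_turn: str) -> str:
    """The Assistant only ever sees the Mediator's explicit instruction."""
    explicit = mediator.explicate(history, user_turn)
    reply = assistant(explicit)
    history.extend([f"User: {user_turn}", f"Assistant: {reply}"])
    return reply


def demo():
    # Stub models standing in for real LLM calls (illustrative only).
    def mediator_llm(prompt: str) -> str:
        if "corrections" in prompt:
            return "Replace the earlier answer: sort descending, not ascending."
        return "Confirm the earlier answer and continue."

    def assistant_llm(instruction: str) -> str:
        return f"[executing] {instruction}"

    def refiner_llm(prompt: str) -> str:
        return "Treat short follow-ups as corrections, not confirmations."

    mediator = Mediator(llm=mediator_llm)
    history: List[str] = []
    before = run_turn(mediator, assistant_llm, history, "descending")

    # Adaptation happens by adding a guideline, with no weight updates.
    guideline = refine(refiner_llm,
                       failed=["User: descending", "Assistant: kept ascending"],
                       succeeded=["User: descending", "Assistant: re-sorted"])
    mediator.guidelines.append(guideline)
    after = run_turn(mediator, assistant_llm, history, "descending")
    return before, after


if __name__ == "__main__":
    for label, reply in zip(("before", "after"), demo()):
        print(label, "->", reply)
```

In the stubbed run, the same fragmentary turn ("descending") is read as a confirmation before the guideline is added and as a correction afterwards. The only thing that changed is the Mediator's guideline list, which is the sense in which adaptation happens without touching model weights.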
The theoretical claim is strong: scaling model size or improving training alone cannot resolve this gap, because it arises from structural ambiguity in conversational context rather than representational limitations. This challenges the implicit assumption that bigger or better models will solve multi-turn problems. The QuestBench finding (see Can models identify what information they actually need?) reinforces this: the Mediator's role in explicating latent requirements addresses a capability that models demonstrably lack. They cannot identify what information is missing even when they can solve the fully specified version of the problem. The intent alignment gap is thus not just about pragmatic mismatch but also about a separable cognitive deficit in information gathering. Furthermore, when intent is genuinely underspecified (as it is in most multi-turn conversation), reasoning models compound the problem by overthinking rather than recognizing incompleteness (see Why do reasoning models overthink ill-posed questions?), which makes the Mediator architecture even more necessary.
CollabLLM's reward-signal fix (see Why do language models respond passively instead of asking clarifying questions?) and this paper's architectural fix represent complementary intervention levels for the same underlying problem.
The multi-turn degradation problem exists on both sides of the interaction. User simulators, the systems that conversational agents train against, exhibit the same goal misalignment: they "struggle to consistently adhere to their user goals throughout conversations," failing to maintain profiles, manage multiple objectives, or complete within conversation limits. When simulators drift, they generate conversations that teach agents wrong behaviors through misleading reward signals (see Why do LLM user simulators fail to track their own goals?). This is the evaluation-side manifestation: agent degradation and evaluation degradation compound each other.
Inquiring lines that use this note as a source (41)
This note is a source for these synthesized inquiries.
- Why do comprehensive posts without uncertainty tend to suppress conversation?
- Why does context collapse pose risks in high-stakes conversations?
- Why does removing language from its context destroy what makes it work?
- Why does weakening communication inevitably eliminate it entirely?
- Does turn-level intent control prevent simulator drift during long conversations?
- Why do large language models follow user drift instead of maintaining topic focus?
- Why do conversational queries drift away from what triggered them?
- Why does adding more conversational data fail to improve maintenance skills?
- Can models infer maintenance operations from conversational text data alone?
- Does full conversation history improve or degrade multi-turn retrieval accuracy?
- What are the specific geometric signatures of failed conversations?
- Why do large language models fail at taking conversational initiative?
- How does optimizing model performance decouple from optimizing user interpretability?
- Why do language models fail when users switch between and return to topics?
- How do conversation repair patterns handle user corrections and interruptions?
- Why do language models fail at coreference across long contexts?
- Why do Claude and Llama optimize for different dialogue outcomes?
- How vulnerable are language models themselves to multi-turn persuasive pressure?
- Why do language models tend to elaborate and expand rather than compress information?
- Why do current language models fail at linguistic synchrony with clients?
- Why do weaker language models fail at multi-turn strategic questioning?
- Why do language models struggle with context-dependent pragmatic interpretation?
- How do users update their partner models during ongoing conversation?
- What makes persona-assigned language models unstable across different conversation runs?
- Why do conversational systems benefit from post-thinking between user turns?
- Why do language models use twice as many words per conversation turn?
- Which conversation types most reliably cause models to drift from Assistant mode?
- Why do models lack a stable underlying identity to return to?
- Why do benchmarks measuring string quality fail to capture communicative success?
- How does RLHF alignment training reduce multi-turn conversational capability?
- What prevents AI from recovering after conversations take a wrong turn?
- Why do longer context windows alone fail to capture temporal dynamics in dialogue?
- How does repeated content shift model outputs across multiple turns?
- Why do conversations with good openings but abrupt pivots fail most visibly?
- How does effort mismatch between user and model appear in conversation geometry?
- Why do models struggle with asking questions in multi-turn conversational reasoning tasks?
- How does treating conversation as a resource change what models learn to do?
- How do turn-level retrieval failures differ from dialogue-level accumulation failures?
- Why do alignment values become problematic as language models scale?
- Why do current large language models fail to entrain with users?
- What structural updates prevent context collapse in evolving conversations?
Related concepts in this collection (10)
- Why do language models respond passively instead of asking clarifying questions?
  Explores whether the reward signals used to train language models might actively discourage them from seeking clarification or taking initiative in conversations, and what alternative training approaches might enable more collaborative dialogue.
  complementary intervention: reward fix (CollabLLM) vs. architecture fix (Mediator-Assistant)
- Why do language models fail in gradually revealed conversations?
  Explores why LLMs perform 39% worse when instructions arrive incrementally rather than upfront, and whether they can recover from early mistakes in multi-turn dialogue.
  the phenomenon this paper reinterprets
- Does preference optimization harm conversational understanding?
  Exploring whether RLHF training that rewards confident, complete responses undermines the grounding acts (clarifications, checks, acknowledgments) that actually build shared understanding in dialogue.
  RLHF causing the helpfulness-passivity trade-off that Intent Mismatch identifies as the driver
- Why do language models avoid correcting false user claims?
  Explores whether LLM grounding failures stem from missing knowledge or from conversational dynamics. Examines whether models use face-saving strategies similar to humans when disagreement is needed.
  face-saving is a specific pragmatic mechanism consistent with intent alignment gap framing
- Can models identify what information they actually need?
  When a reasoning task is missing a key piece of information, can language models recognize what's absent and ask the right clarifying question? QuestBench tests this capability directly.
  the Mediator addresses a separable cognitive deficit: models that solve well-specified problems still cannot identify missing information, which is exactly the Mediator's role
- Why do reasoning models overthink ill-posed questions?
  Explores why models trained for extended reasoning produce drastically longer, less useful responses to unanswerable questions, and whether this represents a fixable training deficit or inherent limitation.
  when intent is underspecified, reasoning models overthink rather than recognizing incompleteness; the Mediator architecture bypasses this by separating intent understanding from task execution
- Can models learn to ask clarifying questions instead of guessing?
  Exploring whether large language models can be trained to detect incomplete queries and actively request missing information rather than hallucinating answers or refusing to respond. This matters because conversational agents today remain passive, responding only when prompted.
  the trainable capability complement to the Mediator's architectural solution: RL-trained proactive questioning addresses the same intent alignment gap from the capability side
- Can full episode rewards per step enable better credit assignment?
  Can attributing cumulative episode reward to every step in a trajectory, rather than discounting by step distance, actually solve credit assignment in sequential LLM decision-making? This challenges intuitive RL assumptions about how credit should flow backward through time.
  training-level fix: MS-GRPO addresses the credit assignment gap that makes single-turn-trained models fail at multi-turn tasks; cumulative episode reward teaches models that earlier decisions affect later outcomes, complementing the Mediator's architectural fix with a training formulation fix
- Can models learn to ask genuinely useful clarifying questions?
  Explores whether question-asking quality is teachable through decomposing it into specific attributes like clarity and relevance, rather than treating it as a monolithic skill.
  the Mediator's role in explicating latent requirements demands high-quality question-asking capability; ALFA provides the methodology: decompose question quality into theory-grounded attributes and align via attribute-specific preference optimization
- Which clarifying questions actually improve user satisfaction?
  Not all clarification helps equally. This explores whether asking users to rephrase their needs works as well as asking targeted questions about specific information gaps.
  when the Mediator probes for latent intent, question design matters: specific-facet questions that demonstrate understanding outperform need-rephrasing; the Mediator must ask well, not just ask
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- LLMs Get Lost In Multi-Turn Conversation
- Intent Mismatch Causes LLMs to Get Lost in Multi-Turn Conversation
- MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs
- The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs
- Spurious Forgetting in Continual Learning of Language Models
- CollabLLM: From Passive Responders to Active Collaborators
- The Earth is Flat because...: Investigating LLMs' Belief towards Misinformation via Persuasive Conversation
- Large Language Model Reasoning Failures
Original note title
multi-turn performance degradation is an intent alignment gap not an intrinsic capability deficit — decoupling intent understanding from task execution recovers lost performance