Why do language models lose performance in longer conversations?

Does multi-turn degradation stem from fundamental model limitations, or from misalignment between what users mean and what models assume? Understanding the root cause could guide better solutions.

Synthesis note · 2026-02-22 · sourced from Conversation Topics Dialog

The Intent Mismatch paper offers a fundamentally different explanation for why LLMs get lost in multi-turn conversation. Where Laban et al. attribute the ~39% degradation to model unreliability, this paper argues the root cause is pragmatic mismatch between user expression and model interpretation: an intent alignment gap, not a capability deficit.

Two reframing moves are critical:

First, making premature assumptions is not erroneous behavior but a rational strategy induced by RLHF training. The dominant training objective rewards helpfulness and penalizes evasive responses. Under incomplete information, the model constructs a plausible task formulation for a "typical" user and produces a provisional answer — because that is what the training signal demands. The model is doing exactly what we trained it to do; the problem is that we trained it for the wrong thing in multi-turn contexts.

Second, the bottleneck is not model capacity or reasoning depth but pragmatic mismatch. Users exhibit systematic individual variation — the same utterance may map to disparate underlying intentions. General-purpose LLMs, aligned to the "average" user, cannot adapt to idiosyncratic behaviors. Models frequently misinterpret fragmentary continuations as confirmations rather than corrections, reinforcing incorrect context.

The proposed fix is architectural: a Mediator-Assistant pipeline that decouples intent understanding from task execution. The Mediator explicates user inputs — articulating latent requirements before they reach the execution Assistant. An LLM-based Refiner distills explicit guidelines from discrepancies between failed and successful interaction trajectories. This enables adaptation to individual user behaviors without weight updates.
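The division of labour described above can be sketched in a few lines. This is only an illustration of the decoupling, not the paper's implementation: the class names mirror the note's terms, but the prompt wording, the `LLM` callable type, and the `run_turn` helper are all hypothetical stand-ins for whatever model calls a real system would make.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stand-in: any function that maps a prompt string to a completion.
LLM = Callable[[str], str]


@dataclass
class Mediator:
    """Turns a raw user turn into an explicit task statement."""
    llm: LLM
    # Plain-text guidelines supplied by the Refiner; no weights are updated.
    guidelines: List[str] = field(default_factory=list)

    def explicate(self, history: List[str], user_turn: str) -> str:
        prompt = "\n".join([
            "Guidelines for interpreting this user:",
            *[f"- {g}" for g in self.guidelines],
            "Conversation so far:",
            *history,
            f"Latest user turn: {user_turn}",
            "Restate the user's full, explicit request:",
        ])
        return self.llm(prompt)


@dataclass
class Refiner:
    """Distills one guideline by contrasting failed and successful trajectories."""
    llm: LLM

    def distill(self, failed: List[str], succeeded: List[str]) -> str:
        prompt = "\n".join([
            "Failed interactions:", *failed,
            "Successful interactions:", *succeeded,
            "State one guideline that explains the difference:",
        ])
        return self.llm(prompt).strip()


def run_turn(mediator: Mediator, assistant: LLM,
             history: List[str], user_turn: str) -> str:
    """One turn: the Assistant only ever sees the explicated request."""
    explicit_request = mediator.explicate(history, user_turn)
    reply = assistant(explicit_request)
    history.extend([f"User: {user_turn}", f"Assistant: {reply}"])
    return reply
```

Because the guidelines live in the Mediator's prompt rather than in model parameters, adapting to a particular user in this sketch amounts to appending the Refiner's output to `mediator.guidelines`.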

The theoretical claim is strong: scaling model size or improving training alone cannot resolve this gap, because it arises from structural ambiguity in conversational context rather than representational limitations. This challenges the implicit assumption that bigger or better models will solve multi-turn problems.

The QuestBench finding reinforces this (see "Can models identify what information they actually need?"). The Mediator's role in explicating latent requirements addresses a capability that models demonstrably lack: they cannot identify what information is missing even when they can solve the fully specified version of the problem. The intent alignment gap is thus not just about pragmatic mismatch but also about a separable cognitive deficit in information gathering. Furthermore, when intent is genuinely underspecified, as it is in most multi-turn conversation, reasoning models compound the problem by overthinking rather than recognizing incompleteness (see "Why do reasoning models overthink ill-posed questions?"), which makes the Mediator architecture even more necessary.

CollabLLM's reward-signal fix (see "Why do language models respond passively instead of asking clarifying questions?") and this paper's architectural fix are complementary interventions, at different levels, for the same underlying problem.

The multi-turn degradation problem exists on both sides of the interaction. User simulators, the systems that conversational agents train against, exhibit the same goal misalignment: they "struggle to consistently adhere to their user goals throughout conversations," failing to maintain profiles, manage multiple objectives, or finish within conversation limits. When simulators drift, they generate conversations that teach agents the wrong behaviors through misleading reward signals (see "Why do LLM user simulators fail to track their own goals?"). This is the evaluation-side manifestation: agent degradation and evaluation degradation compound each other.

Related concepts in this collection 10

Why do language models respond passively instead of asking clarifying questions? Explores whether the reward signals used to train language models might actively discourage them from seeking clarification or taking initiative in conversations, and what alternative training approaches might enable more collaborative dialogue.
complementary intervention: reward fix (CollabLLM) vs. architecture fix (Mediator-Assistant)
Why do language models fail in gradually revealed conversations? Explores why LLMs perform 39% worse when instructions arrive incrementally rather than upfront, and whether they can recover from early mistakes in multi-turn dialogue.
the phenomenon this paper reinterprets
Does preference optimization harm conversational understanding? Exploring whether RLHF training that rewards confident, complete responses undermines the grounding acts—clarifications, checks, acknowledgments—that actually build shared understanding in dialogue.
RLHF causing the helpfulness-passivity trade-off that Intent Mismatch identifies as the driver
Why do language models avoid correcting false user claims? Explores whether LLM grounding failures stem from missing knowledge or from conversational dynamics. Examines whether models use face-saving strategies similar to humans when disagreement is needed.
face-saving is a specific pragmatic mechanism consistent with intent alignment gap framing
Can models identify what information they actually need? When a reasoning task is missing a key piece of information, can language models recognize what's absent and ask the right clarifying question? QuestBench tests this capability directly.
the Mediator addresses a separable cognitive deficit: models that solve well-specified problems still cannot identify missing information, which is exactly the Mediator's role
Why do reasoning models overthink ill-posed questions? Explores why models trained for extended reasoning produce drastically longer, less useful responses to unanswerable questions—and whether this represents a fixable training deficit or inherent limitation.
when intent is underspecified, reasoning models overthink rather than recognizing incompleteness; the Mediator architecture bypasses this by separating intent understanding from task execution
Can models learn to ask clarifying questions instead of guessing? Exploring whether large language models can be trained to detect incomplete queries and actively request missing information rather than hallucinating answers or refusing to respond. This matters because conversational agents today remain passive, responding only when prompted.
the trainable capability complement to the Mediator's architectural solution: RL-trained proactive questioning addresses the same intent alignment gap from the capability side
Can full episode rewards per step enable better credit assignment? Can attributing cumulative episode reward to every step in a trajectory, rather than discounting by step distance, actually solve credit assignment in sequential LLM decision-making? This challenges intuitive RL assumptions about how credit should flow backward through time.
training-level fix: MS-GRPO addresses the credit assignment gap that makes single-turn-trained models fail at multi-turn tasks; cumulative episode reward teaches models that earlier decisions affect later outcomes, complementing the Mediator's architectural fix with a training formulation fix
Can models learn to ask genuinely useful clarifying questions? Explores whether question-asking quality is teachable through decomposing it into specific attributes like clarity and relevance, rather than treating it as a monolithic skill.
the Mediator's role in explicating latent requirements demands high-quality question-asking capability; ALFA provides the methodology: decompose question quality into theory-grounded attributes and align via attribute-specific preference optimization
Which clarifying questions actually improve user satisfaction? Not all clarification helps equally. This explores whether asking users to rephrase their needs works as well as asking targeted questions about specific information gaps.
when the Mediator probes for latent intent, question design matters: specific-facet questions that demonstrate understanding outperform need-rephrasing; the Mediator must ask well, not just ask
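
The credit-assignment entry in the list above contrasts two ways of spreading reward over the steps of a trajectory. The sketch below shows only that contrast on a toy reward sequence. It is not an implementation of MS-GRPO, whose details this note does not give; the function names and the example values in the comments are illustrative assumptions.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Conventional return-to-go: G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step sees the discounted sum of later rewards.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def full_episode_credit(rewards: List[float]) -> List[float]:
    """Every step receives the cumulative episode reward, with no discounting."""
    total = sum(rewards)
    return [total] * len(rewards)


# Toy three-turn episode where only the final turn is rewarded (assumed values).
episode = [0.0, 0.0, 1.0]
by_distance = discounted_returns(episode, gamma=0.5)  # [0.25, 0.5, 1.0]
by_episode = full_episode_credit(episode)             # [1.0, 1.0, 1.0]
```

With discounting, the first turn receives a quarter of the credit given to the last one in this example; with full-episode credit, an early decision is weighted the same as the final step it made possible.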
