What accounts for performance drops in multi-turn agent interactions?
This explores why AI agents and assistants get worse as interactions stretch across many turns. The corpus suggests there isn't a single cause — performance drops trace to at least three separable mechanisms: premature commitment, memory degradation, and coordination breakdown. Sorting out which one you're hitting changes the fix.
The most direct culprit is what one note calls the wrong-turn problem: models reach ~90% accuracy when the full instruction arrives in a single message but fall to ~65% when the same information arrives gradually across a conversation (see "Why do AI assistants get worse at longer conversations?"). The model locks into an early guess and can't course-correct. Crucially, the note frames this as a training artifact, not a capacity limit: RLHF rewards confidently helpful answers over clarifying questions. The same root shows up from a different angle in work on proactive agents. Next-turn reward optimization structurally strips out initiative, so models won't pause to clarify even when they should. Yet that behavior is trainable: the note reports proactive behavior rising from 0.15% to ~74% with the right RL signal (see "Why do AI agents fail to take initiative?"). So part of the multi-turn drop is self-inflicted by how we trained for short-horizon helpfulness.
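One way to picture the alternative to early lock-in is a gate that refuses to answer while required information is still missing. The sketch below is a harness-level stand-in, not the RL-trained behavior the notes describe; the slot names, the flight-booking task, and `next_action` are all hypothetical.

```python
from __future__ import annotations

from typing import Optional

# Hypothetical required fields for a hypothetical flight-booking task.
REQUIRED_SLOTS = ("origin", "destination", "date")


def next_action(known: dict[str, str]) -> tuple[str, Optional[str]]:
    """Return ("ask", slot) while information is missing, else ("answer", None)."""
    for slot in REQUIRED_SLOTS:
        if slot not in known:
            # Ask about the first gap instead of guessing a value for it.
            return ("ask", slot)
    return ("answer", None)
```

With only `origin` known, the gate asks for `destination` rather than committing to a guess; it answers only once all three slots are filled.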
A second mechanism is memory. As history accumulates, naive context handling degrades. One line of work decomposes agent working memory into four components across two time scales: dialogue-level (conversation history, scratchpad) and turn-level (examples, task trajectory). It argues that each component has its own failure mode and update policy, so a single undifferentiated context window is the wrong design (see "How should agent memory split across time scales?"). The proposed remedy is structured consolidation: agents that autonomously fold past interactions into episodic, working, and tool memory schemas cut token overhead and avoid the degradation that poorly designed compression causes (see "Can agents compress their own memory without losing critical details?"). A broader claim ties this together: reliability comes not from a bigger model but from externalizing memory, skills, and protocols into a harness layer, so the model isn't re-solving the same state-tracking problem every turn (see "Where does agent reliability actually come from?").
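The four-component split can be sketched as a small data structure. The component names follow the note; the class names, method names, and the specific update policy (append to history, merge facts into the scratchpad, reset turn-level state every turn) are assumptions for illustration, not the cited system's implementation.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DialogueMemory:
    """Dialogue-level state: persists across the whole conversation."""
    history: list[str] = field(default_factory=list)
    scratchpad: dict[str, str] = field(default_factory=dict)


@dataclass
class TurnMemory:
    """Turn-level state: rebuilt for each turn, then discarded."""
    examples: list[str] = field(default_factory=list)
    trajectory: list[str] = field(default_factory=list)


@dataclass
class AgentMemory:
    dialogue: DialogueMemory = field(default_factory=DialogueMemory)
    turn: TurnMemory = field(default_factory=TurnMemory)

    def begin_turn(self, user_message: str, examples: list[str]) -> None:
        # Assumed policy: history only grows; turn memory starts fresh.
        self.dialogue.history.append(f"user: {user_message}")
        self.turn = TurnMemory(examples=list(examples))

    def record_step(self, step: str) -> None:
        self.turn.trajectory.append(step)

    def end_turn(self, reply: str, facts: dict[str, str]) -> None:
        # Assumed policy: keep durable facts, drop the per-turn trajectory.
        self.dialogue.history.append(f"agent: {reply}")
        self.dialogue.scratchpad.update(facts)
        self.turn = TurnMemory()
```

Keeping the two scales in separate objects makes the update policy explicit: nothing turn-level leaks into later turns unless `end_turn` deliberately promotes it to the scratchpad.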
A third mechanism only appears once you have multiple agents or longer interaction chains: coordination decays predictably with scale. Agents agree on strategies too late, or adopt them without telling their neighbors. Tellingly, they also accept incoming information without verifying it, which lets a single error propagate through the network even though each agent could detect a direct conflict if it looked (see "Why do multi-agent systems fail to coordinate at scale?"). That uncritical acceptance is the multi-agent cousin of the single model's premature lock-in.
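A minimal sketch of checking before accepting, assuming each agent holds its beliefs as a key-value map. The function name and data shape are hypothetical. It catches only direct conflicts, the kind the note says agents are capable of detecting, and does nothing for errors that contradict no local belief.

```python
from __future__ import annotations


def merge_neighbor_claims(
    local: dict[str, str], incoming: dict[str, str]
) -> tuple[dict[str, str], list[str]]:
    """Merge a neighbor's claims, flagging direct conflicts instead of overwriting."""
    merged = dict(local)
    conflicts: list[str] = []
    for key, value in incoming.items():
        if key in merged and merged[key] != value:
            # Direct conflict: keep the local value and surface it for resolution.
            conflicts.append(key)
        else:
            merged[key] = value
    return merged, conflicts
```

For example, merging `{"color": "blue", "size": "L"}` into a local `{"color": "red"}` keeps `red`, adds `size`, and reports `color` as a conflict to resolve rather than silently propagating it.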
The genuinely useful twist: more turns aren't always the problem; sometimes they're the cure. Test-time interaction scaling treats added environment steps as a distinct axis from deeper per-step reasoning, and on partially observable tasks the ability to explore, backtrack, and replan across turns is exactly what drives state-of-the-art results (see "Does agent interaction time scale separately from reasoning depth?"). So the question isn't really 'do more turns hurt?' but 'does your harness let the agent revise, or only accumulate?' The degradation comes from architectures that can't course-correct, can't structure their memory, and accept information uncritically, not from length itself.
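The difference between revising and merely accumulating can be shown on a toy partially observable task: a graph whose edges are revealed only when a node is visited. This is an illustrative sketch, not the method from the cited work; `GRAPH`, `observe`, and `explore` are hypothetical names.

```python
from __future__ import annotations

from typing import Callable, Optional

# Toy environment: B is a dead end, the goal D is reachable only through C.
GRAPH = {"A": ["B", "C"], "B": [], "C": ["D"], "D": []}


def observe(node: str) -> list[str]:
    """Reveal a node's neighbors only once the agent is standing on it."""
    return GRAPH[node]


def explore(
    start: str,
    goal: str,
    observe_fn: Callable[[str], list[str]],
    max_steps: int,
    allow_backtrack: bool,
) -> Optional[list[str]]:
    """Return a path to the goal, or None if stuck or out of step budget."""
    path = [start]
    visited = {start}
    for _ in range(max_steps):
        if not path:
            return None
        node = path[-1]
        if node == goal:
            return path
        unvisited = [n for n in observe_fn(node) if n not in visited]
        if unvisited:
            visited.add(unvisited[0])
            path.append(unvisited[0])
        elif allow_backtrack:
            path.pop()  # revise: undo the last commitment and try elsewhere
        else:
            return None  # accumulate-only: a wrong early choice is final
    return path if path and path[-1] == goal else None
```

Here `explore("A", "D", observe, 10, True)` recovers from the dead end at B and returns `["A", "C", "D"]`, while the same call with `allow_backtrack=False` returns `None`. Cutting `max_steps` to 3 also yields `None`, since recovery itself costs interaction steps.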
Sources (7 notes)
LLMs perform at 90% accuracy with single-message instructions but drop to 65% across natural conversation. Models lock into early guesses when information arrives gradually and cannot course-correct, a behavior induced by RLHF training that rewards helpfulness over clarification.
Research shows next-turn reward optimization structurally removes initiative from models, but proactive behaviors like critical thinking and clarification-seeking are trainable (0.15% to 73.98% with RL). The core challenge is balancing proactivity with civility to avoid intrusion.
RAISE shows that agent memory consists of four components organized by two design axes: dialogue-level (conversation history, scratchpad) versus turn-level (examples, task trajectory). This granularity distinction predicts different failure modes and update policies for each component.
DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.
Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.
AgentsNet benchmark shows agents fail to coordinate strategies either by agreeing too late or adopting strategies without informing neighbors. Agents accept neighbor information without verification, enabling error propagation while remaining capable of detecting direct conflicts.
Test-time interaction—increasing environment steps—enables exploration, backtracking, and replanning that per-step reasoning cannot achieve. Curriculum-based RL on rollout length produces SOTA web agents, showing interaction scaling dominates on tasks with partial observability.