Why do longer context windows alone fail to capture temporal dynamics in dialogue?
This explores why simply expanding how much text a model can hold in context doesn't let it track how a conversation *moves*—the rhythm, drift, and evolving intent that unfold across turns.
The corpus points to a clean separation: context length is about storage, but temporal dynamics are about *transformation*, and more of one does not substitute for the other. One line of research argues the real bottleneck isn't memory at all but the compute needed to consolidate earlier turns into internal state; bigger windows just hold more raw tokens without ever digesting them into something the model reasons *from* (Is long-context bottleneck really about memory or compute?). Tellingly, models start losing the thread well before they run out of room: reasoning accuracy can fall from 92% to 68% with about 3,000 tokens of padding, far below any context limit (Does reasoning ability actually degrade with longer inputs?). More space, in other words, can actively hurt.
The deeper issue is that a longer window is still a *flat* window: it treats turn 1 and turn 20 as interchangeable tokens, when dialogue is a layered temporal object. One framing models conversation as four simultaneous streams (linguistic complexity, emotional trajectory, topic coherence, relevance) that evolve as trajectories, not snapshots; statistical pooling over a big buffer flattens exactly the structure that matters (Can tracking dialogue dimensions simultaneously reveal hidden conversation patterns?). Relatedly, LLMs are weaker at temporal ordering than at causal links, because causal connectives appear explicitly in training data while temporal order must be inferred. So even with the tokens present, the model doesn't natively encode *when* relative to *what* (Why do LLMs handle causal reasoning better than temporal reasoning?).
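To make the flattening point concrete, here is a minimal Python sketch under stated assumptions. The per-turn scores are invented, `slope` is a hypothetical helper, and mean pooling plus a least-squares slope merely stand in for whatever pooling and trajectory features a real system would use. It is not the method of the cited work.

```python
from statistics import mean

def slope(values):
    """Least-squares slope of `values` against turn index 0..n-1."""
    xs = list(range(len(values)))
    x_bar = mean(xs)
    y_bar = mean(values)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den

# Invented per-turn "emotional trajectory" scores for two conversations.
warming = [0.1, 0.3, 0.5, 0.7, 0.9]   # the mood improves turn by turn
cooling = list(reversed(warming))     # the same values in the opposite order

# Pooling over the whole buffer cannot tell the two conversations apart.
pooled_warming = mean(warming)
pooled_cooling = mean(cooling)

# A trajectory feature (here, the trend across turns) separates them.
trend_warming = slope(warming)
trend_cooling = slope(cooling)
```

The two conversations pool to the same average while their trends have opposite signs, which is the sense in which a statistic over a flat buffer discards the ordering that a trajectory view keeps.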
Dialogue also has a two-sided, belief-tracking quality that raw context can't supply. Human conversation builds shared context cooperatively, each turn renegotiating common ground; a prompt instead bundles utterance and context into a single static frame the model can't revise mid-stream (How do prompts reshape the role of context in AI conversation?). Frameworks that *do* capture temporal progression add machinery the token stream lacks: collaborative rational speech acts (CRSA) give an information-theoretic account of how partial understanding becomes shared across turns (Can dialogue systems track both speakers' beliefs across turns?). That is precisely the bidirectional belief evolution a longer buffer never represents.
What looks like a 'memory' failure is often an alignment or consistency failure that more context can't touch. Multi-turn degradation has been traced not to lost capability but to a pragmatic gap: RLHF rewards premature answers over clarification, so the model drifts from user intent regardless of window size, which is fixable by parsing intent explicitly, not by adding tokens (Why do language models lose performance in longer conversations?). Persona drift compounds this: models sample a fresh character each generation rather than committing to one (Do large language models actually commit to a single character?), and reducing that drift takes turn-level consistency rewards, not bigger context (Can training user simulators reduce persona drift in dialogue?).
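The "fresh character each generation" behavior can be illustrated with a toy simulation. This is only a sketch: the object table, attribute names, and the `consistent_objects` and `reveal` helpers are all hypothetical, and a uniform random choice stands in for a model's sampling distribution.

```python
import random

# Hypothetical toy world for a 20-questions game: yes/no attributes per object.
OBJECTS = {
    "cat":    {"alive": True,  "bigger_than_breadbox": False},
    "horse":  {"alive": True,  "bigger_than_breadbox": True},
    "oak":    {"alive": True,  "bigger_than_breadbox": True},
    "teapot": {"alive": False, "bigger_than_breadbox": False},
}

def consistent_objects(answers):
    """Names of all objects that agree with every answer given so far."""
    return sorted(
        name for name, attrs in OBJECTS.items()
        if all(attrs[question] == answer for question, answer in answers.items())
    )

def reveal(answers, rng):
    """Pick the 'secret' object only at reveal time, from whatever is still consistent."""
    return rng.choice(consistent_objects(answers))

# Answers already given in the conversation so far.
answers = {"alive": True, "bigger_than_breadbox": True}

# Regenerating the reveal with different seeds mimics regenerating a response.
reveals = {reveal(answers, random.Random(seed)) for seed in range(20)}
```

Every regenerated reveal is consistent with the conversation so far, yet nothing forces two regenerations to agree, because no single object was ever fixed in advance. A longer context would not change that: the earlier answers are already fully available to `reveal`.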
The interesting twist is what *does* help, and it isn't length. Recursive approaches that treat a long prompt as an external environment to be queried, rather than as one giant attention pass, outperform base models even on short inputs, because they avoid attention degradation (Can models treat long prompts as external code environments?). And in-context learning of sequential behavior depends on *trajectory* structure (same-environment sequences, not isolated examples), suggesting models capture dynamics only when the temporal shape is handed to them explicitly (Why do trajectories matter more than individual examples for in-context learning?). The throughline: temporal dynamics are something you have to *architect*, not something that emerges from giving the model more room to read.
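The idea of treating a long prompt as an environment to query can be sketched in a few lines of Python. This is an illustration under assumptions, not the cited system: `PromptEnvironment` and its `search` method are hypothetical names, the transcript is invented, and a regex search stands in for whatever queries a model would actually write.

```python
import re

class PromptEnvironment:
    """Hypothetical wrapper that keeps a long transcript outside the model's attention."""

    def __init__(self, text):
        self.lines = text.splitlines()

    def search(self, pattern, window=1):
        """Return (line_index, snippet) pairs for lines matching `pattern`.

        Each snippet includes up to `window` neighboring lines on either side.
        """
        rx = re.compile(pattern, re.IGNORECASE)
        hits = []
        for i, line in enumerate(self.lines):
            if rx.search(line):
                lo = max(0, i - window)
                hi = min(len(self.lines), i + window + 1)
                hits.append((i, "\n".join(self.lines[lo:hi])))
        return hits

# Invented transcript: 5,000 filler turns with one relevant turn buried inside.
turns = [f"turn {i}: small talk" for i in range(5000)]
turns[3210] = "turn 3210: user changes the delivery date to Friday"
env = PromptEnvironment("\n".join(turns))

# Query the environment instead of reading the whole transcript.
hits = env.search(r"delivery date")
snippet_chars = sum(len(snippet) for _, snippet in hits)
total_chars = sum(len(line) for line in env.lines)
```

In a setup like this, only the returned snippet (a small fraction of the transcript) would need to be placed in front of the model, and the line index preserves where in the conversation the turn occurred.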
Sources (11 notes)
Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.
FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.
Conversational DNA encodes four simultaneous dimensions—linguistic complexity, emotional trajectories, topic coherence, and conversational relevance—as temporal streams. The reverse Turing test finding showed expert assessments of AI diverged sharply, suggesting conversational structure shapes interpretation as much as content.
ChatGPT excels at causal relations but struggles with temporal ordering because causal connectives are explicit and frequent in training data, while temporal order is often implicit and must be inferred contextually.
LLM prompts bundle utterance, context assignment, and role specification into a single static frame the model cannot renegotiate, unlike human dialogue where context evolves cooperatively. This makes mid-conversation pivots require explicit re-prompting rather than implicit adjustment.
CRSA integrates rate-distortion theory with RSA to enable bidirectional belief tracking across dialogue turns. Demonstrated on referential games and doctor-patient dialogues, it captures progression from partial to shared understanding, providing the information-theoretic framework that token-level LLM systems lack.
LLMs degrade in multi-turn settings because RLHF training rewards premature answers over clarification-seeking, creating pragmatic mismatch with individual user behaviors. A Mediator-Assistant architecture that explicitly parses user intent before execution recovers lost performance without retraining.
Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, proving no fixed commitment exists.
By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.
Recursive Language Models store long prompts in a Python REPL and query them via code execution, avoiding attention degradation. RLMs outperform base models even on shorter prompts while handling inputs two orders of magnitude beyond context windows.
In-context learning for sequential decision-making requires full or partial trajectories from the same environment level, not just isolated examples. This structural property—trajectory burstiness—allows models to generalize across vastly different tasks without weight updates.