Why do language models fail at coreference across long contexts?
This explores why models lose track of who or what a pronoun or name refers to as text gets longer. The corpus doesn't tackle coreference head-on, so the answer triangulates from work on linguistic structure, identity-tracking, relational queries, and length-driven decay.
This reads the question as: why do models lose the thread of *who* and *what* the words refer to once the context stretches out? No paper here studies coreference by name, but several circle the same territory from different angles, and together they suggest the failure isn't one bug but a stack of them. The most direct neighbor is the finding that LLMs make systematic grammatical errors that get predictably worse with structural complexity: they misidentify embedded clauses, verb phrases, and nested nominals, because statistical learning captures surface patterns rather than deep grammatical rules, which include the rules that bind a pronoun to its antecedent (see "Why do large language models fail at complex linguistic tasks?"). Coreference is exactly that kind of binding, so the same crack should show up wherever a sentence buries the referent under syntactic depth.
The more surprising contributor is identity itself. One line of work argues that a model never *commits* to a fixed character or entity. Instead it holds a superposition of mutually consistent possibilities and samples one at generation time, so regenerating the same passage yields a different but locally coherent reading (see "Do large language models actually commit to a single character?"). If there's no settled internal answer to "who is 'she'?", then coreference across a long span isn't being resolved and held; it's being re-improvised. That reframes the problem: the model isn't forgetting the antecedent, it never pinned it down in the first place.
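The difference between sampling a referent and committing to one can be made concrete with a toy model. The sketch below is only an illustration of the idea, not how any LLM works internally: the candidate table, the attribute names, and both resolver functions are hypothetical.

```python
import random

# Hypothetical toy world: each candidate entity has yes/no attributes.
CANDIDATES = {
    "Alice": {"is_doctor": True, "lives_in_paris": True},
    "Beth": {"is_doctor": True, "lives_in_paris": False},
    "Carla": {"is_doctor": False, "lives_in_paris": True},
}


def consistent_candidates(constraints):
    """Return every candidate that agrees with all constraints seen so far."""
    return sorted(
        name
        for name, attrs in CANDIDATES.items()
        if all(attrs[key] == value for key, value in constraints.items())
    )


def resolve_by_sampling(constraints, rng):
    """Pick a referent at generation time instead of storing one."""
    return rng.choice(consistent_candidates(constraints))


def resolve_by_commitment(constraints):
    """Contrast case: fix one referent deterministically and keep it."""
    return consistent_candidates(constraints)[0]


# The context only says that "she" is a doctor, so two candidates remain.
context = {"is_doctor": True}
rng = random.Random(0)
samples = {resolve_by_sampling(context, rng) for _ in range(50)}
print("sampled readings:", samples)
print("committed reading:", resolve_by_commitment(context))
```

Every sampled reading is consistent with the context, which is why each regeneration looks coherent on its own. Nothing guarantees that two regenerations agree with each other, and that is the failure mode the paragraph above describes.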
Length then turns a fragile process into a failing one. Reasoning accuracy collapses with input length *far below* the context window, dropping sharply at only a few thousand tokens of padding, in a way that doesn't track language-modeling quality and survives chain-of-thought prompting (see "Does reasoning ability actually degrade with longer inputs?"). So "the context fit" is not the same as "the context was usable." A related result locates the real bottleneck not in memory capacity but in the *compute* needed to consolidate distant context into usable internal state: the model has the tokens but hasn't done the work to integrate them (see "Is long-context bottleneck really about memory or compute?").
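A padding probe of this kind is easy to set up for coreference. The sketch below is a minimal illustration and not the benchmark's actual code: `pad_prompt`, the example sentences, and the use of whitespace-separated words as a stand-in for tokens are all assumptions.

```python
def pad_prompt(key_facts, question, filler_sentence, target_tokens):
    """Spread key facts through filler until the prompt is close to target_tokens.

    Assumption: whitespace-separated words stand in for tokens, and
    filler_sentence is non-empty.
    """

    def n_tokens(text):
        return len(text.split())

    base = n_tokens(" ".join(key_facts)) + n_tokens(question)
    n_filler = max(0, target_tokens - base) // n_tokens(filler_sentence)

    # One filler chunk before each fact, and one more before the question.
    chunks = len(key_facts) + 1
    per_chunk, extra = divmod(n_filler, chunks)

    parts = []
    for i in range(chunks):
        count = per_chunk + (1 if i < extra else 0)
        parts.extend([filler_sentence] * count)
        if i < len(key_facts):
            parts.append(key_facts[i])
    parts.append(question)
    return " ".join(parts)


# The antecedent and the pronoun end up separated by irrelevant text.
facts = ["Maria handed the report to Elena.", "She filed it on Monday."]
question = "Who filed the report?"
filler = "The weather was mild that week."
prompt = pad_prompt(facts, question, filler, target_tokens=300)
print(len(prompt.split()), "words")
```

Running the same facts and question at several values of `target_tokens` keeps the task fixed while only the distance between antecedent and pronoun grows, which isolates length as the variable.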
There's also a structural ceiling worth knowing about: long-context models can do semantic retrieval over a big window, but they fail on queries that require *relational joins*, where a record in one place has to be linked to a record in another (see "Can long-context LLMs replace retrieval-augmented generation systems?"). That result concerns joins across structured tables, but coreference is in effect the same operation: linking this entity here to that mention there. And when a referent in the text conflicts with what the model learned in training, the parametric prior can simply override the context, so the model resolves the pronoun to its expectation rather than to the passage (see "Why do language models ignore information in their context?").
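Treating coreference as a join can be sketched with a small mention table. This is a hedged illustration of the analogy only: the positions, surface forms, and entity ids are made up, and the gold entity ids are given here, whereas a real system has to infer them.

```python
# Hypothetical mention table: (token_position, surface_form, entity_id).
mentions = [
    (12, "Dr. Elena Ruiz", "E1"),
    (480, "the auditor", "E2"),
    (2950, "she", "E1"),
    (2990, "her report", "E1"),
]

# Hypothetical facts stated once, near the first mention of each entity.
facts = {"E1": {"role": "surgeon"}, "E2": {"role": "auditor"}}


def join_mentions(mentions, facts):
    """Group mentions by entity id and attach that entity's facts."""
    chains = {}
    for position, surface, entity_id in mentions:
        chains.setdefault(entity_id, []).append((position, surface))
    return {
        entity_id: {"mentions": sorted(chain), **facts.get(entity_id, {})}
        for entity_id, chain in chains.items()
    }


def span(chain):
    """Distance in tokens between the first and last mention of an entity."""
    positions = [position for position, _ in chain["mentions"]]
    return max(positions) - min(positions)


joined = join_mentions(mentions, facts)
print(joined["E1"]["role"], span(joined["E1"]))
```

Answering "what is her role?" at position 2950 requires the row at position 12, so the query is a join on entity id across nearly three thousand tokens and not a lookup of nearby text.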
The thing you may not have expected: this looks less like a memory problem than the phrase "long context" implies. The corpus points to a model that doesn't firmly fix entities, can't reliably perform the relational joins coreference demands, degrades well before its window is full, and will overwrite the text with its priors when they're strong. Coreference fails at long range because all four are true at once. The same fragility shows up in multi-turn conversations, where models drift from the user's actual intent as the exchange lengthens (see "Why do language models lose performance in longer conversations?").
Sources (7 notes)
Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, indicating that no fixed commitment exists.
FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.
Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.
The LOFT benchmark shows LCLMs match RAG on semantic retrieval without explicit training, but cannot execute relational queries requiring joins across structured tables. Context length alone cannot bridge this gap.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
LLMs degrade in multi-turn settings because RLHF training rewards premature answers over clarification-seeking, creating a pragmatic mismatch with individual user behaviors. A Mediator-Assistant architecture that explicitly parses user intent before execution recovers lost performance without retraining.