Why do language models fail at pronouns across distant segments?
This explores why pronoun resolution — figuring out who 'she' or 'it' refers to — breaks down when the antecedent sits far back in the text, and the corpus suggests it's less a grammar bug than a collision of three weaknesses: distance, structural complexity, and the way models hold identity loosely.
This explores why pronouns lose their referents across distance, and the corpus doesn't have a paper on coreference specifically — but it has the ingredients to explain it from three angles that rarely get stitched together. The first is plain distance. Reasoning and tracking accuracy fall off sharply as inputs get longer, well before the context window is anywhere near full — accuracy dropping from 92% to 68% with just a few thousand tokens of filler Does reasoning ability actually degrade with longer inputs?. A pronoun whose antecedent is twenty turns or several paragraphs back is exactly the kind of long-range dependency this degradation eats first, regardless of how 'easy' the link looks.
The second angle is grammar that was never really learned. Top-tier models systematically misidentify embedded clauses, complex nominals, and recursive structures, and the failure worsens predictably with syntactic depth Why do large language models fail at complex linguistic tasks? Does LLM grammatical performance decline with structural complexity?. The takeaway is that models captured surface heuristics rather than structural rules — and pronoun binding *is* a structural rule. So when the sentence between pronoun and antecedent is tangled, the model has no real grammar to fall back on; it's pattern-matching on proximity and salience instead.
The third, and most surprising, angle is that the model may never have committed to a fixed referent in the first place. Shanahan's 20-questions test shows LLMs hold a *superposition* of consistent characters and sample one at generation time rather than locking onto a single entity Do large language models actually commit to a single character?. If 'who' a name or character refers to is itself a distribution being resampled, a pronoun pointing back at it across distance is resolving against a moving target — coherence is reconstructed on the fly, not retrieved from a stable memory.
Two more corpus findings explain why distance specifically makes this worse. Models lock into premature early guesses they can't recover from in gradually-revealed text Why do language models fail in gradually revealed conversations?, and they integrate context poorly whenever strong training priors compete with what's actually on the page — textual cues alone can't override the prior Why do language models ignore information in their context?. A distant antecedent is weak, in-context signal; a statistically 'expected' referent is a strong prior. Across distance, the prior wins.
What you didn't know you wanted to know: the most promising fix in the corpus isn't bigger context or better grammar training — it's teaching models *what to ignore*. Topic-following work shows that fine-tuning on a tiny set of dialogues with distractor turns sharply improves a model's ability to hold a thread, because the gap was never capacity but an absent training signal for resisting diversion Why do language models engage with conversational distractors?. Pronoun drift across segments may be the same gap wearing a linguistic costume.
Sources 7 notes
FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.
Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.
Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, proving no fixed commitment exists.
Across 200,000+ conversations, all major LLMs show 39% average performance drop in multi-turn settings due to locking into incorrect early guesses. Agent mitigations recover only 15-20% of this loss.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
Fine-tuning on just 1,080 synthetic dialogues with distractor turns significantly improves topic resilience, revealing that the gap is not model capacity but absent training signal. Models learn to follow what-to-do instructions but not what-to-ignore instructions.