Why do language models fail at pronouns across distant segments?

This explores why pronoun resolution (figuring out who 'she' or 'it' refers to) breaks down when the antecedent sits far back in the text. The corpus suggests it's less a grammar bug than a collision of three weaknesses: distance, structural complexity, and the way models hold identity loosely.

The corpus doesn't have a paper on coreference specifically, but it has the ingredients to explain the failure from three angles that rarely get stitched together. The first is plain distance. Reasoning accuracy falls off sharply as inputs get longer, well before the context window is anywhere near full: accuracy drops from 92% to 68% with just a few thousand tokens of filler (see "Does reasoning ability actually degrade with longer inputs?"). A pronoun whose antecedent is twenty turns or several paragraphs back is exactly the kind of long-range dependency this degradation eats first, regardless of how 'easy' the link looks.
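One way to probe the distance effect is to hold the antecedent and the pronoun question fixed and vary only the amount of neutral filler between them. The sketch below only builds the prompts; it does not call a model. The names `FILLER`, `build_probe`, and `distance_sweep`, the example sentences, and the filler counts are illustrative assumptions, not taken from the cited work.

```python
import random

# Hypothetical filler: neutral sentences with no people and no pronouns,
# so they add distance without adding competing antecedents.
FILLER = [
    "The weather report mentioned light rain in the afternoon.",
    "Traffic on the main road was slower than usual.",
    "The library extended its opening hours for the week.",
]


def build_probe(setup, question, n_filler, seed=0):
    """Place n_filler neutral sentences between the antecedent and the question."""
    rng = random.Random(seed)  # seeded so the same probe can be rebuilt
    padding = [rng.choice(FILLER) for _ in range(n_filler)]
    return " ".join([setup, *padding, question])


def distance_sweep(setup, question, filler_counts=(0, 10, 50)):
    """Return one prompt per filler count; only the distance changes."""
    return {n: build_probe(setup, question, n) for n in filler_counts}


setup = "Maria handed the report to Elena before the meeting."
question = "Who does 'she' refer to in: 'She had written it overnight'?"
probes = distance_sweep(setup, question)

for n, prompt in probes.items():
    print(n, len(prompt.split()))  # filler count and prompt length in words
```

Scoring each prompt against the same gold referent would show whether accuracy falls as the filler count grows while the linguistic link stays identical.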

The second angle is grammar that was never really learned. Top-tier models systematically misidentify embedded clauses, complex nominals, and recursive structures, and the failure worsens predictably with syntactic depth (see "Why do large language models fail at complex linguistic tasks?" and "Does LLM grammatical performance decline with structural complexity?"). The takeaway is that models captured surface heuristics rather than structural rules, and pronoun binding *is* a structural rule. So when the sentence between pronoun and antecedent is tangled, the model has no real grammar to fall back on; it pattern-matches on proximity and salience instead.

The third, and most surprising, angle is that the model may never have committed to a fixed referent in the first place. Shanahan's 20-questions test shows LLMs hold a *superposition* of consistent characters and sample one at generation time rather than locking onto a single entity (see "Do large language models actually commit to a single character?"). If 'who' a name or character refers to is itself a distribution being resampled, a pronoun pointing back at it across distance is resolving against a moving target: coherence is reconstructed on the fly, not retrieved from a stable memory.
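If referents really are resampled rather than stored, regenerating the same pronoun question several times should sometimes produce different answers. A minimal sketch of how that could be quantified, assuming the sampled answers have already been collected as strings: the helper `referent_agreement` and the example answers are hypothetical, and this is not the procedure used in the 20-questions test.

```python
from collections import Counter


def referent_agreement(answers):
    """Return the most frequent referent and the share of samples naming it.

    A share of 1.0 means every regeneration picked the same referent;
    lower values mean the referent shifts between samples.
    """
    normalized = [a.strip().lower() for a in answers]
    if not normalized:
        raise ValueError("need at least one sampled answer")
    referent, count = Counter(normalized).most_common(1)[0]
    return referent, count / len(normalized)


# Hypothetical answers from four regenerations of the same pronoun question.
samples = ["Maria", "maria ", "Elena", "Maria"]
top, share = referent_agreement(samples)
print(top, share)
```

Comparing this share at short and long antecedent distances would separate "the model lost the referent" from "the model never held one referent steadily".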

Two more corpus findings explain why distance specifically makes this worse. When text is revealed gradually, models lock into early guesses they can't recover from (see "Why do language models fail in gradually revealed conversations?"), and they integrate context poorly whenever strong training priors compete with what's actually on the page; textual cues alone can't override the prior (see "Why do language models ignore information in their context?"). A distant antecedent is a weak in-context signal; a statistically 'expected' referent is a strong prior. Across distance, the prior wins.

What you didn't know you wanted to know: the most promising fix in the corpus isn't bigger context or better grammar training. It's teaching models *what to ignore*. Topic-following work shows that fine-tuning on a tiny set of dialogues with distractor turns sharply improves a model's ability to hold a thread, because the gap was never capacity but an absent training signal for resisting diversion (see "Why do language models engage with conversational distractors?"). Pronoun drift across segments may be the same gap wearing a linguistic costume.
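To make "dialogues with distractor turns" concrete, here is a rough sketch of how such an example could be assembled: off-topic turns are interleaved into an on-topic conversation and labeled, so a training signal for ignoring them can be attached. The function `insert_distractors`, the `every` parameter, and the sample turns are assumptions for illustration; the corpus note does not describe how its synthetic dialogues were built.

```python
def insert_distractors(turns, distractors, every=2):
    """Insert one distractor after every `every` on-topic turns.

    Each item is tagged so downstream code knows which turns
    the model is supposed to resist engaging with.
    """
    out = []
    pool = iter(distractors)
    for i, turn in enumerate(turns, start=1):
        out.append({"text": turn, "distractor": False})
        if i % every == 0:
            extra = next(pool, None)  # stop inserting once distractors run out
            if extra is not None:
                out.append({"text": extra, "distractor": True})
    return out


# Hypothetical on-topic dialogue about a report, plus one off-topic turn.
on_topic = [
    "Maria drafted the quarterly report.",
    "Elena reviewed the numbers.",
    "The deadline is Friday.",
    "She still needs to sign it.",
]
dialogue = insert_distractors(on_topic, ["By the way, who won the match last night?"])

for item in dialogue:
    print("DISTRACTOR" if item["distractor"] else "on-topic  ", item["text"])
```

The final pronoun ('She') still has to resolve to an antecedent that now sits on the far side of a distractor, which is the situation the paragraph above suggests training against.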

Sources (7 notes)

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Does LLM grammatical performance decline with structural complexity?

LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.

Do large language models actually commit to a single character?

Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, which indicates that no fixed commitment exists.

Why do language models fail in gradually revealed conversations?

Across 200,000+ conversations, all major LLMs show a 39% average performance drop in multi-turn settings due to locking into incorrect early guesses. Agent mitigations recover only 15-20% of this loss.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Why do language models engage with conversational distractors?

Fine-tuning on just 1,080 synthetic dialogues with distractor turns significantly improves topic resilience, revealing that the gap is not model capacity but absent training signal. Models learn to follow what-to-do instructions but not what-to-ignore instructions.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a probe testing whether language models' pronoun-resolution failures across distant segments remain constrained by the factors a curated library identified, or whether newer architectures, training methods, and evaluation tooling have shifted the bottleneck. The question remains open: *Why do LMs fail at pronouns across distance?*

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026, rooted in five key constraints:
• Reasoning accuracy drops from 92% to 68% with just a few thousand tokens of filler, well below context-window limits (2024-02).
• Models systematically misidentify embedded clauses and recursive structures; pronoun binding—a structural rule—degrades predictably with syntactic depth (2025-03).
• LLMs hold superpositions of character identity rather than locked-in referents; pronouns resolve against a moving target resampled at generation time (implicit in 2025-02 work on latent thought vectors).
• Models make premature assumptions in multi-turn dialogue and cannot recover; distant antecedents (weak signal) lose to strong training priors (2026-02).
• Fine-tuning on distractor-rich dialogues sharply improves thread-holding, suggesting the gap is a training signal for *resisting diversion*, not raw capacity (2024-04).

Anchor papers (verify; mind their dates):
• arXiv:2402.14848 (2024-02): input length and reasoning degradation.
• arXiv:2503.19260 (2025-03): grammatical blind spots and structural depth.
• arXiv:2602.07338 (2026-02): intent mismatch in multi-turn collapse.
• arXiv:2602.06176 (2026-02): reasoning failures.

Your task:
(1) RE-TEST EACH CONSTRAINT. For reasoning degradation, syntactic blind spots, superposition-of-referents, premature locking, and prior override: has newer scaling, architectural change (e.g. extended-context training, structured state), or instruction-tuning on coreference-specific tasks since relaxed any? Separate the durable question (which structural rules remain unlearned?) from the perishable limitation (is 92%→68% still the curve?).
(2) Surface the strongest contradicting or superseding work from the last ~6 months on coreference, dialogue coherence, or structural reasoning that refutes or bypasses these findings.
(3) Propose 2 research questions that assume the regime *has* moved: e.g., if distractor-tuning works, does it scale to truly adversarial or ambiguous referents? If superposition is the model's native state, can we intervene at decoding time to lock a referent?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
