INQUIRING LINE

How does era sensitivity in legal cases compound with context length failures?

This explores whether two separate weaknesses — LLMs doing worse on older legal precedent, and reasoning breaking down as inputs get longer — share a common root and stack on top of each other when you feed a model a long brief about a historical case.


The corpus suggests they are not independent failures at all: both trace back to the same thing, which is how unevenly the training data is distributed.

Start with era sensitivity. On a Supreme Court overruling benchmark of 236 case pairs, models reliably do worse on historical cases than modern ones. The diagnosed cause is not that older law is harder: recent cases flood the training corpus while older precedent is thinly represented, leaving the model with shallower internal representations of it (see "Why do language models struggle with historical legal cases?"). That is not a quirk of legal text. The same fingerprint shows up in temporal reasoning generally: models stay competent on short, structured time questions but start generating impossible timelines in long, open-ended contexts, and that breakdown tracks training data distribution as the model falls back on frequency heuristics instead of structured reasoning (see "Why do language models fail at temporal reasoning in complex tasks?"). In both cases the model leans on what was common in training when the task gets hard.

Now the context-length half, and here is the part most people underestimate. Reasoning accuracy does not degrade only near the context window limit: on FLenQA it drops from 92% to 68% with just 3,000 tokens of padding, far below capacity, and chain-of-thought prompting does not rescue it (see "Does reasoning ability actually degrade with longer inputs?"). A complementary view locates the bottleneck not in memory capacity but in the compute needed to consolidate context into usable internal state (see "Is long-context bottleneck really about memory or compute?"). So a long historical case file is taxed twice: the model already has a thin grip on the era, and the length itself erodes whatever reasoning it could muster.
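The padding effect described above can be probed with a small harness. The following is a minimal sketch, not the FLenQA protocol: `ask_model` is a hypothetical callable standing in for whatever model client is used, the filler sentence is invented, and whitespace-separated words are used as a rough proxy for tokens.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical neutral filler; any irrelevant text would do.
FILLER = "The clerk filed the record without comment."


def pad_prompt(question: str, n_padding_words: int, filler: str = FILLER) -> str:
    """Prepend n_padding_words of irrelevant text to the question."""
    words = filler.split()
    padding = [words[i % len(words)] for i in range(n_padding_words)]
    prefix = " ".join(padding)
    return f"{prefix}\n\n{question}" if prefix else question


def accuracy_by_padding(
    ask_model: Callable[[str], str],
    items: List[Tuple[str, str]],
    padding_sizes: List[int],
) -> Dict[int, float]:
    """Return exact-match accuracy at each padding size.

    items is a non-empty list of (question, expected_answer) pairs.
    """
    results: Dict[int, float] = {}
    for size in padding_sizes:
        correct = sum(
            ask_model(pad_prompt(q, size)).strip().lower() == a.strip().lower()
            for q, a in items
        )
        results[size] = correct / len(items)
    return results


def relative_drop(before: float, after: float) -> float:
    """Fraction of the original accuracy that was lost."""
    return (before - after) / before
```

Applied to the reported figures, `relative_drop(0.92, 0.68)` comes to roughly 0.26: a 24-point fall is about a quarter of the model's original accuracy. Running `accuracy_by_padding` separately on modern and historical case questions would show whether the padding penalty is steeper for the older era.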

The compounding mechanism becomes clearer through a third lens: failures are driven by instance-level unfamiliarity, not raw complexity. Models can complete a reasoning chain of any length when they have seen similar instances, and they fail at novelty boundaries, because they fit instance patterns rather than general algorithms (see "Do language models fail at reasoning due to complexity or novelty?"). A historical case is precisely a low-familiarity instance, and a long document pushes more of that unfamiliar material through exactly the conditions where reasoning is most fragile. The two weaknesses do not simply add: the era gap makes the content novel, and length multiplies the cost of novel content.
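The difference between adding and multiplying can be made concrete with a toy model. This is purely illustrative: the functional forms and the coefficients `a`, `b`, and `c` are assumptions chosen for the example, not values measured in any of the cited notes. Both inputs are scaled to the range 0 to 1.

```python
def additive_error(novelty: float, length: float, a: float = 0.2, b: float = 0.2) -> float:
    """Independent failures: each weakness contributes its own penalty."""
    return min(1.0, a * novelty + b * length)


def multiplicative_error(novelty: float, length: float, c: float = 0.4) -> float:
    """Compounding failures: length only hurts in proportion to novelty."""
    return min(1.0, c * novelty * length)
```

Under the additive form, a long document about a fully familiar case (`novelty = 0`) still pays the length penalty. Under the multiplicative form it pays nothing, and the cost appears only when unfamiliar content and length coincide, which is the pattern the novelty-boundary finding would predict.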

The corpus also points to a defense that sidesteps both. A multilingual RAG system built for noisy, drifting historical newspapers wins not by reasoning harder but by aggressively expanding retrieval while forcing the model to refuse any answer it cannot ground in evidence, trading coverage for integrity exactly where source quality is degraded (see "Can RAG systems refuse to answer without reliable evidence?"). The catch is that long context alone will not substitute for this: on the LOFT benchmark, long-context models match retrieval on semantic tasks but fail on structured, relational queries, so stuffing the whole case file into the window gives up retrieval's grounding while still paying the length penalty (see "Can long-context LLMs replace retrieval-augmented generation systems?"). The escape hatch is not a bigger window. It is grounding plus the discipline to abstain when the era is thin and the document is long.
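A grounded-refusal step of this kind might look like the sketch below. It is not the cited system's implementation: the `Passage` type, the relevance `score`, the `min_score` floor, the `top_k` value, the prompt wording, and the `generate` callable are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

REFUSAL = "I cannot answer this from the available evidence."


@dataclass
class Passage:
    text: str
    score: float  # retrieval relevance in [0, 1]; how it is computed is an assumption


def select_evidence(
    passages: List[Passage], min_score: float = 0.5, top_k: int = 20
) -> List[Passage]:
    """Cast a wide net (large top_k), then keep only passages above the relevance floor."""
    ranked = sorted(passages, key=lambda p: p.score, reverse=True)[:top_k]
    return [p for p in ranked if p.score >= min_score]


def build_grounded_prompt(question: str, evidence: List[Passage]) -> str:
    """Build a prompt that restricts the model to the numbered passages."""
    numbered = "\n".join(f"[{i}] {p.text}" for i, p in enumerate(evidence, start=1))
    return (
        "Answer using only the numbered passages below and cite them by number. "
        f"If they do not contain the answer, reply exactly: {REFUSAL}\n\n"
        f"{numbered}\n\nQuestion: {question}"
    )


def answer_or_refuse(
    question: str,
    passages: List[Passage],
    generate: Callable[[str], str],
    min_score: float = 0.5,
) -> str:
    """Abstain without calling the model when no passage clears the floor."""
    evidence = select_evidence(passages, min_score)
    if not evidence:
        return REFUSAL
    return generate(build_grounded_prompt(question, evidence))
```

The design choice is that refusal happens at two points: before generation, when retrieval finds nothing usable, and inside the prompt, when the retrieved passages turn out not to contain the answer. Passing only the selected passages also keeps the input short, which limits exposure to the length penalty.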


Sources (7 notes)

Why do language models struggle with historical legal cases?

Supreme Court overruling benchmark (236 pairs) reveals era sensitivity: models perform worse on historical cases than modern ones. Root cause is training corpus over-representation of recent cases, creating shallower representations of older precedent.

Why do language models fail at temporal reasoning in complex tasks?

LLMs maintain basic temporal competence in simple structured formats but generate temporally impossible relationships in long, open-ended contexts. This degradation tracks training data distribution and emerges as models rely on frequency heuristics rather than structured reasoning under complexity.

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Do language models fail at reasoning due to complexity or novelty?

LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.

Can RAG systems refuse to answer without reliable evidence?

A multilingual RAG system for noisy historical newspapers succeeds by aggressively expanding retrieval while constraining generation to only grounded answers. The grounded-refusal prompt prevents hallucination when OCR errors and language drift degrade source quality, trading coverage for integrity.

Can long-context LLMs replace retrieval-augmented generation systems?

The LOFT benchmark shows LCLMs match RAG on semantic retrieval without explicit training, but cannot execute relational queries requiring joins across structured tables. Context length alone cannot bridge this gap.
