Why do large language models fail at temporal reasoning in complex legal cases?
This explores why LLMs stumble on time-ordering in hard legal cases — and the corpus suggests it's not really about law at all, but about how these models handle time, complexity, and unfamiliar material everywhere.
The corpus treats "legal" as almost incidental: the failure is really three separate weaknesses stacked on top of each other. The first is about time itself. LLMs are decent at causal reasoning ("X caused Y") because causal connectives are spelled out explicitly and often in training text, while temporal order is usually implicit and has to be inferred from context, so it's the weaker muscle from the start (Why do LLMs handle causal reasoning better than temporal reasoning?). There's even an argument that the model has no real sense of time at all: token generation is sequential but atemporal, ordering tokens by probability without any duration or reflection between them (Does AI text generation unfold through temporal reflection?).
The second weakness is complexity, and here the corpus is sharp: models keep basic temporal competence in short, structured prompts but start producing temporally impossible relationships in long, open-ended ones, falling back on frequency heuristics instead of structured reasoning as the input gets messier (Why do language models fail at temporal reasoning in complex tasks?). And "long" arrives sooner than you'd think: reasoning accuracy can drop from 92% to 68% with only about 3,000 tokens of padding, far below the context window limit, even with chain-of-thought (Does reasoning ability actually degrade with longer inputs?). A complex legal case is exactly this: a long, tangled, multi-party timeline.
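To make "temporally impossible relationships" concrete, here is a minimal sketch of how one could check a model's output for them. It assumes the model's claims have already been extracted as (earlier, later) pairs; that extraction step, the function name `find_consistent_order`, and the example events are all hypothetical and not taken from the cited notes. The check itself is an ordinary topological sort (Kahn's algorithm): if no ordering satisfies every pair, the claims contradict each other.

```python
from collections import defaultdict, deque


def find_consistent_order(before_pairs):
    """Return one event order satisfying every (earlier, later) pair,
    or None if the pairs contradict each other (contain a cycle)."""
    successors = defaultdict(set)
    indegree = defaultdict(int)
    events = set()
    for earlier, later in before_pairs:
        events.update((earlier, later))
        if later not in successors[earlier]:  # ignore duplicate claims
            successors[earlier].add(later)
            indegree[later] += 1

    # Start with events that nothing is claimed to precede.
    ready = deque(sorted(e for e in events if indegree[e] == 0))
    order = []
    while ready:
        event = ready.popleft()
        order.append(event)
        for nxt in sorted(successors[event]):
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    # Events left unplaced sit on a cycle: "A before B" and "B before A".
    return order if len(order) == len(events) else None


# Hypothetical claims extracted from a model's summary of a case.
consistent = [("contract signed", "breach"), ("breach", "suit filed")]
print(find_consistent_order(consistent))  # ['contract signed', 'breach', 'suit filed']

impossible = consistent + [("suit filed", "contract signed")]
print(find_consistent_order(impossible))  # None
```

A check like this only catches claims that contradict each other. It says nothing about whether a self-consistent timeline matches what actually happened.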
The third weakness is specific to law as a domain. Models perform measurably worse on historical legal cases than modern ones, because training corpora over-represent recent material, which leaves models with shallower representations of older precedent (Why do language models struggle with historical legal cases?). That matters for temporal reasoning in law because legal time-ordering often hinges on which precedent came first and whether a later case overruled an earlier one, which is exactly the older material the model knows least well.
Here's the part you might not expect: the deeper cause may not be "complexity" but unfamiliarity. One line of work argues that reasoning models don't break at some complexity threshold; they break at instance-novelty boundaries, fitting patterns from similar examples they've seen rather than running a general algorithm (Do language models fail at reasoning due to complexity or novelty?). A novel legal fact pattern with an unusual timeline is novel twice over. Related work reframes some "reasoning" collapses as execution failures: the model knows the procedure but can't carry out many steps in pure text, and does better when given tools (Are reasoning model collapses really failures of reasoning?). And the unsettling "potemkin" pattern shows models can correctly explain a concept, fail to apply it, and recognize the failure, with explanation and execution running on disconnected tracks (Can LLMs understand concepts they cannot apply?). So a model can recite the rule for ordering precedents and still get the order wrong.
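One way to picture the tool-use point in a legal setting: instead of asking the model to order precedents in free text, hand the ordering to a deterministic helper. The sketch below is illustrative only. The `Case` type, both function names, and the case names and dates are hypothetical, and nothing here comes from the cited notes. It sorts cases by decision date and flags any claimed overruling in which the overruling case was not decided after the case it supposedly overruled.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Case:
    name: str
    decided: date


def order_precedents(cases):
    """Return the cases sorted from earliest to latest decision date."""
    return sorted(cases, key=lambda case: case.decided)


def impossible_overrulings(claims):
    """Given (overruling_case, overruled_case) pairs, return the pairs
    where the overruling case was not decided strictly later."""
    return [
        (overruling, overruled)
        for overruling, overruled in claims
        if overruling.decided <= overruled.decided
    ]


# Hypothetical cases; names and dates are invented for illustration.
older = Case("Hypothetical v. Earlier", date(1950, 3, 1))
newer = Case("Hypothetical v. Later", date(1990, 6, 15))

print([case.name for case in order_precedents([newer, older])])
print(impossible_overrulings([(older, newer)]))  # the claim is flagged
print(impossible_overrulings([(newer, older)]))  # nothing is flagged
```

The design choice is to keep the model responsible for reading the opinions and leave the date arithmetic to code, which does not degrade with timeline length.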
The through-line, drawn across all these: this isn't a legal-knowledge gap you fix with more case law. It's the predictable behavior of a probability machine that handles implicit relations worse than explicit ones, degrades with length, and leans on familiarity over algorithm. The pattern is general enough that researchers can forecast where it'll appear from the model's autoregressive nature alone (Can we predict where language models will fail?).
Sources (9 notes)
ChatGPT excels at causal relations but struggles with temporal ordering because causal connectives are explicit and frequent in training data, while temporal order is often implicit and must be inferred contextually.
Token ordering in LLMs follows probabilistic selection without intervening reflection or revision. Human discourse gains meaning from temporal structure—time spent thinking changes what comes next—but AI text production lacks this duration-in-reflection despite appearing sequentially composed.
LLMs maintain basic temporal competence in simple structured formats but generate temporally impossible relationships in long, open-ended contexts. This degradation tracks training data distribution and emerges as models rely on frequency heuristics rather than structured reasoning under complexity.
FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.
Supreme Court overruling benchmark (236 pairs) reveals era sensitivity: models perform worse on historical cases than modern ones. Root cause is training corpus over-representation of recent cases, creating shallower representations of older precedent.
Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length can succeed if the model was trained on similar instances.
Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.
Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.
By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.