How do different legal AI tools compare in accuracy across case eras?
This reads as two linked questions: whether the commercial legal AI tools (Lexis+ AI, Westlaw, etc.) differ in accuracy, and whether that accuracy shifts with the age of the cases. The corpus has strong evidence on each half separately, but no single study cross-tabulates each tool against each era.
The collection treats these as two findings rather than one matrix. On the tool-comparison side, a preregistered evaluation found that Lexis+ AI, Westlaw AI-Assisted Research, and Ask Practical Law AI all hallucinate between 17% and 33% of the time, despite being marketed as 'hallucination-free' (How often do legal AI tools actually hallucinate citations?). So the comparison across tools is less a leaderboard than a shared failure: the differences between vendors are smaller than the gap between what they claim and what they do. A key structural problem is that these are closed systems. You can't independently inspect their retrieval, which makes any clean accuracy comparison hard to run from the outside.
The 'case eras' half is where the more surprising finding lives. Tested on a Supreme Court overruling benchmark of 236 pairs, models perform systematically worse on historical cases than on modern ones (Why do language models struggle with historical legal cases?). The cause is mundane but consequential: training corpora over-represent recent cases, so the model's representation of older precedent is shallower. Accuracy is therefore not a fixed property of a tool; it drifts with the age of what you're asking about. A tool that looks reliable on a 2020 ruling can quietly degrade on a 1950 one, and nothing in the interface tells you that.
Put the two halves together and you get the real shape of the question. The interesting variation may be less *between* tools than *within* each tool across eras, a dimension none of the vendor comparisons foreground. If every tool leans on the same recency-skewed training data, era sensitivity could be a common-mode weakness that a head-to-head benchmark on modern cases would never expose.
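The tool-by-era cross-tabulation that the corpus lacks can be sketched in a few lines. This is a minimal illustration, not a reconstruction of any study's method: the tool names, the graded outcomes, and the 1970 cutoff between "historical" and "modern" are all hypothetical placeholders chosen only to show the shape of the table.

```python
from collections import defaultdict

# Hypothetical graded results: (tool, decision_year, answer_was_correct).
# Tool names and outcomes are placeholders, not measurements from any study.
RESULTS = [
    ("tool_a", 1952, False),
    ("tool_a", 1961, True),
    ("tool_a", 2015, True),
    ("tool_a", 2020, True),
    ("tool_b", 1948, False),
    ("tool_b", 1955, False),
    ("tool_b", 2018, True),
    ("tool_b", 2021, False),
]

ERA_CUTOFF = 1970  # assumed boundary between "historical" and "modern"


def era_of(year: int) -> str:
    """Bucket a decision year into one of two eras."""
    return "historical" if year < ERA_CUTOFF else "modern"


def accuracy_by_tool_and_era(results):
    """Return {(tool, era): fraction correct} for every observed cell."""
    counts = defaultdict(lambda: [0, 0])  # (tool, era) -> [correct, total]
    for tool, year, correct in results:
        cell = counts[(tool, era_of(year))]
        cell[0] += int(correct)
        cell[1] += 1
    return {key: c / n for key, (c, n) in counts.items()}


def era_gap(table, tool):
    """Modern minus historical accuracy for one tool."""
    return table[(tool, "modern")] - table[(tool, "historical")]


if __name__ == "__main__":
    table = accuracy_by_tool_and_era(RESULTS)
    for (tool, era), acc in sorted(table.items()):
        print(f"{tool:8s} {era:10s} {acc:.2f}")
    for tool in ("tool_a", "tool_b"):
        print(f"{tool} era gap: {era_gap(table, tool):+.2f}")
```

Reporting the per-tool era gap alongside the per-cell accuracy keeps the within-tool variation visible, which a single modern-case leaderboard number would hide.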
Why this is hard to fix points to deeper material in the corpus. One argument holds that AI-generated knowledge is structurally hearsay (testimony at a remove, modified in each retelling, unverifiable against a stable source), so the very tools law uses to check citations, such as the evidentiary chain and the archive, can't process AI output by design (Does AI-generated knowledge have the same structure as hearsay?). That reframes hallucination not as a bug to patch but as a property of how these systems relate to sources, which would leave older, thinly represented precedent most at risk.
For where the field is heading, the corpus offers two doorways. Rationale-driven evidence selection, in which the model explains *why* a passage is relevant rather than just matching it by similarity, achieved 33% better accuracy than similarity re-ranking with 50% fewer chunks, tested partly in the legal domain (Can rationale-driven selection beat similarity re-ranking for evidence?). And formal argumentation frameworks structure outputs as contestable attack/defense graphs, so a user can point to the exact premise they reject instead of distrusting the whole answer (Can formal argumentation make AI decisions truly contestable?). Both are bets that the route to trustworthy legal AI runs through verifiability and contestability, not through a better score on a modern-case benchmark.
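To make the attack/defense idea concrete, here is a minimal sketch of a Dung-style argumentation framework evaluated under grounded semantics. Grounded semantics is one of several standard choices and is an assumption here; the corpus note does not say which semantics it has in mind. The argument labels (`claim`, `objection`, `rebuttal`, `counter`) are hypothetical.

```python
def grounded_extension(arguments, attacks):
    """Least fixed point of Dung's characteristic function.

    arguments: set of argument labels
    attacks:   set of (attacker, target) pairs
    """
    attackers = {a: set() for a in arguments}
    for src, dst in attacks:
        attackers[dst].add(src)

    accepted = set()
    while True:
        # An argument is defended if each of its attackers is itself
        # attacked by some already-accepted argument. Unattacked
        # arguments are defended trivially.
        defended = {
            a for a in arguments
            if all(attackers[b] & accepted for b in attackers[a])
        }
        if defended == accepted:
            return accepted
        accepted = defended


# Hypothetical graph: an objection attacks the claim, a rebuttal attacks
# the objection.
ARGS = {"claim", "objection", "rebuttal"}
ATTACKS = {("objection", "claim"), ("rebuttal", "objection")}

if __name__ == "__main__":
    print(sorted(grounded_extension(ARGS, ATTACKS)))
    # A user who rejects the rebuttal adds one counter-argument against it.
    contested = grounded_extension(
        ARGS | {"counter"}, ATTACKS | {("counter", "rebuttal")}
    )
    print(sorted(contested))
```

In the first graph the claim survives because its only attacker is defeated by the rebuttal. Once the user contests that single premise, the claim drops out of the accepted set, which is the kind of targeted disagreement a flat text answer cannot express.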
Sources (5 notes)
How often do legal AI tools actually hallucinate citations? A preregistered evaluation found that Lexis+ AI, Westlaw AI-Assisted Research, and Ask Practical Law AI hallucinate between 17% and 33% of the time, far higher than vendors claim. Closed-system design prevents independent verification and accountability.
Why do language models struggle with historical legal cases? A Supreme Court overruling benchmark (236 pairs) reveals era sensitivity: models perform worse on historical cases than modern ones. The root cause is training-corpus over-representation of recent cases, creating shallower representations of older precedent.
Does AI-generated knowledge have the same structure as hearsay? AI output shares all defining features of hearsay: testimony at a remove, modification in retelling, unattributable origin, and unverifiability against stable sources. This means Enlightenment verification tools (citation, archiving, peer review, evidentiary chains) cannot process AI output by design.
Can rationale-driven selection beat similarity re-ranking for evidence? METEORA uses LLM-generated rationales with flagging instructions to select evidence, achieving 33% better accuracy with 50% fewer chunks than similarity re-ranking across legal, financial, and academic domains. The method also improves adversarial robustness substantially.
Can formal argumentation make AI decisions truly contestable? Dung-style argumentation structures AI outputs as traversable attack/defense graphs, allowing users to identify and contest specific premises. Standard LLM outputs lack this structure, making it impossible to pinpoint which claims users actually reject.