How do different legal AI tools compare in accuracy across case eras?

This reads as two linked questions: whether the commercial legal AI tools (Lexis+ AI, Westlaw AI-Assisted Research, and others) differ in accuracy, and whether that accuracy shifts with the age of the cases. The corpus has strong evidence on each half separately, but no single study cross-tabulates each tool against each era.

On the tool-comparison side, a preregistered evaluation found that Lexis+ AI, Westlaw AI-Assisted Research, and Ask Practical Law AI all hallucinate citations between 17% and 33% of the time, despite being marketed as 'hallucination-free' (see "How often do legal AI tools actually hallucinate citations?"). So the comparison across tools is less a leaderboard than a shared failure: the differences between vendors are smaller than the gap between what they claim and what they do. A key structural problem is that these are closed systems. Their retrieval cannot be inspected independently, which makes any clean accuracy comparison hard to run from the outside.

The 'case eras' half is where the more surprising finding lives. Tested on a Supreme Court overruling benchmark (236 pairs), models perform systematically worse on historical cases than on modern ones (see "Why do language models struggle with historical legal cases?"). The cause is mundane but consequential: training corpora over-represent recent cases, so the model's representation of older precedent is shallower. This means accuracy isn't a fixed property of a tool; it drifts with the age of what you're asking about. A tool that looks reliable on a 2020 ruling can quietly degrade on a 1950 one, and nothing in the interface tells you that.

Put the two halves together and you get the real shape of the question. The interesting variation may be less *between* tools than *within* each tool across eras — a dimension none of the vendor comparisons foreground. If every tool leans on the same recency-skewed training data, era sensitivity could be a common-mode weakness that a head-to-head benchmark on modern cases would never expose.
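The missing tool-by-era matrix is simple to compute once per-query results exist. The sketch below is illustrative only: the tool names, the records, and the 1990 cutoff between "historical" and "modern" are all placeholder assumptions, not data or definitions from any of the cited studies.

```python
from collections import defaultdict

# Hypothetical evaluation records: (tool, decision_year, answer_was_correct).
# Tool names and outcomes are placeholders, not measured results.
records = [
    ("tool_a", 1950, False),
    ("tool_a", 1955, True),
    ("tool_a", 2019, True),
    ("tool_a", 2021, True),
    ("tool_b", 1948, False),
    ("tool_b", 1961, False),
    ("tool_b", 2018, True),
    ("tool_b", 2020, False),
]


def era_of(year, cutoff=1990):
    """Bucket a case by decision year; the 1990 cutoff is an arbitrary assumption."""
    return "historical" if year < cutoff else "modern"


def accuracy_by_tool_and_era(rows, cutoff=1990):
    """Return {(tool, era): (accuracy, n)} for every cell that has data."""
    counts = defaultdict(lambda: [0, 0])  # (tool, era) -> [correct, total]
    for tool, year, correct in rows:
        cell = counts[(tool, era_of(year, cutoff))]
        cell[0] += int(correct)
        cell[1] += 1
    return {key: (c / n, n) for key, (c, n) in counts.items()}


def era_gap(table, tool):
    """Modern accuracy minus historical accuracy for one tool."""
    return table[(tool, "modern")][0] - table[(tool, "historical")][0]


table = accuracy_by_tool_and_era(records)
for (tool, era), (acc, n) in sorted(table.items()):
    print(f"{tool:8s} {era:10s} accuracy={acc:.2f} n={n}")
for tool in ("tool_a", "tool_b"):
    print(f"{tool} era gap: {era_gap(table, tool):.2f}")
```

Reporting the cell size n alongside each accuracy is deliberate: splitting a benchmark by both tool and era shrinks every cell, so a within-tool era gap should be read against how few cases sit behind it.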

The reason this is hard to fix points to deeper material in the corpus. One argument holds that AI-generated knowledge is structurally hearsay (testimony at a remove, modified in each retelling, unverifiable against a stable source), so the very tools law uses to check citations, such as the evidentiary chain and the archive, can't process AI output by design (see "Does AI-generated knowledge have the same structure as hearsay?"). That reframes hallucination not as a bug to patch but as a property of how these systems relate to sources, which is exactly why older, thinly represented precedent is most at risk.

For where the field is heading, the corpus offers two doorways. Rationale-driven evidence selection, which has the model explain *why* a passage is relevant rather than just matching it by similarity, achieved 33% better accuracy than similarity re-ranking with half as many chunks, in tests spanning legal, financial, and academic domains (see "Can rationale-driven selection beat similarity re-ranking for evidence?"). Formal argumentation frameworks structure outputs as contestable attack/defense graphs, so a user can point to the exact premise they reject instead of distrusting the whole answer (see "Can formal argumentation make AI decisions truly contestable?"). Both are bets that the route to trustworthy legal AI runs through verifiability and contestability, not through a better score on a modern-case benchmark.
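To make the attack/defense idea concrete, here is a minimal sketch of a Dung-style abstract argumentation framework evaluated under grounded semantics. The choice of grounded semantics, the function name, and the three-argument legal scenario are assumptions made for illustration; the cited work may use a different semantics or a richer framework.

```python
def grounded_extension(arguments, attacks):
    """Compute the grounded extension of an abstract argumentation framework.

    arguments: iterable of argument labels.
    attacks: set of (attacker, target) pairs.
    """
    arguments = set(arguments)
    attackers_of = {a: {x for (x, y) in attacks if y == a} for a in arguments}
    accepted, rejected = set(), set()
    changed = True
    while changed:
        changed = False
        for a in arguments - accepted - rejected:
            # Accept an argument once every one of its attackers is rejected.
            if attackers_of[a] <= rejected:
                accepted.add(a)
                changed = True
        # Reject anything attacked by an accepted argument.
        for (x, y) in attacks:
            if x in accepted and y not in rejected:
                rejected.add(y)
                changed = True
    return accepted


# Hypothetical scenario (labels are illustrative, not real cases):
#   A: "Case X supports the claim"
#   B: "Case X was overruled by Case Y"           (attacks A)
#   C: "Case Y does not reach the facts of Case X" (attacks B)
args = {"A", "B", "C"}
atts = {("B", "A"), ("C", "B")}
print(sorted(grounded_extension(args, atts)))

# A user who rejects premise C adds a counter-argument D that attacks it.
print(sorted(grounded_extension(args | {"D"}, atts | {("D", "C")})))
```

In the first run the defense C defeats the overruling argument B, so the original claim A stands. Once the user contests C with D, the overruling argument B is reinstated and A falls. The point of the structure is that the disagreement lands on one named premise instead of on the answer as a whole.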

Sources (5 notes)

How often do legal AI tools actually hallucinate citations?

A preregistered evaluation found that Lexis+ AI, Westlaw AI-Assisted Research, and Ask Practical Law AI hallucinate between 17% and 33% of the time—far higher than vendors claim. Closed-system design prevents independent verification and accountability.

Why do language models struggle with historical legal cases?

Supreme Court overruling benchmark (236 pairs) reveals era sensitivity: models perform worse on historical cases than modern ones. Root cause is training corpus over-representation of recent cases, creating shallower representations of older precedent.

Does AI-generated knowledge have the same structure as hearsay?

AI output shares all defining features of hearsay: testimony at remove, modification in retelling, unattributable origin, and unverifiability against stable sources. This means Enlightenment verification tools—citation, archiving, peer review, evidentiary chains—cannot process AI output by design.

Can rationale-driven selection beat similarity re-ranking for evidence?

METEORA uses LLM-generated rationales with flagging instructions to select evidence, achieving 33% better accuracy with 50% fewer chunks than similarity re-ranking across legal, financial, and academic domains. The method also improves adversarial robustness substantially.

Can formal argumentation make AI decisions truly contestable?

Dung-style argumentation structures AI outputs as traversable attack/defense graphs, allowing users to identify and contest specific premises. Standard LLM outputs lack this structure, making it impossible to pinpoint which claims users actually reject.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a legal AI systems analyst. The question remains open: Do different legal AI tools genuinely differ in accuracy, or do they share common failure modes that erase tool-level variation — especially across case eras?

What a curated library found — and when (dated claims, not current truth):

Findings span 2023–2026; treat as perishable constraints:
• Lexis+ AI, Westlaw AI-Assisted Research, and Ask Practical Law AI all hallucinate citations at 17–33% rates despite 'hallucination-free' claims (arXiv:2405.20362, 2024-05).
• Models perform systematically worse on historical Supreme Court cases than modern ones, driven by training-corpus recency bias (arXiv:2510.20941, 2025-10).
• Rationale-driven evidence selection beats similarity re-ranking by 33% with half the retrieval chunks (arXiv:2505.16014, 2025-05).
• Argumentative frameworks using attack/defense graphs enable contestable, verifiable outputs (arXiv:2405.02079, 2024-05).
• LLM-as-judge reliability itself is under scrutiny across bias and consistency (arXiv:2412.12509, 2024-12).

Anchor papers (verify; mind their dates):
- arXiv:2405.20362 (Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools, 2024-05)
- arXiv:2510.20941 (Do LLMs Truly Understand When a Precedent Is Overruled?, 2025-10)
- arXiv:2505.16014 (Ranking Free RAG: Replacing Re-ranking with Selection in RAG for Sensitive Domains, 2025-05)
- arXiv:2405.02079 (Argumentative Large Language Models for Explainable and Contestable Decision-Making, 2024-05)

Your task:
(1) RE-TEST EACH CONSTRAINT. For the 17–33% hallucination rate and era-sensitivity findings: Have newer training pipelines, domain-specific RAG architectures, or fine-tuning on legal corpora (esp. historical precedent) since resolved the recency bias? Has competitive pressure driven vendors to publish independent audits? Separate the durable question (can closed-system legal AI be trusted across eras?) from perishable limitations (may be fixable via corpus balancing or retrieval hardening). Cite what resolves it.
(2) Surface the strongest contradicting or superseding work from the last ~6 months. Does any recent paper show that era-sensitivity is a solvable engineering problem, not a structural one?
(3) Propose 2 research questions that assume the regime may have moved: (a) If rationale-driven selection now dominates, does explaining *why* a historical case is cited reduce hallucination on older precedent? (b) Can argumentative frameworks expose when a model conflates overruled vs. live precedent?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
