Does statistical rarity actually correlate with originality that law should protect?

This explores whether 'statistically rare' and 'legally original' are the same thing — whether measuring how unusual a text is in some feature space can stand in for the human authorship copyright law actually protects.

The corpus suggests that statistical rarity is a useful proxy for the kind of originality law is meant to protect, but the proxy should not be mistaken for the thing. The strongest case for yes comes from StoryScope, which operationalizes originality as rarity in the space of discourse-level narrative choices and finds that human stories occupy measurably rarer regions while AI outputs cluster tightly (Can statistical rarity measure whether stories are truly original?). That's a real signal: it gives copyright's fuzzy 'human conception' requirement something measurable to point at, and it lines up with the observation that independent models converge on similar outputs despite nominal competition, homogenizing culture in ways invisible to any single user (Does AI homogenize culture the way mass media did?).
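To make "rarity in a feature space" concrete, here is a minimal sketch that scores a text's feature vector by its mean distance to its nearest neighbours in a reference corpus. This is not StoryScope's published method; the function name `knn_rarity`, the two-dimensional vectors, and the choice of k are all assumptions made for illustration.

```python
import math

def knn_rarity(point, reference, k=3):
    """Mean Euclidean distance from `point` to its k nearest reference vectors.

    Higher values mean the point sits in a sparser region of the space.
    Hypothetical helper, not taken from any cited paper.
    """
    if not reference:
        raise ValueError("reference corpus must not be empty")
    dists = sorted(math.dist(point, ref) for ref in reference)
    k = min(k, len(dists))
    return sum(dists[:k]) / k

# Toy "narrative feature" vectors; the values are invented for illustration.
reference = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (2.0, 2.0)]
clustered = (0.05, 0.05)  # sits inside the dense cluster
outlier = (3.0, 3.0)      # far from everything in the reference set

print(knn_rarity(clustered, reference))
print(knn_rarity(outlier, reference))
```

The score only says how far a point lies from its neighbours. It carries no information about why the point is far away, which is the limitation the rest of this inquiry turns on.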

But the corpus also shows rarity measuring things that have nothing to do with protectable originality. In curriculum training, rare data is treated as a sign of distributional weakness, a gap from the pre-training distribution to be patched, not as conceptual value (Does ordering training data by rarity actually improve language models?). In retrieval systems, rarity is a failure-mode detector, flagging where a model is likely to hallucinate about uncommon entities (Should RAG systems use model confidence or data rarity to trigger retrieval?). Same statistic, opposite meaning: there, being rare makes something a liability, not a contribution. So rarity alone can't tell you whether you're looking at a creative leap or a data hole.
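The retrieval note listed under Sources reports that confidence and rarity catch different failures and that hybrid triggers outperform either signal alone. A minimal sketch of such a trigger follows. The function name, the frequency count as a rarity measure, and both thresholds are hypothetical and are not drawn from the cited work.

```python
def should_retrieve(confidence, entity_frequency,
                    min_confidence=0.7, min_frequency=100):
    """Fire retrieval when either signal suggests the model may be unreliable.

    confidence: model's self-reported confidence in [0, 1] (assumed input).
    entity_frequency: how often the entity appears in some reference corpus
    (assumed input). Both thresholds are illustrative placeholders.
    """
    low_confidence = confidence < min_confidence
    rare_entity = entity_frequency < min_frequency
    return low_confidence or rare_entity

# Confident answer about a rare entity: only the rarity signal catches it.
print(should_retrieve(confidence=0.95, entity_frequency=3))
# Uncertain reasoning about a common entity: only confidence catches it.
print(should_retrieve(confidence=0.40, entity_frequency=10_000))
```

In this setting rarity is used purely as a risk flag. Nothing in the trigger treats an uncommon entity as valuable.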

The sharpest crack appears when you separate 'novel' from 'valuable.' LLM-generated research ideas can be rated significantly *more* novel than expert ideas (p<0.05) while scoring slightly lower on feasibility, because expert knowledge constrains novelty toward what actually works (Do language models generate more novel research ideas than experts?). Rarity rewards the unconstrained wandering; the thing we usually mean by 'original and worth protecting' includes the discipline that makes rarity meaningful rather than merely odd. This is why structured novelty assessment (extract the claims, retrieve the prior art, compare) outperforms holistic 'how unusual does this feel' baselines in agreement with human reviewers (Can structured pipelines make LLM novelty assessment reliable?). Originality judgments humans trust are relational and contextual, not a single distance-from-the-mean number.
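The extract, retrieve, compare structure can be sketched as a toy pipeline. The version described in the source note uses LLMs at each stage. The sentence splitting, token overlap, and threshold below are simple stand-ins chosen for illustration, and every function name is hypothetical.

```python
import re

def tokens(text):
    """Lowercased word set used for a crude similarity measure."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def extract_claims(submission):
    # Stage 1 (toy): treat each sentence as one claim.
    parts = re.split(r"(?<=[.!?])\s+", submission)
    return [p.strip() for p in parts if p.strip()]

def retrieve(claim, prior_art, top_k=2):
    # Stage 2 (toy): rank prior work by token overlap with the claim.
    scored = [(jaccard(tokens(claim), tokens(doc)), doc) for doc in prior_art]
    return sorted(scored, reverse=True)[:top_k]

def compare(claim, matches, overlap_threshold=0.5):
    # Stage 3 (toy): novelty is judged relative to the retrieved prior art.
    best = matches[0][0] if matches else 0.0
    return {"claim": claim, "closest_overlap": best,
            "novel": best < overlap_threshold}

def assess(submission, prior_art):
    return [compare(c, retrieve(c, prior_art))
            for c in extract_claims(submission)]

prior_art = [
    "Curriculum learning orders training data by difficulty.",
    "Retrieval reduces hallucination on rare entities.",
]
submission = ("Retrieval reduces hallucination on rare entities. "
              "We order fine-tuning data by rarity instead.")
for result in assess(submission, prior_art):
    print(result)
```

The design point is that each verdict names the prior work it was compared against, so the judgment is relational rather than a single distance from the mean.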

There's also a deeper objection the corpus raises: law may protect something rarity *cannot see at all*. The argument that AI output carries only 'statistical residue' rather than the spirit of a giver locates authorship in provenance, the fact that a person made it, not in any property of the text itself (Why doesn't AI output carry the spirit of a giver?). On that view a statistically rare AI passage and a statistically common human one could land on opposite sides of the legal line from where a rarity metric would put them, because what's protected is the relationship, not the feature vector. The related claim that AI output is structurally hearsay, unattributable at the origin, pushes the same way: the thing legal tools are built to track is the chain back to a source, which rarity discards (Does AI-generated knowledge have the same structure as hearsay?).

So: rarity correlates with originality well enough to be a genuinely useful detector — especially for telling tightly-clustered machine output from the wider spread of human work — but it conflates creativity with distributional weirdness, rewards novelty unconstrained by value, and is blind to the provenance that may be what law actually protects. The interesting takeaway is that the best published proxy and the strongest critique of proxies live in the same collection, and they don't contradict so much as mark the boundary of what any single statistic can carry.

Sources (8 notes)

Can statistical rarity measure whether stories are truly original?

StoryScope operationalizes originality as statistical rarity in discourse-level narrative decisions. Human stories are measurably rarer in this space than AI outputs, which cluster tightly, offering a quantifiable proxy for the human conception copyright law requires.

Does AI homogenize culture the way mass media did?

AI mass-generates similar flows disguised as personalized outputs, suppressing novelty more deeply than pre-stamped commodities because contextual customization makes homogeneity invisible to individual users. Evidence: independent LLMs converge on similar outputs despite nominal competition.

Does ordering training data by rarity actually improve language models?

CTFT fine-tunes LLMs on rare data first because rarity signals distributional weakness, not conceptual difficulty. This reframes curriculum learning as managing distance from pre-training distribution rather than pedagogical scaffolding.

Should RAG systems use model confidence or data rarity to trigger retrieval?

Model confidence and data-rarity signals catch orthogonal failure modes: confidence misses hallucinations about rare entities, while rarity misses uncertain reasoning about common knowledge. Hybrid triggers substantially outperform either signal alone.

Do language models generate more novel research ideas than experts?

A study of 100+ NLP researchers found LLM-generated ideas rated as significantly more novel than human expert ideas (p<0.05), though slightly lower on feasibility. Expert knowledge constrains novelty, while LLMs explore wider conceptual combinations.

Can structured pipelines make LLM novelty assessment reliable?

A three-stage pipeline (extract claims, retrieve related work, compare) reached 86.5% reasoning alignment and 75.3% conclusion agreement with human reviewers on 182 ICLR submissions, outperforming holistic LLM baselines.

Why doesn't AI output carry the spirit of a giver?

AI-generated content lacks hau—the spiritual essence that binds gift economies—because no person gave it. This absence is more fundamental than alienation: the output was never anyone's to begin with, so no relationship of obligation forms.

Does AI-generated knowledge have the same structure as hearsay?

AI output shares all defining features of hearsay: testimony at remove, modification in retelling, unattributable origin, and unverifiability against stable sources. This means Enlightenment verification tools—citation, archiving, peer review, evidentiary chains—cannot process AI output by design.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a legal scholar and AI researcher testing whether statistical rarity is a durable proxy for originality in copyright doctrine. A curated library (2024–2026) has mapped the tension; your job is to judge what's held and what's cracked.

What a curated library found — and when (dated claims, not current truth):
• StoryScope (2026) shows human narratives occupy statistically rarer regions of discourse-space than AI outputs, which cluster tightly — offering a measurable signal for copyright's 'human conception' requirement.
• LLM-generated research ideas are rated more novel than expert ideas, but slightly lower on feasibility; rarity rewards unconstrained wandering, not disciplined originality (2024).
• Curriculum training treats statistical rarity as a distributional weakness to patch, not a sign of value (2026); retrieval systems treat it as a hallucination red flag (2025).
• Structured novelty assessment (extracting claims, retrieving prior art, comparing) reached 86.5% reasoning alignment and 75.3% conclusion agreement with human reviewers, outperforming holistic LLM baselines (2025).
• The provenance argument: law may protect authorship as a relationship (who made it) rather than a text property, making rarity legally blind to what actually matters (2025).

Anchor papers (verify; mind their dates):
• arXiv:2604.03136 (StoryScope, 2026)
• arXiv:2409.04109 (LLM novelty vs. feasibility, 2024)
• arXiv:2504.12320 (LLM creativity drift, 2025)
• arXiv:2507.20525 (Xeno Sutra: meaning & value in AI text, 2025)

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding above—especially the clustering claim and the feasibility gap—check whether 2026–present models, improved sampling strategies, or finer-grained evaluation have narrowed these gaps or dissolved them. Separate the durable question (Is rarity a proxy?) from perishable limitations (Does this model cluster tightly? Does this metric misread value?). Where a constraint has been relaxed, cite what relaxed it; where it still holds, say so plainly.
(2) SURFACE THE STRONGEST DISAGREEMENT in the last 6 months. Is there work arguing rarity *does* track protectable originality despite provenance concerns? Or work deepening the critique that rarity and value are orthogonal? Flag the contradiction.
(3) PROPOSE 2 research questions that assume the legal regime may have shifted: e.g., if courts adopt structured novelty assessment, how should copyright doctrine integrate retrieval-based comparison? Or: if provenance becomes the legal anchor, what replaces rarity as a feasibility heuristic for originality review?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
