INQUIRING LINE

How do changes in human and AI writing distributions shift rarity measures over time?

This explores what 'rarity' means when both human and machine writing are moving targets — how originality measured as statistical rarity in a feature space shifts as AI outputs cluster, models change generation to generation, and that text feeds back into the distributions we measure against.


Rarity is a shaky yardstick when the things being measured, human and AI writing, are themselves drifting. The corpus treats rarity in two distinct senses, and the interest lies in holding both at once. In one sense, rarity is a marker of value: StoryScope operationalizes originality as statistical rarity in a feature space of narrative decisions, finding that human stories occupy rarer regions while AI outputs cluster tightly together (Can statistical rarity measure whether stories are truly original?). In the other sense, rarity is a signal of a model's weakness: curriculum textual-frequency training (CTFT) deliberately feeds models rare data first, because rarity marks the distance between a text and the model's pre-training distribution (Does ordering training data by rarity actually improve language models?). The same word carries two opposite valences, and that tension is exactly what shifts over time.
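As a rough illustration of the first sense, rarity can be scored as average surprisal against a reference corpus. The sketch below is not StoryScope's method: the feature labels, the Laplace smoothing, and the `make_rarity_scorer` helper are all assumptions made for illustration.

```python
import math
from collections import Counter

def make_rarity_scorer(reference_stories, smoothing=1.0):
    """Build a scorer returning the mean surprisal (bits) of a story's features.

    reference_stories: iterable of feature sets, one per story.
    The feature labels used here are hypothetical, not StoryScope's.
    """
    reference_stories = [set(s) for s in reference_stories]
    n = len(reference_stories)
    # Document frequency: how many reference stories contain each feature.
    doc_freq = Counter(f for story in reference_stories for f in story)

    def score(story):
        feats = set(story)
        if not feats:
            return 0.0
        total = 0.0
        for f in feats:
            # Laplace-smoothed probability keeps unseen features finite.
            p = (doc_freq[f] + smoothing) / (n + 2 * smoothing)
            total += -math.log2(p)
        return total / len(feats)

    return score

# Toy reference corpus (assumed data, for illustration only).
reference = [
    {"linear_chronology", "passive_protagonist"},
    {"linear_chronology"},
    {"linear_chronology", "unreliable_narrator"},
    {"linear_chronology"},
]
score = make_rarity_scorer(reference)
common = score({"linear_chronology"})
rare = score({"unreliable_narrator", "nonlinear_chronology"})
print(f"common={common:.2f} bits, rare={rare:.2f} bits")
```

Because the score depends entirely on `reference`, the same story receives a different rarity whenever the reference corpus drifts, which is the time-sensitivity at issue here.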

The moving-target problem is concrete. Newer model generations don't converge toward human lexical patterns; they diverge further from them, with ChatGPT-4.5 and o4-mini showing larger measurable gaps than earlier models even as they become harder for people to spot (Why do newer AI models diverge further from human writing patterns?). So a rarity metric calibrated on yesterday's models is already stale: the distributional fingerprint keeps relocating. Six-dimension analyses confirm the gap is robust and statistically significant, spanning vocabulary volume, abundance, variety, evenness, disparity, and dispersion, yet trained linguists can't reliably perceive it (Can human judges detect measurable differences in AI text?; Can humans detect AI text if machines can measure it?). Machines can track the drift; humans can't feel it.
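To make the lexical dimensions less abstract, here is a minimal sketch of three of them using common textbook stand-ins: token count for volume, type-token ratio for variety, and normalized Shannon entropy for evenness. These definitions are assumptions; the cited analyses may define the dimensions differently, and `lexical_profile` is a hypothetical helper.

```python
import math
from collections import Counter

def lexical_profile(tokens):
    """Return simple stand-in measures for volume, variety, and evenness."""
    counts = Counter(tokens)
    volume = len(tokens)   # total tokens
    types = len(counts)    # distinct tokens
    if volume == 0:
        return {"volume": 0, "variety": 0.0, "evenness": 0.0}
    variety = types / volume  # type-token ratio
    # Shannon entropy of the token distribution (natural log).
    entropy = -sum((c / volume) * math.log(c / volume) for c in counts.values())
    # Normalize by the maximum entropy for this many types; a single type
    # is treated as perfectly even by convention in this sketch.
    evenness = entropy / math.log(types) if types > 1 else 1.0
    return {"volume": volume, "variety": variety, "evenness": evenness}

skewed = lexical_profile("the the the cat".split())
balanced = lexical_profile("the cat".split())
print(skewed)
print(balanced)
```

Comparing such profiles between a human sample and outputs from successive model generations is one way to see whether a gap measured on older models still holds for newer ones.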

The less obvious problem is that the measurement is contaminated by feedback. Writers edit AI-generated text only 23% of the time, and when they do, the edited text stays 96% similar to the original on average, so AI's distinctive distributional signature propagates into the published record almost unfiltered (Do writers actually edit AI-generated text before publishing?). As that text accumulates in the human corpus, the 'human' baseline that rarity is measured against starts absorbing the AI distribution it was supposed to contrast with. Tomorrow's rare region may be today's machine output, smoothed into the background.
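A back-of-the-envelope sketch of this feedback effect follows. It rests on two loud assumptions: that "96% similar" can be read as roughly 96% of the text being preserved, and that the baseline is a simple mixture of human and AI feature frequencies. The probabilities 0.05 and 0.40 are made-up illustrative values, and both function names are hypothetical.

```python
import math

def retained_ai_fraction(edit_rate=0.23, edited_similarity=0.96):
    """Expected share of AI text surviving to publication.

    Assumes unedited text survives whole and that similarity
    approximates the share preserved in edited text.
    """
    return (1 - edit_rate) * 1.0 + edit_rate * edited_similarity

def rarity_bits(p_human, p_ai, ai_share):
    """Surprisal (bits) of a feature against a mixed human/AI baseline."""
    p_mixed = (1 - ai_share) * p_human + ai_share * p_ai
    return -math.log2(p_mixed)

print(f"retained AI fraction: {retained_ai_fraction():.4f}")
# A feature that is uncommon in human text but typical of AI output
# (illustrative frequencies only).
for share in (0.0, 0.1, 0.3, 0.5):
    print(f"AI share {share:.1f}: {rarity_bits(0.05, 0.40, share):.2f} bits")
```

Under these assumptions, a feature typical of AI output looks steadily less rare as the AI share of the baseline grows, even though nothing about human writing itself has changed.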

What resists this drift is structure, not surface. StoryScope reaches 93.2% accuracy separating AI from human fiction using only discourse-level choices such as character agency and chronological structure, with stylistic cues discarded, because those structural decisions require rewrites rather than word-swaps to disguise (Can AI stories be detected without analyzing writing style?). Lexical rarity is an unstable signal, shifting with each model generation and blurring as edits paper over the seams; structural rarity holds because it lives below the layer where humanization happens. The deeper point is about why AI clusters in the first place: its token ordering is sequential but atemporal, with no reflective duration between choices (Does AI text generation unfold through temporal reflection?), and its outputs are essentially mutable, varying with sampling and prompt rather than committing to anything (Why does AI output change with every prompt and context?). So rarity over time isn't one clean trendline. It is a contest between surface statistics that relocate with every model generation and an underlying generative process that keeps outputs clustered no matter how the lexical numbers move.


Sources (9 notes)

Can statistical rarity measure whether stories are truly original?

StoryScope operationalizes originality as statistical rarity in discourse-level narrative decisions. Human stories are measurably rarer in this space than AI outputs, which cluster tightly, offering a quantifiable proxy for the human conception copyright law requires.

Does ordering training data by rarity actually improve language models?

CTFT fine-tunes LLMs on rare data first because rarity signals distributional weakness, not conceptual difficulty. This reframes curriculum learning as managing distance from pre-training distribution rather than pedagogical scaffolding.

Why do newer AI models diverge further from human writing patterns?

ChatGPT-4.5 and o4-mini show greater lexical diversity differences from human text than earlier models, yet human judges cannot reliably distinguish them. Training objectives like RLHF appear to optimize for quality ratings rather than human-like writing patterns.

Can human judges detect measurable differences in AI text?

Six-dimension MANOVA analysis confirms significant differences between ChatGPT and human writing across vocabulary volume, abundance, variety, evenness, disparity, and dispersion. Despite these robust statistical differences, human judges including linguists and NLP researchers fail to reliably distinguish AI from human text.

Can humans detect AI text if machines can measure it?

LLM-generated text differs significantly on six lexical diversity dimensions, confirmed through statistical analysis across multiple models. Yet human judges, including trained linguists, cannot reliably detect these differences—and newer models diverge further while becoming harder to spot.

Do writers actually edit AI-generated text before publishing?

Writers edited AI-generated paragraphs only 23% of the time, with edits averaging 96% similarity to the original. This means AI's opinionated and distorted voice propagates with minimal human filtering before publication.

Can AI stories be detected without analyzing writing style?

StoryScope achieved 93.2% accuracy separating AI from human fiction using only discourse-level features like character agency and chronological structure, retaining 97% of performance while eliminating stylistic cues. These structural choices resist humanization because they require rewrites, not surface edits.

Does AI text generation unfold through temporal reflection?

Token ordering in LLMs follows probabilistic selection without intervening reflection or revision. Human discourse gains meaning from temporal structure—time spent thinking changes what comes next—but AI text production lacks this duration-in-reflection despite appearing sequentially composed.

Why does AI output change with every prompt and context?

AI outputs exhibit essential mutability—they vary with sampling, prompt wording, and audience interpretation. This is not a defect but a defining feature of tokens as media, making them fundamentally different from fixed commodities and resistant to traditional quality assurance.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains open: *How do changes in human and AI writing distributions shift rarity measures over time?* A curated library of LLM research (2023–2026) found:

**What a curated library found — and when (dated claims, not current truth):**
- Newer LLM generations (ChatGPT-4.5, o4-mini) diverge further from human lexical patterns measurably, even as they become harder for people to detect. (~2025–2026)
- StoryScope operationalizes originality as statistical rarity in narrative feature space; human stories occupy rarer regions; AI outputs cluster tightly. (~2026)
- LLM text differs measurably on six lexical diversity dimensions (volume, abundance, variety, evenness, disparity, dispersion), but human judges, including trained linguists, cannot reliably perceive the difference. (~2025)
- Writers edit AI-generated text only 23% of the time; edits remain 96% similar to originals, allowing AI distributional signature to propagate into published corpora unfiltered. (~2026)
- Discourse-level narrative choices (character agency, chronological structure) separate AI from human fiction with about 93% accuracy; lexical rarity erodes but structural rarity holds. (~2026)
- AI token ordering is atemporal and probabilistic; outputs are mutable across sampling/prompt variation, fundamentally preventing convergence toward human commitment. (~2026)

**Anchor papers (verify; mind their dates):**
- arXiv:2508.00086 – Do LLMs produce texts with "human-like" lexical diversity? (2025)
- arXiv:2604.03136 – StoryScope: Investigating idiosyncrasies in AI fiction (2026)
- arXiv:2604.22503 – Measuring and Mitigating Persona Distortions from AI Writing Assistance (2026)
- arXiv:2305.17493 – The Curse Of Recursion: Training On Generated Data Makes Models Forget (2023)

**Your task:**
(1) RE-TEST: For each constraint above, determine whether post-2026 model releases, fine-tuning methods (instruction-tuning, RLHF variants), inference-time interventions (decoding constraints, structured generation), or evaluation methodologies have relaxed or overturned it. Separate the durable claim (rarity as a time-sensitive measure) from perishable limitations (lexical rarity as a reliable signal). What *still* holds?
(2) Surface the strongest *contradicting or superseding work* from the last ~6 months. Does any recent paper claim lexical or structural convergence, or show that corpus contamination doesn't degrade baselines as claimed?
(3) Propose 2 research questions that *assume* the regime has moved: e.g., "If structural rarity is now the only durable signal, what corpus-scale interventions could preserve it?" or "Does multi-modal AI writing (image+text) show the same atemporal clustering?"

Cite arXiv IDs. Flag anything you cannot ground in a real paper.
