Why do human stories land in statistically rarer regions than AI narratives?
This explores why AI-generated stories cluster toward the statistically 'expected' middle while human stories reach into low-probability territory — and what in the machinery pushes each toward its region.
The short version the corpus suggests: AI narratives land where they do *by construction*, because next-token prediction is a machine for finding the most likely continuation, and the most likely continuation is, almost by definition, the least surprising one.
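To make "most likely means least surprising" concrete, here is a minimal sketch using a toy next-word distribution. The candidate words and their probabilities are invented for illustration, and always taking the top candidate (greedy decoding) is a simplifying assumption: deployed models usually sample, which softens the pull toward the mode without removing it.

```python
import math

# Hypothetical next-word probabilities after a prompt such as
# "After the storm, the sailors went ...". All numbers are invented.
next_word_probs = {
    "home": 0.55,
    "away": 0.30,
    "sideways": 0.10,
    "into the sea": 0.05,
}


def surprisal_bits(p: float) -> float:
    """Surprisal of an outcome with probability p, in bits."""
    return -math.log2(p)


def greedy_choice(probs: dict) -> str:
    """Return the single most probable continuation."""
    return max(probs, key=probs.get)


choice = greedy_choice(next_word_probs)
surprisals = {word: surprisal_bits(p) for word, p in next_word_probs.items()}

# The greedy pick is, by construction, the entry with the lowest surprisal.
for word, bits in sorted(surprisals.items(), key=lambda kv: kv[1]):
    marker = "<- greedy pick" if word == choice else ""
    print(f"{word:>14}  {bits:5.2f} bits  {marker}")
```

The rare continuations are still in the distribution. They are simply the ones a likelihood-seeking decoder reaches for last.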
The clearest tell is in how models handle the shape of a story rather than its sentences. When AI fiction is stripped of all stylistic cues, it is still separable from human writing at about 93% accuracy on discourse-level features alone: character agency, chronological ordering, plot structure (see "Can AI stories be detected without analyzing writing style?"). Those structural choices skew toward tidy single-track plots, over-explained themes, and an avoidance of moral ambiguity, while human stories lean on nonlinear time and unresolved tension (see "Do AI stories explain their themes more than human stories do?"). A linear, explained, morally clean plot is the high-probability path; ambiguity and temporal scrambling are the rare detours. AI takes the well-trodden road because that's what 'most likely' means.
The sharpest piece of evidence sits in event cognition: GPT-3 segments a narrative into events *closer to the averaged human consensus than any individual human annotator does* (see "Do language models segment events like human consensus does?"). Read that as the whole answer in miniature. The model isn't matching a person; it's matching the centroid of all people. Any single human is an outlier from that average; the model is the average made fluent. So human stories land in rarer regions not because humans are trying to be original, but because every individual deviates from the consensus the model is trained to reproduce.
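The centroid argument can be checked on a toy example. The sketch below uses invented boundary annotations (four hypothetical annotators, eight candidate positions) and a hypothetical "model" that marks only the boundaries most annotators share. It is not the study's data or method, only an illustration of why something that tracks the shared signal correlates with the average better than any one person does.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


# Invented data: 1 means "an event boundary here". Most annotators mark
# positions 2 and 5, and each one also has an idiosyncrasy of their own.
annotators = [
    [1, 0, 1, 0, 0, 1, 0, 0],  # adds an extra boundary at position 0
    [0, 0, 1, 0, 0, 1, 0, 1],  # adds an extra boundary at position 7
    [0, 0, 1, 1, 0, 0, 0, 0],  # misses position 5, adds position 3
    [0, 1, 0, 0, 0, 1, 0, 0],  # misses position 2, adds position 1
]

# Hypothetical model output: only the widely shared boundaries.
model = [0, 0, 1, 0, 0, 1, 0, 0]

# Consensus = per-position average across annotators.
consensus = [sum(col) / len(annotators) for col in zip(*annotators)]

model_r = pearson(model, consensus)
annotator_rs = [pearson(a, consensus) for a in annotators]

print(f"model vs consensus: {model_r:.3f}")
for i, r in enumerate(annotator_rs, start=1):
    print(f"annotator {i} vs consensus: {r:.3f}")
```

Note that the consensus here even includes each annotator's own marks, which favors the humans, and the shared-signal "model" still comes out ahead. Each person's idiosyncrasies are exactly what averaging removes.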
What's surprising is that this gap is *widening*, not closing. Newer models diverge further from human lexical patterns even as they become harder for judges to catch (see "Why do newer AI models diverge further from human writing patterns?"). The likely reason points back to the training objective: RLHF appears to optimize for what raters score as high-quality, not for what looks human, and 'high quality' to a rater is smooth, coherent, legible, which is exactly the statistical center. The models are being actively pulled toward the safe region while learning to disguise that they're there.
And the gap propagates downstream because almost nobody pushes the text back out toward the rare regions: writers edit AI paragraphs only 23% of the time, with edits averaging 96% similarity to the original (see "Do writers actually edit AI-generated text before publishing?"). So the centered voice reaches readers largely uncorrected. If you want to go one layer deeper, there's a related claim that AI output lacks the internal *appeal to a reader's attention* that human communication performs (see "Does AI writing lack the internal appeal to attention that humans use?"). That is a hint that the rarity of human narrative isn't just statistical noise but a trace of someone actually reaching for a listener.
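For a sense of how light an edit has to be to score in the mid-90s on similarity, here is a rough sketch. It assumes a character-level similarity ratio from Python's standard `difflib`; the note does not say which metric the study used, so this is only one plausible reading, and both sentences are invented.

```python
from difflib import SequenceMatcher


def similarity(original: str, edited: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means identical strings."""
    # autojunk=False switches off difflib's heuristic for long sequences,
    # so the ratio depends only on the matching characters.
    return SequenceMatcher(None, original, edited, autojunk=False).ratio()


# Invented example: an "edit" that swaps a single word.
original = (
    "The storm passed in the night, and by morning the village had "
    "returned to its routines. Nobody spoke of what the water had taken."
)
edited = original.replace("returned", "gone back")

score = similarity(original, edited)
print(f"similarity after a one-word edit: {score:.2%}")
```

Under this assumed metric, a one-word substitution in a two-sentence paragraph already lands above 90%, which suggests that an average of 96% corresponds to touch-ups rather than rewrites.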
Sources (6 notes)
StoryScope achieved 93.2% accuracy separating AI from human fiction using only discourse-level features like character agency and chronological structure, retaining 97% of performance while eliminating stylistic cues. These structural choices resist humanization because they require rewrites, not surface edits.
Analysis of 304 narrative features reduced to 30 core signals shows AI fiction systematically over-explains themes, uses tidy single-track plots, and avoids moral ambiguity, while human stories employ temporal complexity and nonlinear structure. This pattern holds across all five major LLMs tested.
GPT-3's event boundaries correlate more strongly with averaged human annotations than individual human annotators do. This suggests language models may pre-compute statistical consensus through training on diverse text, or that next-token prediction parallels human event cognition.
ChatGPT-4.5 and o4-mini show greater lexical diversity differences from human text than earlier models, yet human judges cannot reliably distinguish them. Training objectives like RLHF appear to optimize for quality ratings rather than human-like writing patterns.
Writers edited AI-generated paragraphs only 23% of the time, with edits averaging 96% similarity to the original. This means AI's opinionated and distorted voice propagates with minimal human filtering before publication.
Human writing contains an appeal to the reader's attention as a fundamental property of communication itself. AI-generated posts inherit platform visibility but do not perform this internal appeal, producing the reported aloofness readers perceive — a structural absence, not a stylistic defect.