How much of the internet is AI-generated now?
What share of newly published websites contain AI-generated or AI-assisted content, and what measurable changes does this cause across semantic diversity, sentiment, accuracy, and style?
An analysis of a representative sample of websites from the Internet Archive (2022-2025), classified with a state-of-the-art AI text detector, finds that "roughly 35% of newly published websites were classified as AI-generated or AI-assisted" by mid-2025, up from zero before ChatGPT's launch in late 2022. This is the first large-scale empirical baseline for a phenomenon previously discussed only through anecdote and speculation (the "Dead Internet Theory").
What the data shows:
- Semantic diversity correlates negatively with AI text prevalence — ideas converge as AI content grows
- Positive sentiment correlates positively with AI text prevalence — the internet gets more upbeat
- Factual accuracy shows no statistically significant change
- Stylistic diversity shows no statistically significant change
The perception gap. A user study found that the majority of US adults believe all four hypotheses (reduced semantic diversity, increased positive sentiment, decreased factual accuracy, decreased stylistic diversity). People who do not use AI or use it infrequently believe in the negative impacts more; those who hold negative views of AI believe more strongly in the hypotheses. The perception of harm exceeds the measured harm on two of four dimensions — but is validated on the other two. Public fear is neither paranoia nor prophecy; it is half right.
The semantic diversity finding is the key result. Stylistic diversity is preserved (the words vary), but semantic diversity declines. This mirrors the pattern described in "Do different AI models actually produce diverse outputs?": surface variation masks idea convergence. The internet is saying the same things in different ways.
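The contrast between surface variation and idea convergence can be made concrete with two toy metrics. The note does not say how the study measures either kind of diversity, so this sketch makes assumptions: semantic diversity is approximated as mean pairwise cosine distance between embedding vectors, and surface diversity as mean pairwise Jaccard distance between word sets. The embedding vectors are hand-written placeholders, not outputs of a real model.

```python
import itertools
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def jaccard_distance(a, b):
    """1 - |intersection| / |union| of two non-empty token collections."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)


def mean_pairwise(items, distance):
    """Average distance over all unordered pairs of items."""
    pairs = list(itertools.combinations(items, 2))
    return sum(distance(x, y) for x, y in pairs) / len(pairs)


# Three paraphrases of one idea: different words, same meaning.
texts = [
    "remote work boosts productivity for many teams",
    "working from home often makes staff more efficient",
    "telecommuting tends to raise employee output",
]
# PLACEHOLDER embeddings (assumed, hand-written): nearly identical vectors
# stand in for what a sentence-embedding model might return for paraphrases.
embeddings = [[0.90, 0.10, 0.05], [0.88, 0.12, 0.06], [0.91, 0.09, 0.04]]

surface_diversity = mean_pairwise([t.split() for t in texts], jaccard_distance)
semantic_diversity = mean_pairwise(embeddings, cosine_distance)
print(f"surface = {surface_diversity:.3f}, semantic = {semantic_diversity:.5f}")
```

In this toy case the word sets do not overlap at all, so surface diversity is at its maximum while the semantic score is close to zero.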
Connection to model collapse. Building on "Does training on AI-generated content permanently degrade model quality?", the 35% AI content baseline establishes the starting condition for recursive degradation. If future models train on web crawls that are already about one-third AI-generated, the loss of the distribution's tails accelerates. The semantic diversity decline measured here may be the early empirical signal of model collapse manifesting in the wild rather than in lab experiments.
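The tail-loss mechanism can be sketched with a toy simulation. This is an illustrative simplification, not the study's method or a model of real training: each "generation" draws a finite sample from the previous generation's empirical distribution, so any category that receives zero draws disappears for good. The names `resample_generations` and `idea_N`, the Zipf-like starting counts, and the sample size are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def resample_generations(initial_counts, sample_size, generations, seed=0):
    """Return the number of distinct categories surviving after each generation."""
    rng = random.Random(seed)
    counts = Counter(initial_counts)
    distinct = [len(counts)]
    for _ in range(generations):
        categories = list(counts)
        weights = [counts[c] for c in categories]
        # The next generation only sees a finite sample of the previous one.
        sample = rng.choices(categories, weights=weights, k=sample_size)
        counts = Counter(sample)
        distinct.append(len(counts))
    return distinct


# Zipf-like toy distribution: a few common "ideas" and a long tail of rare ones.
initial = {f"idea_{rank}": 1000 // rank for rank in range(1, 201)}

history = resample_generations(initial, sample_size=500, generations=10)
print(history)
```

Because a resample can never introduce a category that the previous generation lacked, the count of distinct categories can only stay flat or shrink, and the rare categories are the first to go.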
The positive sentiment bias confirms what the homogeneity research predicts: AI output defaults to agreeable, constructive, and upbeat framing. Read alongside "Does AI homogenize culture the way mass media did?", the sentiment shift represents the AI culture industry's affective signature: systematically positive, systematically inoffensive, systematically unremarkable.
The factual accuracy non-finding is surprising given hallucination concerns but may reflect selection effects: AI-generated websites that contain obvious factual errors may be less likely to persist in the archive, or factual accuracy may be domain-dependent in ways the aggregate measure misses.
Inquiring lines that use this note as a source
This note is a source for these synthesized inquiries.
- Can AI text detectors reliably identify AI-generated websites?
- Why can't algorithms distinguish between human and AI generated content quality?
- Do AI-generated posts crowd out human voices without any coordination or intent?
- How does AI content generation at scale threaten online trust and authenticity?
Related concepts in this collection
- Do different AI models actually produce diverse outputs?
  Explores whether using multiple different language models together creates genuine diversity or whether shared training and alignment cause them to converge on similar answers despite independence.
  semantic convergence despite stylistic variety; the mechanism behind declining semantic diversity
- Does training on AI-generated content permanently degrade model quality?
  When generative models train on outputs from previous models, do the resulting models lose rare patterns permanently? The question matters because future training data will inevitably contain synthetic content.
  35% AI content is the baseline for recursive degradation
- Does AI homogenize culture the way mass media did?
  If AI generates contextually unique outputs, how can its underlying form be homogeneous? This explores whether AI repeats the culture industry's pattern of suppressing novelty under the guise of variety.
  positive sentiment bias as affective signature of the AI culture industry
- Can humans detect AI text if machines can measure it?
  AI-generated text shows measurable differences from human writing across multiple linguistic dimensions, yet human judges consistently fail to identify it. Why does the gap between what is measurable and what is perceptible exist?
  the detection gap: text is statistically distinguishable but pragmatically indistinguishable
- Why do fake news detectors flag AI-generated truthful content?
  Fake news detectors may systematically misclassify LLM-generated text as deceptive. We explore whether this bias stems from detecting AI style rather than actual falsehood, and what that means for detection accuracy.
  AI detection as proxy for style detection, not truth detection
Related papers in this collection
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- The Impact of AI-Generated Text on the Internet
- Thousands of AI Authors on the Future of AI
- StoryScope: Investigating idiosyncrasies in AI fiction
- Linguistic markers of inherently false AI communication and intentionally false human communication: Evidence from hotel reviews
- AI Enters Public Discourse: A Habermasian Assessment Of The Moral Status Of Large Language Models
- News Source Citing Patterns in AI Search Systems
- Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers
- Cognitive Effects in Large Language Models
Original note title
35 percent of new websites are AI-generated by mid-2025 — semantic diversity declines and positive sentiment rises but factual accuracy and stylistic diversity are unaffected