Do different AI models actually produce diverse outputs?
Explores whether using multiple different language models together creates genuine diversity or whether shared training and alignment cause them to converge on similar answers despite independence.
INFINITY-CHAT studied 70+ open- and closed-source LLMs across 26K real-world open-ended queries that admit a wide range of plausible answers with no single ground truth. The findings reveal a pronounced "Artificial Hivemind" effect characterized by two distinct phenomena:
- Intra-model repetition: a single model consistently generates similar responses to the same prompt across runs.
- Inter-model homogeneity: different models independently produce strikingly similar outputs, sometimes verbatim. DeepSeek-V3 and GPT-4o, for example, generated overlapping phrases such as "Elevate your iPhone with our" and "sleek, without compromising." In some cases, models from the same family output identical responses.
The inter-model effect is the more concerning finding. Model ensembles — using multiple different models to increase diversity — may not yield true diversity when their constituents share overlapping alignment and training priors. The convergence is not just stylistic but substantive: models converge on the same ideas, not just the same words.
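One way to make the two phenomena concrete is to compare average pairwise similarity within one model's samples against similarity across two models' samples. The sketch below is illustrative only: it uses word-set Jaccard overlap as a crude lexical stand-in (this note does not specify the similarity measure INFINITY-CHAT used), and the function names and sample strings are hypothetical.

```python
from itertools import combinations, product


def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two responses (1.0 = same vocabulary)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def intra_model_similarity(responses: list[str]) -> float:
    """Mean similarity over all pairs of samples drawn from ONE model."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        raise ValueError("need at least two responses")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def inter_model_similarity(responses_a: list[str], responses_b: list[str]) -> float:
    """Mean similarity over all cross pairs between TWO models."""
    pairs = list(product(responses_a, responses_b))
    if not pairs:
        raise ValueError("need at least one response per model")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical samples for the same prompt (invented for illustration).
model_a = ["elevate your iphone with our sleek case",
           "elevate your iphone with our slim case"]
model_b = ["elevate your iphone with our sleek cover",
           "protect your iphone with our sleek cover"]

print("intra A:", round(intra_model_similarity(model_a), 3))
print("inter A/B:", round(inter_model_similarity(model_a, model_b), 3))
```

High intra-model scores indicate repetition. Inter-model scores that approach the intra-model ones would suggest that adding a second model contributes little extra variety. An embedding-based similarity would likely capture idea-level convergence better than word overlap does.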
This has direct implications for the False Punditry argument. Taken together with "Does polished AI output trick audiences into trusting it?", the hivemind effect means that AI-generated social media content will sound similar regardless of which model generates it. The "diversity" of AI voices on social media is illusory: different accounts using different models will tend to produce strikingly similar analysis, framing, and conclusions, creating a false consensus that looks like independent agreement.
Taken together with "Why do LLMs generate novel ideas from narrow ranges?", the hivemind effect extends from research ideas to all open-ended generation. The diversity collapse documented in research ideation is a specific instance of a general phenomenon: LLMs trained on overlapping data with similar alignment procedures converge on a shared distribution of outputs.
Recommendation as a concrete domain instance. LLM-based conversational recommender systems exhibit the hivemind in a specific, measurable way: "the most popular items such as The Shawshank Redemption appear around 5% of the time" across different recommendation datasets, and "the recommended popular items are similar across different datasets, which may reflect the item popularity in the pre-training corpus of LLMs" (Large Language Models as Zero-Shot Conversational Recommenders). The convergence is not on quality or relevance but on pretraining-distribution popularity — the same items surface regardless of the user's context or the dataset's actual popularity distribution. This is the hivemind effect translated from open-ended generation to decision-making: LLMs don't just write the same things, they recommend the same things.
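The two measurements described above (how often the single most popular item is recommended, and how similar the popular items are across datasets) can be sketched as follows. This is a minimal illustration with hypothetical function names and invented placeholder data, not the cited paper's evaluation code.

```python
from collections import Counter


def top_item_share(recommendations: list[str]) -> tuple[str, float]:
    """Most recommended item and its share of all recommendations.

    Assumes a non-empty list.
    """
    counts = Counter(recommendations)
    item, n = counts.most_common(1)[0]
    return item, n / len(recommendations)


def top_k_overlap(recs_a: list[str], recs_b: list[str], k: int) -> float:
    """Jaccard overlap of the k most recommended items in two datasets.

    Assumes both lists are non-empty and k >= 1.
    """
    top_a = {item for item, _ in Counter(recs_a).most_common(k)}
    top_b = {item for item, _ in Counter(recs_b).most_common(k)}
    return len(top_a & top_b) / len(top_a | top_b)


# Invented placeholder recommendations from two hypothetical datasets.
dataset_a = ["Movie A"] * 5 + ["Movie B"] * 3 + ["Movie C"] * 2
dataset_b = ["Movie A"] * 4 + ["Movie B"] * 2 + ["Movie D"] * 1

print(top_item_share(dataset_a))
print(top_k_overlap(dataset_a, dataset_b, k=2))
```

A high top-k overlap between datasets whose users and catalogs differ would point toward popularity inherited from pretraining rather than from the dataset itself.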
The study also found that reward models and LM-based judges are miscalibrated for responses that elicit divergent human preferences — they assume a single consensus notion of quality and fail to reward the pluralistic preferences that open-ended queries produce. This means the homogeneity is self-reinforcing: training on reward model scores optimizes for the consensus the hivemind already occupies.
Fiction is a concrete narrative-level instance of the hivemind — with per-model fingerprints layered on top. StoryScope ("Investigating idiosyncrasies in AI fiction") applies the convergence finding to creative writing and shows it operates at the level of narrative decisions, not just words. Across a parallel corpus where five LLMs (Claude, DeepSeek, Gemini, GPT, Kimi) each wrote stories to the same 10,272 prompts, the five models occupy a tight, well-separated cluster in narrative-feature space while human-authored stories scatter more widely — the hivemind effect translated from phrasing to plot, agency, and temporal structure (see Do AI stories explain their themes more than human stories do?). Crucially, the inter-model convergence coexists with detectable per-model fingerprints: Claude produces notably flat event escalation, GPT over-indexes on dream sequences, Gemini defaults to external character description, enabling 68.4% macro-F1 six-way authorship attribution. This refines the hivemind picture — models converge on a shared region of output space relative to humans, yet retain stable individual signatures relative to each other. The convergence is not total homogenization but a common cluster with distinguishable accents.
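For readers unfamiliar with the attribution metric, macro-F1 is the unweighted mean of per-class F1 scores, so every author class counts equally regardless of how many stories it contributes. The sketch below computes it from scratch; the labels and predictions are invented toy data, not StoryScope results. For orientation, uniform random guessing over six balanced classes would score roughly 1/6.

```python
def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2TP / (2TP + FP + FN); equivalent to the harmonic mean
        # of precision and recall.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy three-way example (invented labels and predictions).
true_authors = ["claude", "claude", "gpt", "gpt", "human", "human"]
pred_authors = ["claude", "gpt", "gpt", "gpt", "human", "claude"]
print(round(macro_f1(true_authors, pred_authors), 3))
```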
NoveltyBench (2025) provides the first benchmark-level quantification of mode collapse across 20 leading models. Evaluating models on prompts curated to elicit diverse answers (using filtered real-world queries), the study finds that current SOTA systems "generate significantly less diversity than human writers." A counterintuitive finding: larger models within a family often exhibit less diversity than their smaller counterparts, directly challenging the assumption that capability on standard benchmarks translates to generative utility. While in-context regeneration prompting strategies can elicit some diversity, the findings reveal "a fundamental lack of distributional diversity" that reduces utility for users seeking varied responses. The mode collapse is driven by alignment: today's aligned models produce lower-entropy distributions than earlier generations, and random sampling produces substantial near-duplicates. Source: Arxiv/Evaluations.
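The idea of a "lower-entropy distribution" with many near-duplicates can be illustrated by sampling a model several times on one prompt and summarizing the empirical distribution of its answers. The sketch below is a simplified stand-in, not NoveltyBench's metric: it treats two samples as duplicates only if they match exactly after lowercasing and whitespace normalization, and all names and sample strings are hypothetical.

```python
import math
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants match."""
    return " ".join(text.lower().split())


def distinct_count(samples: list[str]) -> int:
    """Number of distinct responses after normalization."""
    return len({normalize(s) for s in samples})


def empirical_entropy(samples: list[str]) -> float:
    """Shannon entropy (bits) of the empirical response distribution."""
    counts = Counter(normalize(s) for s in samples)
    n = len(samples)
    # sum of p * log2(1/p) over distinct responses
    return sum((c / n) * math.log2(n / c) for c in counts.values())


# Invented samples for one prompt, e.g. "Name a color."
collapsed = ["Blue", "blue", "Blue ", "blue", "Red"]
varied = ["Blue", "Red", "Teal", "Ochre", "Mauve"]

for name, samples in [("collapsed", collapsed), ("varied", varied)]:
    print(name, distinct_count(samples), round(empirical_entropy(samples), 3))
```

Exact matching undercounts near-duplicates that differ in wording, so a semantic equivalence check would be needed to measure the kind of diversity the benchmark targets.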
Source (enrichment): Co Writing Collaboration — "StoryScope: Investigating idiosyncrasies in AI fiction", https://arxiv.org/abs/2604.03136
Inquiring lines that use this note as a source (85)
This note is a source for the following synthesized inquiries.
- What happens when models train on AI-generated content recursively?
- Why do different AI models generate similar outputs independently?
- Why do different language models independently produce similar outputs?
- Can AI output be genuinely novel or only at the margins?
- Do AI-generated posts crowd out human voices without any coordination or intent?
- Why do multiple language models independently produce similar outputs in influence campaigns?
- Why does RLHF alignment reduce the diversity of viewpoints in AI output?
- What happens to solidarity and community signaling when AI smooths out voice differences?
- Can few-shot examples narrow generative diversity in creative tasks?
- Why do sigmoid conflict curves look the same across different language models?
- Does optimizing directly for semantic diversity improve both reasoning quality and exploration?
- Does alignment training create bidirectional instruction and response mappings?
- Can a single AI system optimize multiple alignment dimensions simultaneously?
- Do language models inherit gender bias from training data in grading tasks?
- Why does diversity without expertise produce worse results than a single capable agent?
- Why did three experts reach incompatible conclusions about the same AI system?
- How much alignment data does a language model actually need to specialize well?
- When should model isolation be preferred over weight-averaging approaches?
- Why does AI output show diversity without multiplying actual points of view?
- What semantic classifier design avoids lexical variation without genuine conceptual distinctness?
- How do you verify whether your context distribution satisfies covariate diversity?
- How do ensemble methods apply within a single model?
- Can diverse human creativity survive if all AI systems converge on similar outputs?
- What happens to idea diversity when AI tools draw from collective knowledge?
- How does generative variability intensify the problem of passive AI systems?
- Can AI models be steered between liberal and conservative political framings?
- Why do different language models independently converge toward similar outputs in open-ended generation?
- How does tokenization toward corpus mean affect downstream output diversity?
- Can distinctive input voices maintain accuracy without adopting the model's preferred register?
- How does mutual shaping through diverse training compare to population-level diversity effects?
- Can structural diversity through role assignment replace emergent diversity in small models?
- What performance trade-offs emerge when composing multiple independently trained model capabilities?
- Does single model persona diversity match true multi-model diversity at scale?
- Why do smaller and larger models converge on different output formats?
- Can diversity-aware RL objectives prevent format convergence?
- What creates the irreducible trade-off between quality and diversity in training data?
- Does self-generated training data reduce a model's capability diversity?
- Can archived AI outputs ever form a representative searchable corpus?
- Can expert vectors learned offline transfer across multiple model architectures?
- Do different AI models independently converge on the same social outputs?
- Can models converge on similar experience descriptions across different architectures?
- Why does post-training suppress alignment faking in some models but amplify it in others?
- Can shifting the accuracy metric itself eliminate the need for diversity post-processing?
- How can semantic diversity optimization work if exploration and exploitation were truly opposed?
- Can AI models predict whether alignment reads as warmth versus mockery in different cultures?
- Why do language models presume common ground instead of building it?
- How many distinct quasi-persons does a single language model actually support?
- Does critique training improve exploration diversity during model training or only test time?
- How does joint backpropagation differ from training separate ensemble models?
- How does training distribution shape what language models understand best?
- Can language models learn to diversify their discourse-level narrative patterns over time?
- What alignment procedures cause different models to share the same output distribution?
- Can explicitly optimizing for semantic diversity during RL training improve both quality and variation?
- How do quality thresholds change which model produces more usable diversity?
- What happens to model grounding when preference optimization increases effective diversity?
- Can detectors trained for one task reliably perform differently on unexpected text sources?
- How do lexical diversity patterns specifically improve AI detection accuracy?
- Why do newer AI models diverge further from human text patterns?
- Can rarity in feature space distinguish human authorship from AI output reliably?
- Do independent LLM outputs converge enough to create artificial hiveminds?
- How should we evaluate diversity differently across programming and creative tasks?
- Why does semantic diversity matter more than surface lexical diversity?
- What makes creative writing diversity different from code diversity fundamentally?
- When does RLHF reduce diversity and when does it preserve semantic variation?
- Can specialized components replace single fully-trained models in deployment?
- Why do preference-tuned models produce different diversity patterns in code versus creative writing?
- How does probability mass concentration affect sampling diversity across model scales?
- At what point does output quality outweigh diversity value in synthetic data tasks?
- Does the same spectral signature appear across different embedding models?
- Does token-level loss aggregation help aligned models differently?
- Does semantic diversity in output space compete with reward-component diversity?
- How much does diversity training cost in single-shot pass@1 performance?
- Which aggregation method best exploits diversity in generated solutions?
- Why does diversity in LLM outputs mask sampling from community priors?
- Can weak models supervise the alignment of stronger models effectively?
- Do language models favor outputs from their own model family?
- How do ensemble methods reduce bias in automated evaluation?
- How do AI researcher forecasts compare across different timeline question phrasings?
- Why do unified models still inherit data-distribution biases from training?
- Can AI-assisted alignment eventually solve fairness at scale?
- Why do more capable language models benefit more from diversity elicitation?
- Does verbalized sampling preserve factual accuracy and safety during diversity gains?
- Can decoding-time prompting strategies fully replace diversity-focused training methods?
- Does a single LLM judge capture diverse human preferences in alignment training?
- How do complexity and diversity affect model performance differently?
Related concepts in this collection (4)
- Does polished AI output trick audiences into trusting it?
  When AI generates professional-looking graphs, diagrams, and presentations, do audiences mistake visual polish for analytical depth? This matters because appearance might substitute for actual expertise.
  Relation to this note: hivemind makes all AI artifacts sound similar
- Why do LLMs generate novel ideas from narrow ranges?
  LLM research agents produce individually novel ideas but cluster them in homogeneous sets. This explores why high average novelty coexists with poor diversity coverage and what it means for automated ideation.
  Relation to this note: research ideation collapse as specific instance of general hivemind
- Why do preference models favor surface features over substance?
  Preference models show systematic bias toward length, structure, jargon, sycophancy, and vagueness, features humans actively dislike. Understanding this 40% divergence reveals whether it stems from training data artifacts or architectural constraints.
  Relation to this note: reward model miscalibration reinforces homogeneity
- Why do multi-agent LLM systems converge without genuine deliberation?
  Multi-agent reasoning systems are designed to improve answers through debate, but often agents simply agree with early confident claims rather than genuinely disagreeing. What drives this pattern and how common is it?
  Relation to this note: hivemind at generation level parallels silent agreement at reasoning level
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)
- NoveltyBench: Evaluating Language Models for Humanlike Diversity
- Creativity Has Left the Chat: The Price of Debiasing Language Models
- ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs
- Learning Pluralistic User Preferences through Reinforcement Learning Fine-tuned Summaries
- Evaluating the Diversity and Quality of LLM Generated Content
- Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models
- Training language models to follow instructions with human feedback
Original note title
different LLMs independently converge on similar outputs in open-ended generation — the artificial hivemind effect means model diversity does not produce idea diversity