Do different AI models actually produce diverse outputs?

Explores whether using multiple different language models together creates genuine diversity or whether shared training and alignment cause them to converge on similar answers despite independence.

Synthesis note · 2026-03-27 · sourced from Foundation Models

INFINITY-CHAT studied 70+ open and closed source LLMs across 26K real-world open-ended queries that admit a wide range of plausible answers with no single ground truth. The findings reveal a pronounced "Artificial Hivemind" effect characterized by two distinct phenomena:

Intra-model repetition: a single model consistently generates similar responses to the same prompt across runs.
Inter-model homogeneity: different models independently produce strikingly similar outputs, sometimes verbatim. DeepSeek-V3 and GPT-4o, for example, generated overlapping phrases such as "Elevate your iPhone with our" and "sleek, without compromising." In some cases, models from the same family output identical responses.
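The two phenomena above can be told apart with a simple pairwise-similarity measurement. The following is a minimal sketch, assuming word-bigram Jaccard overlap as the similarity measure; that choice, the function names, and the toy responses are illustrative assumptions and not the method or data used by INFINITY-CHAT.

```python
from itertools import combinations, product


def ngrams(text, n=2):
    """Return the set of word n-grams in a lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap of the bigram sets of two responses (1.0 = same bigram set)."""
    ga, gb = ngrams(a), ngrams(b)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


def intra_model_similarity(samples):
    """Mean similarity over all pairs of responses from ONE model to one prompt."""
    pairs = list(combinations(samples, 2))
    if not pairs:
        raise ValueError("need at least two samples")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def inter_model_similarity(samples_a, samples_b):
    """Mean similarity over all cross pairs between TWO models' responses."""
    pairs = list(product(samples_a, samples_b))
    if not pairs:
        raise ValueError("need at least one sample per model")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Toy responses (invented for illustration, not taken from the study).
model_a = [
    "elevate your iphone with our sleek case",
    "elevate your iphone with our slim case",
]
model_b = [
    "elevate your iphone with our sleek cover",
    "protect your phone in style",
]

print("intra-model (A):", intra_model_similarity(model_a))
print("inter-model (A vs B):", inter_model_similarity(model_a, model_b))
```

A high intra-model score indicates repetition across runs of one model. A high inter-model score indicates that an ensemble of the two models would add little diversity. An embedding-based similarity would capture paraphrase as well as verbatim overlap, which surface n-grams miss.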

The inter-model effect is the more concerning finding. Model ensembles — using multiple different models to increase diversity — may not yield true diversity when their constituents share overlapping alignment and training priors. The convergence is not just stylistic but substantive: models converge on the same ideas, not just the same words.

This has direct implications for the False Punditry argument. Building on the note "Does polished AI output trick audiences into trusting it?", the hivemind effect means that AI-generated social media content will sound similar regardless of which model generates it. The "diversity" of AI voices on social media is illusory: different accounts using different models will produce strikingly similar analysis, framing, and conclusions, creating a false consensus that looks like independent agreement.

The note "Why do LLMs generate novel ideas from narrow ranges?" documents diversity collapse in research ideation; the hivemind effect extends that finding from research ideas to all open-ended generation. The ideation collapse is a specific instance of a general phenomenon: LLMs trained on overlapping data with similar alignment procedures converge on a shared distribution of outputs.

Recommendation as a concrete domain instance. LLM-based conversational recommender systems exhibit the hivemind in a specific, measurable way: "the most popular items such as The Shawshank Redemption appear around 5% of the time" across different recommendation datasets, and "the recommended popular items are similar across different datasets, which may reflect the item popularity in the pre-training corpus of LLMs" (Large Language Models as Zero-Shot Conversational Recommenders). The convergence is not on quality or relevance but on pretraining-distribution popularity — the same items surface regardless of the user's context or the dataset's actual popularity distribution. This is the hivemind effect translated from open-ended generation to decision-making: LLMs don't just write the same things, they recommend the same things.
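The popularity concentration described above is straightforward to measure. The sketch below assumes recommendations have already been collected as flat lists of item titles, one list per dataset; the function names and the toy lists are hypothetical, and the counts are invented rather than taken from the cited paper.

```python
from collections import Counter


def top_item_share(recommendations):
    """Return the most frequently recommended item and its share of all recommendations."""
    if not recommendations:
        raise ValueError("no recommendations given")
    item, count = Counter(recommendations).most_common(1)[0]
    return item, count / len(recommendations)


def top_k_overlap(recs_a, recs_b, k=2):
    """Fraction of the top-k most recommended items shared by two datasets."""
    top_a = {item for item, _ in Counter(recs_a).most_common(k)}
    top_b = {item for item, _ in Counter(recs_b).most_common(k)}
    return len(top_a & top_b) / k


# Toy recommendation logs for two different datasets (invented for illustration).
dataset_a = ["The Shawshank Redemption"] * 3 + ["Inception"] * 2 + ["Alien"]
dataset_b = ["The Shawshank Redemption"] * 4 + ["Inception"] * 2 + ["Up"]

print(top_item_share(dataset_a))
print(top_k_overlap(dataset_a, dataset_b, k=2))
```

If the same titles dominate the top-k across datasets whose users and catalogs differ, the concentration more plausibly reflects the model's pretraining prior than the datasets themselves.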

The INFINITY-CHAT study also found that reward models and LM-based judges are miscalibrated for responses that elicit divergent human preferences: they assume a single consensus notion of quality and fail to reward the pluralistic preferences that open-ended queries produce. This means the homogeneity is self-reinforcing: training on reward model scores optimizes for the consensus the hivemind already occupies.

Fiction is a concrete narrative-level instance of the hivemind — with per-model fingerprints layered on top. StoryScope ("Investigating idiosyncrasies in AI fiction") applies the convergence finding to creative writing and shows it operates at the level of narrative decisions, not just words. Across a parallel corpus where five LLMs (Claude, DeepSeek, Gemini, GPT, Kimi) each wrote stories to the same 10,272 prompts, the five models occupy a tight, well-separated cluster in narrative-feature space while human-authored stories scatter more widely — the hivemind effect translated from phrasing to plot, agency, and temporal structure (see Do AI stories explain their themes more than human stories do?). Crucially, the inter-model convergence coexists with detectable per-model fingerprints: Claude produces notably flat event escalation, GPT over-indexes on dream sequences, Gemini defaults to external character description, enabling 68.4% macro-F1 six-way authorship attribution. This refines the hivemind picture — models converge on a shared region of output space relative to humans, yet retain stable individual signatures relative to each other. The convergence is not total homogenization but a common cluster with distinguishable accents.
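For readers unfamiliar with the attribution metric, macro-F1 averages per-class F1 scores with equal weight, so a classifier cannot score well by recognizing only one author. The sketch below is a plain-Python illustration with three invented labels and invented predictions; it is not StoryScope's classifier or data, and this note does not spell out what the sixth class in the six-way setup is.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all labels seen in either list."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy authorship labels (invented for illustration).
true_authors = ["claude", "claude", "gpt", "gpt", "human", "human"]
predicted = ["claude", "gpt", "gpt", "gpt", "human", "claude"]

print(round(macro_f1(true_authors, predicted), 4))
```

A macro-F1 well above chance on a six-way task is what supports the "distinguishable accents" reading: the models sit in one cluster relative to humans, yet remain separable from each other.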

NoveltyBench (2025) provides the first benchmark-level quantification of mode collapse across 20 leading models. Evaluating models on prompts curated to elicit diverse answers (using filtered real-world queries), the study finds that current SOTA systems "generate significantly less diversity than human writers." A counterintuitive finding: larger models within a family often exhibit less diversity than their smaller counterparts, directly challenging the assumption that capability on standard benchmarks translates to generative utility. While in-context regeneration prompting strategies can elicit some diversity, the findings reveal "a fundamental lack of distributional diversity" that reduces utility for users seeking varied responses. The mode collapse is driven by alignment: today's aligned models produce lower entropy distributions than earlier generations, and random sampling produces substantial near-duplicates. Source: Arxiv/Evaluations.
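The link between low-entropy distributions and near-duplicate samples can be made concrete with an empirical entropy estimate over repeated generations. This is a minimal sketch, assuming responses are grouped by exact match after lowercasing and trimming, which is a crude stand-in for a real equivalence judgment; it does not reproduce NoveltyBench's own metric, and the sample lists are invented.

```python
import math
from collections import Counter


def normalize(response):
    """Crude equivalence key: lowercase and strip surrounding whitespace."""
    return response.strip().lower()


def empirical_entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution over normalized responses."""
    if not samples:
        raise ValueError("no samples given")
    counts = Counter(normalize(s) for s in samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def distinct_count(samples):
    """Number of distinct normalized responses among the samples."""
    return len({normalize(s) for s in samples})


# Eight samples for one prompt from two hypothetical models (invented for illustration).
collapsed = ["Blue"] * 7 + ["green"]
diverse = ["blue", "green", "red", "teal", "amber", "violet", "grey", "pink"]

print(distinct_count(collapsed), empirical_entropy(collapsed))
print(distinct_count(diverse), empirical_entropy(diverse))
```

With eight samples the entropy is capped at 3 bits, reached only when every sample is distinct. A mode-collapsed model sits far below that cap because most of its samples fall into one equivalence class.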

Source (enrichment): Co Writing Collaboration — "StoryScope: Investigating idiosyncrasies in AI fiction", https://arxiv.org/abs/2604.03136

Related concepts in this collection (4)

Does polished AI output trick audiences into trusting it? When AI generates professional-looking graphs, diagrams, and presentations, do audiences mistake visual polish for analytical depth? This matters because appearance might substitute for actual expertise.
hivemind makes all AI artifacts sound similar
Why do LLMs generate novel ideas from narrow ranges? LLM research agents produce individually novel ideas but cluster them in homogeneous sets. This explores why high average novelty coexists with poor diversity coverage and what it means for automated ideation.
research ideation collapse as specific instance of general hivemind
Why do preference models favor surface features over substance? Preference models show systematic bias toward length, structure, jargon, sycophancy, and vagueness—features humans actively dislike. Understanding this 40% divergence reveals whether it stems from training data artifacts or architectural constraints.
reward model miscalibration reinforces homogeneity
Why do multi-agent LLM systems converge without genuine deliberation? Multi-agent reasoning systems are designed to improve answers through debate, but often agents simply agree with early confident claims rather than genuinely disagreeing. What drives this pattern and how common is it?
hivemind at generation level parallels silent agreement at reasoning level