Do LLMs overgeneralize when summarizing scientific research?
When LLMs summarize science papers, do they drop important qualifiers and scope limits? This matters because such summaries might mislead readers about what findings actually show.
When LLMs summarize science, they tend to drop the qualifiers that bound a study's conclusions, turning "in this sample, under these conditions" into a universal claim. In a comparison of 4,900 LLM-generated summaries with their source texts across ten models, most models overgeneralized: DeepSeek, ChatGPT-4o, and LLaMA 3.3 70B did so in 26-73% of cases, and in a head-to-head comparison, LLM summaries were nearly five times more likely than human-authored ones to contain broad generalizations (odds ratio 4.85).
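For readers unfamiliar with the statistic, here is a minimal sketch of how an odds ratio like the 4.85 quoted above can be read off a 2x2 table of counts. The function name `odds_ratio` and the counts in the usage example are hypothetical and chosen only for easy arithmetic; they are not the study's data, and the study's own estimate may come from a different procedure (for example, a regression model).

```python
def odds_ratio(llm_with, llm_without, human_with, human_without):
    """Odds ratio from a 2x2 table of summary counts.

    llm_with / llm_without: LLM summaries with / without an overgeneralization.
    human_with / human_without: the same split for human-written summaries.
    All four names are illustrative, not taken from the study.
    """
    if min(llm_without, human_with, human_without) <= 0:
        raise ValueError("odds ratio is undefined when a required cell is zero")
    llm_odds = llm_with / llm_without        # odds that an LLM summary overgeneralizes
    human_odds = human_with / human_without  # odds that a human summary overgeneralizes
    return llm_odds / human_odds


# Made-up counts for illustration only (NOT the study's data):
# 60 of 100 LLM summaries vs 20 of 100 human summaries overgeneralize.
example = odds_ratio(60, 40, 20, 80)
print(f"odds ratio = {example:.2f}")
```

An odds ratio compares odds, not probabilities: in the made-up example the LLM rate is three times the human rate (60% vs 20%), yet the odds ratio is 6. "Nearly five times more likely" is therefore best read as a statement about odds.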
Two further findings sharpen this result into a design warning. First, prompting for accuracy backfires: asking explicitly for a summary "faithful to the original text" produced roughly twice the overgeneralization of a plain summarization request. This extends the pattern that accuracy-intended instructions can be counterproductive. Second, newer models were worse than earlier ones, so this is not a defect that scale and iteration are erasing.
The consequence for science communication is direct: LLMs systematically inflate the scope of findings, and the obvious mitigation (telling the model to be accurate) is unreliable. This is the summarization-side complement to "Can models express uncertainty instead of just answering?": overgeneralization is dropped epistemic qualification, the inverse of faithful uncertainty. It also grounds "Does polished AI output trick audiences into trusting it?" in a measured science-communication harm.
Related concepts in this collection 3
- Can models express uncertainty instead of just answering?
  Most factuality work expands what models know rather than what they know they know. Can expressing calibrated uncertainty create a third path between confident errors and unhelpful abstention?
  overgeneralization is the inverse: dropping the epistemic qualifiers faithful uncertainty would preserve
- Does polished AI output trick audiences into trusting it?
  When AI generates professional-looking graphs, diagrams, and presentations, do audiences mistake visual polish for analytical depth? This matters because appearance might substitute for actual expertise.
  fluent overgeneralized summaries carry unearned authority
- Can RAG systems refuse to answer without reliable evidence?
  Explores whether retrieval-augmented generation can be designed to abstain from answering when sources are corrupted or insufficient, rather than filling gaps with plausible-sounding guesses. This matters for historical text where OCR errors and language drift are common.
  both concern fidelity to source; here the failure is scope inflation rather than confabulation
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Generalization Bias in Large Language Model Summarization of Scientific Research
- Has the Creativity of Large-Language Models peaked? —an analysis of inter- and intra-LLM variability —
- LLM Augmentations to support Analytical Reasoning over Multiple Documents
- The LLM Fallacy: Misattribution in AI-Assisted Cognitive Workflows
- The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs
- AI Meets the Classroom: When Does ChatGPT Harm Learning?
- From Prompt Engineering to Prompt Science With Human in the Loop
- MODS: Moderating a Mixture of Document Speakers to Summarize Debatable Queries in Document Collections
Original note title
LLM science summaries systematically overgeneralize beyond the source and prompting for accuracy backfires — newer models are worse