Can LLMs generate more novel ideas than human experts?
Research shows LLM-generated ideas score higher for novelty than expert-generated ones, yet LLMs avoid the evaluative reasoning that characterizes expert thinking. What explains this apparent contradiction?
Two findings appear to conflict: LLM-generated research ideas are rated more novel than expert-generated ones (Si et al. 2024), yet LLM academic writing systematically avoids the evaluative and evidential nouns that characterize expert intellectual work — it prefers manner nouns (describing process) over status nouns (assessing claims) and evidential nouns (grounding in evidence). How can a system that avoids evaluative stance-taking produce ideas judged more novel than those of evaluating experts?
The resolution: generation and evaluation are dissociated cognitive operations, and LLMs are far more capable at the first than at the second.
Generation: Combining existing concepts in new configurations. LLMs have combinatorial range that exceeds human disciplinary range — they are not anchored by domain priors or professional reputation costs. A human expert generates ideas constrained by what is tractable, publishable, and consistent with their existing commitments. LLMs face none of these constraints. The result is wider combinatorial reach, which produces higher novelty scores.
Evaluation: Assessing whether a generated idea is correct, feasible, important, or properly evidenced. This requires epistemic commitment: making a judgment call and defending it. As argued in "Should we call LLM errors hallucinations or fabrications?", LLMs have no internal corrective mechanism: the same generative process produces both their accurate and their inaccurate claims, and it cannot tell them apart. Evaluative stance-taking requires exactly this distinction.
The dissociation explains the feasibility gap: LLM ideas are more novel and less feasible. Novelty comes from unconstrained combinatorics; infeasibility comes from the absence of evaluation that would filter out the implausible combinations. Human experts generate fewer novel ideas because they self-evaluate more aggressively during generation.
This has implications for how to use LLMs in research workflows: they are combinatorial idea generators, not evaluators. The appropriate workflow pairs LLM generation with human evaluation, not LLM evaluation of LLM ideas.
Human approval becomes the structural bottleneck. The asymmetry has a workflow consequence beyond appropriate pairing: as AI-generated volume increases, evaluation, which remains on the human side, becomes the capacity-limiting step. AI generates faster than humans can evaluate. As discussed in "What collaboration level do workers actually want with AI?", the desired partnership shape aligns with this constraint: humans do not want to be sidelined, but they also cannot keep up if their role is saturated with approval work. The bottleneck shifts from production (where AI excels) to validation (which AI cannot do for itself), and the ergonomic consequence is that the human reviewer's cognitive load scales with AI throughput. Workflows designed without regard for this bottleneck accumulate volume faster than it can be validated, and unvalidated output becomes the default rather than the exception.
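The bottleneck argument can be made concrete with a toy queue model. The sketch below is illustrative only: the function name `simulate_backlog` and the rates passed to it are hypothetical and are not drawn from any study cited in this note. It shows that whenever daily generation exceeds daily review capacity, the unvalidated backlog grows by the difference every day.

```python
def simulate_backlog(generated_per_day: int, reviewed_per_day: int, days: int) -> list[int]:
    """Return the number of unvalidated ideas left at the end of each day.

    Assumes constant rates (a simplification): the AI adds a fixed number
    of ideas per day and human reviewers clear at most a fixed number.
    """
    backlog = 0
    history = []
    for _ in range(days):
        backlog += generated_per_day               # AI adds new ideas
        backlog -= min(backlog, reviewed_per_day)  # humans clear what they can
        history.append(backlog)
    return history


if __name__ == "__main__":
    # Hypothetical rates: 50 ideas generated and 8 reviewed per day.
    print(simulate_backlog(50, 8, 5))  # [42, 84, 126, 168, 210]
    # When review capacity exceeds generation, the backlog stays empty.
    print(simulate_backlog(5, 8, 3))   # [0, 0, 0]
```

In this simplified model the only ways to keep the backlog bounded are to cap generation at review capacity or to raise review capacity, which is the design constraint the paragraph above describes.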
Empirical closure: the ideation-execution gap. A large-scale execution study (N=43 experts, 100+ hours each; The Ideation-Execution Gap) provides direct evidence: when LLM-generated and human ideas are randomly assigned to expert implementers, "the scores of the LLM-generated ideas decrease significantly more than expert-written ideas on all evaluation metrics (novelty, excitement, effectiveness, and overall; p<0.05), closing the gap between LLM and human ideas observed at the ideation stage." Execution reveals weaknesses invisible at ideation: missing baselines, impractical evaluation methods, poor generalizability. LLM ideas systematically propose evaluations that require recruiting human experts, and executors always change these plans. See "Do LLM research ideas actually hold up when experts try to execute them?"
The literary criticism case: Literary criticism is the domain where the ideation-evaluation dissociation is most consequential, because criticism requires both operations simultaneously. A critic must identify what a text does (the generative or recognition side: which devices are present, what patterns emerge) and judge whether it succeeds (the evaluative side: does this metaphor work, does this structure serve the argument, is this ambiguity productive or merely confusing). LLMs can perform the first operation impressively: detecting rhetorical devices, extracting metaphoric mappings, identifying stylistic signatures. They cannot perform the second. As explored in "Can LLMs truly understand literary meaning or just mechanics?", literary analysis is where the dissociation stops being an interesting theoretical observation and becomes a functional barrier.
Inquiring lines that use this note as a source 17
This note is a source for these synthesized inquiries.
- Why do LLMs generate ideas that sound novel but fail during execution?
- Do LLMs match top human creative writers in literary quality?
- Why does LLM research ideation collapse into low diversity despite high novelty?
- How can LLMs evaluate their own creative outputs for utility and novelty?
- Why do LLM-generated ideas score higher novelty yet lower feasibility than expert ideas?
- Why do LLMs plateau on creativity tasks while humans reach further?
- Can LLMs reliably assess the quality of ideas they generate?
- Why do LLM research ideas lack diversity despite high average novelty?
- Why do LLMs generate novel ideas but lack evaluative commitment?
- Do LLMs generate more novel ideas than they can evaluate?
- Why do models generate creative ideas but fail to evaluate their legitimacy?
- Why do LLMs generate novel ideas but struggle to evaluate them?
- What makes novelty assessment harder to automate than idea generation?
- Can LLMs generate more novel research ideas than human experts?
- Do novelty and feasibility always trade off in idea generation?
- Which LLM backends produce the most executable research ideas?
- Why are AI research ideas more novel but harder to evaluate than human ones?
Related concepts in this collection 8
- Do language models generate more novel research ideas than experts?
  Explores whether LLMs can break free from expert constraints to generate more novel research concepts. Matters because novelty is often thought to be AI's creative blind spot.
  the generation-side finding; this note explains why novelty without feasibility is the expected outcome
- Why do ChatGPT essays lack evaluative depth despite grammatical strength?
  ChatGPT writes grammatically coherent academic prose but uses fewer evaluative and evidential nouns than student writers. The question explores whether this rhetorical gap, favoring description over argument, reflects a fundamental limitation in how LLMs approach academic writing.
  the evaluation-side finding; structurally coherent but evaluatively absent
- Should we call LLM errors hallucinations or fabrications?
  Does the language we use to describe LLM failures shape the technical solutions we build? Examining whether perceptual and psychological frameworks misdiagnose what's actually happening.
  grounds: no internal corrective mechanism means evaluation and generation are not coupled; what is generated is not assessed before output
- Can imitating ChatGPT fool evaluators into thinking models improved?
  Explores whether fine-tuning weaker models on ChatGPT outputs creates an illusion of capability gains. Investigates why human raters and automated judges fail to detect that imitation improves style but not underlying factuality or reasoning.
  practical consequence of the dissociation: imitation models capture the generative style (combinatorial fluency) while missing factual grounding (evaluative accuracy), because imitation training optimizes the generation side that LLMs are already good at
- Does chatbot interaction trade authenticity for better problem-solving?
  When students solve problems with AI chatbots instead of peers, do they sacrifice personal voice and subjective expression in exchange for more efficient knowledge exchange and higher task performance?
  the dissociation manifests in educational settings: chatbots provide efficient knowledge generation (the combinatorial side) but the absence of evaluative stance-taking means students stop articulating and defending their own positions, mirroring the generation-without-evaluation pattern
- Can LLMs reason creatively beyond conventional problem-solving?
  Explores whether large language models can engage in truly creative reasoning that expands or redefines solution spaces, rather than just decomposing known problems. This matters because existing reasoning methods may miss creative capabilities entirely.
  UoT's three-axis evaluation (feasibility + utility + novelty) directly addresses the evaluation gap: the dissociation means LLMs can generate across all three creative paradigms but cannot assess which outputs are feasible or useful without an external evaluative framework
- Why do LLMs generate novel ideas from narrow ranges?
  LLM research agents produce individually novel ideas but cluster them in homogeneous sets. This explores why high average novelty coexists with poor diversity coverage and what it means for automated ideation.
  practical manifestation of evaluation dissociation: models cannot assess that they are repeatedly sampling from the same high-novelty cluster; self-evaluation failures prevent recognizing when diversity has collapsed, making the generation-without-evaluation pattern visible at the population level
- Why do LLMs excel at feasible design but struggle with novelty?
  When LLMs generate conceptual product designs, they produce more implementable and useful solutions than humans but fewer novel ones. This explores why domain constraints flip the novelty advantage seen in research ideation.
  domain inversion: in constrained design domains where evaluation criteria are embedded in the prompt (feasibility, usefulness ratings), models channel generation toward conservative solutions; the dissociation flips: evaluation constraints suppress novelty rather than being absent from it
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas
- Has the Creativity of Large-Language Models peaked? An analysis of inter- and intra-LLM variability
- Agent Laboratory: Using LLM Agents as Research Assistants
- Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers
- The LLM Fallacy: Misattribution in AI-Assisted Cognitive Workflows
- Explain-Query-Test: Self-Evaluating LLMs Via Explanation and Comprehension Discrepancy
- Conceptual Design Generation Using Large Language Models
- Unlocking Varied Perspectives: A Persona-Based Multi-Agent Framework with Debate-Driven Text Planning for Argument Generation
Original note title
llm ideation and evaluation are dissociated — combinatorial generation can exceed human novelty while evaluative stance-taking remains structurally absent