Do LLM research ideas actually hold up when experts try to execute them?
Explores whether LLM-generated ideas maintain their apparent novelty advantage when expert researchers spend 100+ hours implementing them. Matters because ideation-stage evaluation may not capture real-world feasibility barriers.
The ideation novelty finding (Si et al. 2025) showed that LLM-generated research ideas were rated significantly more novel than human expert ideas. This execution study provides the empirical reality check: when 43 expert researchers each spend over 100 hours implementing randomly assigned ideas and writing 4-page papers, the novelty advantage disappears.
Comparing review scores before and after execution, "the scores of the LLM-generated ideas decrease significantly more than expert-written ideas on all evaluation metrics (novelty, excitement, effectiveness, and overall; p<0.05), closing the gap between LLM and human ideas observed at the ideation stage." For many metrics, there is a ranking flip where human ideas score higher after execution.
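The before/after comparison described above can be sketched as a difference in mean score changes between the two conditions. This is a minimal illustration, not the study's analysis code: the function names, the 1-10 scale, and every score below are hypothetical placeholders invented only to exercise the arithmetic.

```python
from statistics import mean


def score_change(before, after):
    """Per-idea change in review score (after execution minus before)."""
    if len(before) != len(after):
        raise ValueError("before and after must have the same length")
    return [a - b for b, a in zip(before, after)]


def ideation_execution_gap(llm_before, llm_after, human_before, human_after):
    """Compare how much each condition's scores move once ideas are executed."""
    llm_delta = mean(score_change(llm_before, llm_after))
    human_delta = mean(score_change(human_before, human_after))
    return {
        "llm_mean_change": llm_delta,
        "human_mean_change": human_delta,
        # Negative gap: LLM ideas lost more score than human ideas.
        "gap": llm_delta - human_delta,
        # Ranking flip: LLM ideas led at ideation but trail after execution.
        "ranking_flip": mean(llm_before) > mean(human_before)
        and mean(llm_after) < mean(human_after),
    }


# Hypothetical placeholder scores for one metric; NOT data from the study.
llm_before = [6, 7, 6, 5]
llm_after = [4, 5, 5, 4]
human_before = [5, 5, 6, 4]
human_after = [5, 4, 6, 4]

result = ideation_execution_gap(llm_before, llm_after, human_before, human_after)
print(result)
```

A real analysis would also need a significance test on the two sets of score changes; the sketch only shows the quantities being compared.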
The mechanism is specific: execution imposes feasibility constraints that ideation-stage evaluation cannot anticipate. "During execution, every single step has to be grounded in realistic execution constraints, which impose higher feasibility standards than the ideation stage." Reviewers discover weaknesses that only become visible through implementation: missing baselines, poor generalizability, impractical evaluation designs, and high resource requirements. AI-generated ideas systematically propose evaluations that require recruiting human experts, which executors always change to save cost and time.
This resolves the tension between "Can LLMs generate more novel ideas than human experts?" and "Why do LLMs excel at feasible design but struggle with novelty?" The ideation-evaluation dissociation is itself the problem: LLMs generate novel-sounding ideas precisely because they lack the evaluative capacity to recognize execution barriers. Novelty at ideation is a property of description quality, not executability. As "Why do LLMs generate novel ideas from narrow ranges?" notes, LLM ideas are individually novel but collectively homogeneous, so individual LLM ideas may be both novel and individually infeasible; the two findings compound rather than contradict.
The implication for AI-assisted research is that proxy evaluation (judging ideas without execution) systematically overestimates LLM contribution. "Objective metrics like feasibility and effectiveness are best judged via the actual execution outcomes rather than speculative judgment based on the ideas." This challenges any benchmark or evaluation that rates AI research capability without implementation.
Inquiring lines that use this note as a source 22
This note is a source for these synthesized inquiries.
- Why do LLMs generate ideas that sound novel but fail during execution?
- Can proxy evaluation of ideas accurately predict their quality without implementation?
- What specific execution barriers do LLM ideas encounter most frequently?
- Why does LLM research ideation collapse into low diversity despite high novelty?
- How can LLMs evaluate their own creative outputs for utility and novelty?
- Why does LLM knowledge fail to influence their actual outputs?
- Why do LLM-generated ideas score higher novelty yet lower feasibility than expert ideas?
- Can LLMs reliably assess the quality of ideas they generate?
- Why do LLM research ideas lack diversity despite high average novelty?
- What makes a novel research idea practically infeasible for implementation?
- Why do LLMs generate novel ideas but lack evaluative commitment?
- Do LLMs generate more novel ideas than they can evaluate?
- How do years of A/B testing compare to one-shot LLM content generation?
- Why do models generate creative ideas but fail to evaluate their legitimacy?
- Why do LLMs generate novel ideas but struggle to evaluate them?
- What makes novelty assessment harder to automate than idea generation?
- Can LLMs generate more novel research ideas than human experts?
- Can human researchers improve LLM ideas through iterative feedback?
- Do novelty and feasibility always trade off in idea generation?
- Which LLM backends produce the most executable research ideas?
- Can LLM diversity collapse in research ideation be reversed or mitigated?
- What distinguishes scientific plausibility from cognitive availability in research ideas?
Related concepts in this collection 3
- Can LLMs generate more novel ideas than human experts?
  Research shows LLM-generated ideas score higher for novelty than expert-generated ones, yet LLMs avoid the evaluative reasoning that characterizes expert thinking. What explains this apparent contradiction?
  the ideation-execution gap is the empirical consequence of this dissociation
- Why do LLMs generate novel ideas from narrow ranges?
  LLM research agents produce individually novel ideas but cluster them in homogeneous sets. This explores why high average novelty coexists with poor diversity coverage and what it means for automated ideation.
  compounds: individually novel + collectively homogeneous + individually infeasible
- Why do LLMs excel at feasible design but struggle with novelty?
  When LLMs generate conceptual product designs, they produce more implementable and useful solutions than humans but fewer novel ones. This explores why domain constraints flip the novelty advantage seen in research ideation.
  domain inversion: research ideas are novel-not-feasible while design solutions are feasible-not-novel
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas
- Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers
- Agent Laboratory: Using LLM Agents as Research Assistants
- Has the Creativity of Large-Language Models peaked? An analysis of inter- and intra-LLM variability
- What Does It Take to Be a Good AI Research Agent? Studying the Role of Ideation Diversity
- Conceptual Design Generation Using Large Language Models
- The LLM Fallacy: Misattribution in AI-Assisted Cognitive Workflows
- AI Meets the Classroom: When Does ChatGPT Harm Learning?
Original note title
LLM-generated research ideas suffer an ideation-execution gap — ideas rated as novel at ideation score significantly lower after expert execution on all metrics