Do LLM research ideas actually hold up when experts try to execute them?

Explores whether LLM-generated ideas maintain their apparent novelty advantage when expert researchers spend 100+ hours implementing them. Matters because ideation-stage evaluation may not capture real-world feasibility barriers.

Synthesis note · 2026-03-30 · sourced from Work Application Use Cases

The ideation novelty finding (Si et al. 2025) showed LLM-generated research ideas rated significantly more novel than human expert ideas. This execution study provides the empirical reality check: when 43 expert researchers each spend over 100 hours implementing randomly assigned ideas and writing 4-page papers, the novelty advantage disappears.

Comparing review scores before and after execution, "the scores of the LLM-generated ideas decrease significantly more than expert-written ideas on all evaluation metrics (novelty, excitement, effectiveness, and overall; p<0.05), closing the gap between LLM and human ideas observed at the ideation stage." On many metrics the ranking flips, with human ideas scoring higher than LLM ideas after execution.
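The before/after comparison can be read as a difference in differences: the score change of LLM ideas minus the score change of human ideas. The sketch below is only an illustration of that arithmetic. The function names and every score in it are hypothetical placeholders, not data or code from the study, and it omits the significance test behind the reported p<0.05.

```python
from statistics import mean


def score_drop(before, after):
    """Mean per-idea change from ideation to execution (after minus before)."""
    if len(before) != len(after):
        raise ValueError("each idea needs one score before and one after")
    return mean(a - b for b, a in zip(before, after))


def execution_gap(llm_before, llm_after, human_before, human_after):
    """Difference in differences; negative means LLM ideas dropped more."""
    return score_drop(llm_before, llm_after) - score_drop(human_before, human_after)


# Hypothetical placeholder scores for one metric, NOT figures from the study.
llm_before = [6.0, 7.0, 6.5, 7.5]
llm_after = [4.5, 5.0, 5.5, 5.0]
human_before = [5.0, 6.0, 5.5, 6.5]
human_after = [5.0, 5.5, 5.0, 6.0]

gap = execution_gap(llm_before, llm_after, human_before, human_after)

# A ranking flip: LLM ideas lead at ideation but trail after execution.
ranking_flip = (
    mean(llm_before) > mean(human_before)
    and mean(llm_after) < mean(human_after)
)

print(f"gap = {gap}, ranking flip = {ranking_flip}")
```

With these made-up numbers the LLM ideas start ahead and end behind, which is the pattern the note describes for many metrics.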

The mechanism is specific: execution imposes feasibility constraints that ideation-stage evaluation cannot anticipate. "During execution, every single step has to be grounded in realistic execution constraints, which impose higher feasibility standards than the ideation stage." Reviewers discover weaknesses that only become visible through implementation: missing baselines, poor generalizability, impractical evaluation designs, and high resource requirements. AI-generated ideas systematically propose evaluations that require recruiting human experts, which executors always change to save cost and time.

This resolves the tension between "Can LLMs generate more novel ideas than human experts?" and "Why do LLMs excel at feasible design but struggle with novelty?" The ideation-evaluation dissociation is the problem: LLMs generate novel-sounding ideas precisely because they lack the evaluative capacity to recognize execution barriers. Novelty at ideation is a property of description quality, not executability. Taken together with "Why do LLMs generate novel ideas from narrow ranges?", LLM ideas may be individually novel, collectively homogeneous, and individually infeasible; the findings compound rather than contradict.

The implication for AI-assisted research is that proxy evaluation (judging ideas without execution) systematically overestimates LLM contribution. "Objective metrics like feasibility and effectiveness are best judged via the actual execution outcomes rather than speculative judgment based on the ideas." This challenges any benchmark or evaluation that rates AI research capability without implementation.

Related concepts in this collection 3

Can LLMs generate more novel ideas than human experts? Research shows LLM-generated ideas score higher for novelty than expert-generated ones, yet LLMs avoid the evaluative reasoning that characterizes expert thinking. What explains this apparent contradiction?
the ideation-execution gap is the empirical consequence of this dissociation
Why do LLMs generate novel ideas from narrow ranges? LLM research agents produce individually novel ideas but cluster them in homogeneous sets. This explores why high average novelty coexists with poor diversity coverage and what it means for automated ideation.
compounds: individually novel + collectively homogeneous + individually infeasible
Why do LLMs excel at feasible design but struggle with novelty? When LLMs generate conceptual product designs, they produce more implementable and useful solutions than humans but fewer novel ones. This explores why domain constraints flip the novelty advantage seen in research ideation.
domain inversion: research ideas are novel-not-feasible while design solutions are feasible-not-novel

Original note title

LLM-generated research ideas suffer an ideation-execution gap — ideas rated as novel at ideation score significantly lower after expert execution on all metrics
