SYNTHESIS NOTE

Do LLM research ideas actually hold up when experts try to execute them?

Explores whether LLM-generated ideas maintain their apparent novelty advantage when expert researchers spend 100+ hours implementing them. Matters because ideation-stage evaluation may not capture real-world feasibility barriers.

Synthesis note · 2026-03-30 · sourced from Work Application Use Cases

The ideation novelty finding (Si et al. 2025) showed that LLM-generated research ideas were rated significantly more novel than human expert ideas. This execution study provides the empirical reality check: when 43 expert researchers each spend over 100 hours implementing randomly assigned ideas and writing 4-page papers, the novelty advantage disappears.

Comparing review scores before and after execution, "the scores of the LLM-generated ideas decrease significantly more than expert-written ideas on all evaluation metrics (novelty, excitement, effectiveness, and overall; p<0.05), closing the gap between LLM and human ideas observed at the ideation stage." For many metrics, there is a ranking flip where human ideas score higher after execution.
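The comparison described above is a difference in score changes: each idea is scored once at ideation and once after execution, and the per-idea drops are compared across the two groups. A minimal sketch of that logic follows. The scores are hypothetical toy values, not data from the study, and the permutation test is a stand-in chosen for illustration, since this note does not say which statistical test the authors used. The function names `score_drop` and `permutation_p_value` are likewise illustrative.

```python
import random
from statistics import mean


def score_drop(before, after):
    """Per-idea change in review score: execution score minus ideation score."""
    return [a - b for b, a in zip(before, after)]


def permutation_p_value(group_a, group_b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference of group means.

    Assumption: a stand-in test for illustration, not the study's method.
    """
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (extreme + 1) / (n_resamples + 1)


# Hypothetical toy scores on a 1-10 scale (NOT data from the study).
llm_before = [6.0, 6.5, 5.5, 7.0, 6.0, 6.5]
llm_after = [4.0, 4.5, 4.0, 5.0, 3.5, 4.5]
human_before = [5.0, 5.5, 4.5, 6.0, 5.0, 5.5]
human_after = [4.5, 5.0, 4.5, 5.5, 4.5, 5.5]

llm_drop = score_drop(llm_before, llm_after)
human_drop = score_drop(human_before, human_after)

# Negative gap: LLM ideas lose more score during execution than human ideas.
gap, p = permutation_p_value(llm_drop, human_drop)

# Ranking flip: LLM ideas lead at ideation but trail after execution.
flip = mean(llm_before) > mean(human_before) and mean(llm_after) < mean(human_after)

print(f"mean drop gap (LLM - human): {gap:.2f}, p = {p:.4f}, ranking flip: {flip}")
```

In this toy setup the LLM group starts ahead and ends behind, which is the shape of the "ranking flip" the note describes. With real review data the same comparison would be run separately for each metric (novelty, excitement, effectiveness, overall).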

The mechanism is precise: execution imposes feasibility constraints that ideation evaluation cannot anticipate. "During execution, every single step has to be grounded in realistic execution constraints, which impose higher feasibility standards than the ideation stage." Reviewers discover weaknesses that are visible only through implementation: missing baselines, poor generalizability, impractical evaluation designs, and high resource requirements. AI-generated ideas systematically propose evaluations that require recruiting human experts, and executors always replace these evaluations to save cost and time.

This resolves the tension between two related notes, "Can LLMs generate more novel ideas than human experts?" and "Why do LLMs excel at feasible design but struggle with novelty?". The ideation-evaluation dissociation is the problem: LLMs generate novel-sounding ideas precisely because they lack the evaluative capacity to recognize execution barriers. Novelty at ideation is a property of description quality, not executability. Combined with the finding in "Why do LLMs generate novel ideas from narrow ranges?", this means individual LLM ideas may be both novel and individually infeasible, so the two findings compound rather than contradict.

The implication for AI-assisted research is that proxy evaluation (judging ideas without execution) systematically overestimates LLM contribution. "Objective metrics like feasibility and effectiveness are best judged via the actual execution outcomes rather than speculative judgment based on the ideas." This challenges any benchmark or evaluation that rates AI research capability without implementation.

Original note title

LLM-generated research ideas suffer an ideation-execution gap — ideas rated as novel at ideation score significantly lower after expert execution on all metrics