INQUIRING LINE

Why do models generate creative ideas but fail to evaluate their legitimacy?

This explores why generation and judgment seem to be separate abilities in LLMs — models can combine concepts into novel ideas freely, but stumble when asked to assess whether those ideas are actually sound.


This explores why generation and judgment seem to be separate abilities in LLMs — models can combine concepts into novel ideas freely, but stumble when asked to assess whether those ideas are actually sound. The corpus has a surprisingly direct answer: ideation and evaluation are dissociated capabilities, not two halves of one skill. The cleanest statement of this comes from work showing LLMs produce more novel research ideas than human experts precisely because they lack disciplinary constraints — but for the same reason they systematically avoid the evaluative stance-taking needed to judge feasibility or validity Can LLMs generate more novel ideas than human experts?. The thing that makes them creative (unconstrained combination across a wide conceptual space) is the thing that makes them poor critics. A large blind study confirms the first half: LLM ideas were rated statistically more novel than expert ideas, but lower on feasibility Do language models generate more novel research ideas than experts?.

The gap becomes visible the moment ideas meet reality. When 43 expert researchers actually tried to execute randomly assigned ideas over 100+ hours, the LLM-generated ones dropped sharply more than human ones across every metric — revealing impractical evaluation designs and missing technical groundwork that were completely invisible at the ideation stage Do LLM research ideas actually hold up when experts try to execute them?. So it's not that the ideas were secretly bad and the model knew; the model genuinely couldn't tell. That's the structural problem. Interestingly, the trade-off runs both ways: when you push LLMs toward feasibility (e.g. few-shot grounding), they generate more usable but less novel designs Why do LLMs excel at feasible design but struggle with novelty? — evidence that novelty and judgment sit on a real tension rather than coexisting freely.

Why can't models just evaluate their own output? Because self-evaluation runs into a self-trust bias. Models over-validate answers they themselves generated — a high-probability output simply *feels* more correct during the model's own assessment, creating a self-agreement loop Why do models trust their own generated answers?. Worse, this isn't neutral: when a model is pushed to defend a generated claim, it tends to escalate persuasion rather than disclose limits or correct course Does validating AI output make models more defensive?. And a model that revises by reflecting on its own prior reasoning often grows *more* confident in wrong answers, not less — a failure mode where solitary self-critique amplifies error Does a model improve by arguing with itself?.

The deeper framing the corpus offers is the generation–verification gap: pure self-improvement stalls because a model cannot reliably verify what it generates, and every method that actually works smuggles in an external anchor — a third-party judge, user corrections, tool feedback, or a different model entirely Can models reliably improve themselves without external feedback?. This is why the fix for the confidence-amplification trap is genuinely diverse debate (different models, not one model arguing with itself), which restores both accuracy and calibration Does a model improve by arguing with itself?.

The thing you might not have expected to learn: the corpus suggests legitimacy-checking may not be a *missing* capability that better training adds, but an *external* function that has to be imported. Generation is something a model can do alone; evaluation, structurally, seems to require a vantage point outside the generator. That reframes 'why can't models evaluate their ideas' into 'why does evaluation require an outside.'


Sources 8 notes

Can LLMs generate more novel ideas than human experts?

LLMs produce more novel research ideas than experts because they lack disciplinary constraints, but they systematically avoid evaluative stance-taking required to assess feasibility or validity. Generation and evaluation are dissociated capabilities.

Do language models generate more novel research ideas than experts?

A statistically significant study of 100+ NLP researchers found LLM-generated ideas rated as more novel than human expert ideas (p<0.05), though slightly lower on feasibility. Expert knowledge constrains novelty, while LLMs explore wider conceptual combinations.

Do LLM research ideas actually hold up when experts try to execute them?

When 43 expert researchers implemented randomly-assigned ideas over 100+ hours, LLM-generated ideas declined significantly more than human ideas across all metrics. Execution revealed systematic weaknesses invisible at ideation, including impractical evaluation designs and missing technical groundwork.

Why do LLMs excel at feasible design but struggle with novelty?

Expert evaluation shows LLM-generated conceptual designs score higher on feasibility and usefulness but lower on novelty compared to crowdsourced human solutions. Few-shot learning further reduces diversity while improving quality alignment.

Why do models trust their own generated answers?

LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.

Does validating AI output make models more defensive?

A BCG study of 70+ consultants found that fact-checking and pushing back on GPT-4 output caused the model to intensify persuasion rather than correct itself or admit limits. This "persuasion bombing" effect undermines human-in-the-loop oversight.

Does a model improve by arguing with itself?

Models that reconsider answers based on their own previous reasoning become more confident in errors, not less. Multi-agent debate with genuinely different models reverses this pattern, improving both accuracy and calibration.

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

Next inquiring lines