INQUIRING LINE

Do novelty and feasibility always trade off in idea generation?

This explores whether the novelty-feasibility trade-off in idea generation is an iron law or an artifact of how we measure and structure ideation — and the corpus suggests it's softer and more decomposable than it looks.


Across the corpus, the tension between novelty and feasibility is real but not absolute. On one side, LLMs reliably produce solutions that score high on feasibility and usefulness but low on novelty [Why do LLMs excel at feasible design but struggle with novelty?]. On the other, multiple studies find the mirror image: LLM research ideas rate as *more* novel than human experts' but slightly less feasible [Do language models generate more novel research ideas than experts?; Why do LLMs generate more novel research ideas than experts?]. So the trade-off shows up in both directions depending on the task, which is the first clue that it isn't a fixed law of nature.

The more interesting question is *why* the trade-off appears. One reading is that novelty without disciplinary constraint is cheap: LLMs explore wider conceptual combinations precisely because they don't carry the expert's instinct for what won't work [Can LLMs generate more novel ideas than human experts?]. The cost surfaces only at execution: when 43 expert researchers spent 100+ hours implementing randomly assigned ideas, the LLM-generated ideas declined significantly more than human ideas across all metrics, revealing impractical evaluation designs and missing technical groundwork that were invisible at the ideation stage [Do LLM research ideas actually hold up when experts try to execute them?]. Read this way, novelty and feasibility don't trade off at the moment of generation; the bill for feasibility just comes due later. The 'trade-off' is partly a deferral.

There's also a measurement story that should make you suspicious of treating the trade-off as fundamental. In the closely related exploration-exploitation framing, hidden-state analysis found near-zero correlation between the two: the apparent trade-off emerges only when you measure at the token level, and a method that targets the hidden-state representation improved both at once [Is the exploration-exploitation trade-off actually fundamental?]. The same caution applies here: if novelty and feasibility look opposed, it may be because of how they're scored, not because of an underlying conflict. Tellingly, LLMs that produce individually novel ideas often cluster them in narrow regions, with high novelty per idea but low diversity across ideas [Why do LLMs generate novel ideas from narrow ranges?], which is not what you'd expect if a clean novelty-feasibility dial were the whole picture.
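The difference between novelty per idea and diversity across ideas can be made concrete with a toy calculation. The sketch below is illustrative only: the 2-D points stand in for idea embeddings, and the function names, the reference set, and the distance-based definitions are all assumptions made for this example, not metrics taken from the cited studies.

```python
from itertools import combinations
from math import dist
from statistics import mean


def per_idea_novelty(ideas, reference):
    """Average distance from each idea to its nearest reference point.

    Assumed definition: an idea is 'novel' if it sits far from prior work.
    """
    return mean(min(dist(idea, ref) for ref in reference) for idea in ideas)


def across_idea_diversity(ideas):
    """Average pairwise distance among the ideas themselves."""
    return mean(dist(a, b) for a, b in combinations(ideas, 2))


# Hypothetical 2-D "embeddings": prior work near the origin.
reference = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# Two idea sets that are both far from prior work.
clustered = [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]   # all in one region
spread = [(5.0, 5.0), (-5.0, 5.0), (5.0, -5.0)]    # different regions

for name, ideas in (("clustered", clustered), ("spread", spread)):
    print(
        name,
        round(per_idea_novelty(ideas, reference), 2),
        round(across_idea_diversity(ideas), 2),
    )
```

Under these assumed definitions, both sets score similarly on per-idea novelty, while only the spread set scores high on diversity. A single novelty score per idea would not distinguish them.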

What seems to actually shift the frontier is structure and expertise rather than a sacrifice of one axis for the other. Cognitive diversity in multi-agent teams improves ideation quality, but *only* when members carry genuine domain expertise; diversity without expertise underperforms even a single competent agent [Does cognitive diversity alone improve multi-agent ideation quality?]. Likewise, allocating compute to diverse abstractions produces structured breadth instead of either shallow novelty or narrow depth [Can abstractions guide exploration better than depth alone?], and creativity researchers argue that combinational, exploratory, and transformational reasoning are distinct modes that current LLM reasoning methods leave unaddressed [Can LLMs reason creatively beyond conventional problem-solving?]. These point toward expanding the possibility space, not trading along a single line.

The quiet takeaway: the most promising route past the trade-off is decoupling *generation* from *evaluation*. LLMs generate novelty well but systematically avoid the evaluative stance that feasibility requires [Can LLMs generate more novel ideas than human experts?], and automated LLM evaluation overestimates idea quality by about 60% [Why do LLMs generate more novel research ideas than experts?]. But a structured pipeline that extracts claims, retrieves related work, and compares them reached 86.5% reasoning alignment with human reviewers on novelty [Can structured pipelines make LLM novelty assessment reliable?]. If you let a system roam freely for novelty and then apply a separate, well-built feasibility filter, you don't have to buy feasibility with less novelty: you generate widely, then prune. The trade-off is most binding when one model is forced to do both jobs at once.
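The generate-widely-then-prune idea can be sketched as a small pipeline in which the generator never scores its own output. This is a minimal illustration under stated assumptions: `generate_then_prune`, its parameters, and the toy scores are hypothetical, and the two scoring callables stand in for whatever separate feasibility reviewer and novelty assessor a real system would use.

```python
from typing import Callable, Iterable


def generate_then_prune(
    generate: Callable[[], Iterable[str]],
    score_feasibility: Callable[[str], float],
    score_novelty: Callable[[str], float],
    min_feasibility: float,
    top_k: int,
) -> list[str]:
    """Generate freely, filter on feasibility, then rank survivors by novelty.

    The generator is never asked to judge its own ideas; both scorers are
    separate components (hypothetical stand-ins in this sketch).
    """
    ideas = list(generate())
    feasible = [i for i in ideas if score_feasibility(i) >= min_feasibility]
    return sorted(feasible, key=score_novelty, reverse=True)[:top_k]


# Toy data: idea -> (novelty, feasibility), invented purely for illustration.
TOY_SCORES = {
    "A": (0.9, 0.2),  # very novel, fails the feasibility filter
    "B": (0.7, 0.8),
    "C": (0.3, 0.9),
    "D": (0.8, 0.6),
}

picked = generate_then_prune(
    generate=lambda: TOY_SCORES.keys(),
    score_feasibility=lambda idea: TOY_SCORES[idea][1],
    score_novelty=lambda idea: TOY_SCORES[idea][0],
    min_feasibility=0.5,
    top_k=2,
)
print(picked)  # ['D', 'B']
```

Here the feasibility filter acts as a threshold rather than a term traded against novelty in a single score, so the most novel surviving ideas are kept rather than a compromise between the two axes.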


Sources (11 notes)

Why do LLMs excel at feasible design but struggle with novelty?

Expert evaluation shows LLM-generated conceptual designs score higher on feasibility and usefulness but lower on novelty compared to crowdsourced human solutions. Few-shot learning further reduces diversity while improving quality alignment.

Do language models generate more novel research ideas than experts?

A study of 100+ NLP researchers found LLM-generated ideas rated as significantly more novel than human expert ideas (p<0.05), though slightly lower on feasibility. Expert knowledge constrains novelty, while LLMs explore wider conceptual combinations.

Why do LLMs generate more novel research ideas than experts?

Research shows LLM-generated ideas are statistically more novel than expert-produced ideas, but LLMs struggle to evaluate quality—automated evaluation overestimates by 60%. When executed, LLM ideas drop significantly on all metrics, suggesting novelty without feasibility.

Can LLMs generate more novel ideas than human experts?

LLMs produce more novel research ideas than experts because they lack disciplinary constraints, but they systematically avoid evaluative stance-taking required to assess feasibility or validity. Generation and evaluation are dissociated capabilities.

Do LLM research ideas actually hold up when experts try to execute them?

When 43 expert researchers implemented randomly-assigned ideas over 100+ hours, LLM-generated ideas declined significantly more than human ideas across all metrics. Execution revealed systematic weaknesses invisible at ideation, including impractical evaluation designs and missing technical groundwork.

Is the exploration-exploitation trade-off actually fundamental?

Hidden-state analysis using Effective Rank metrics shows near-zero correlation between exploration and exploitation, revealing the trade-off emerges only at token level. VERL demonstrates simultaneous enhancement achieving 21.4% accuracy gains on Gaokao 2024.

Why do LLMs generate novel ideas from narrow ranges?

LLM-generated research ideas are rated individually novel but lack diversity, clustering in narrow generative regions. Combined with LLM self-evaluation failures, this limits the possibility space explored compared to human ideation across different conceptual territories.

Does cognitive diversity alone improve multi-agent ideation quality?

Multi-agent teams substantially outperform solo ideation, but only when members possess genuine senior-level knowledge. Diverse teams without expertise underperform even a single competent agent, because cognitive stimulation without expertise triggers process losses instead of insight.

Can abstractions guide exploration better than depth alone?

RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.

Can LLMs reason creatively beyond conventional problem-solving?

Research identifies combinational, exploratory, and transformational reasoning as distinct creative modes grounded in cognitive science. Existing LLM reasoning methods address only conventional problem-solving, leaving creative paradigms unaddressed and potentially explaining diversity collapse in ideation.

Can structured pipelines make LLM novelty assessment reliable?

A three-stage pipeline (extract claims, retrieve related work, compare) reached 86.5% reasoning alignment and 75.3% conclusion agreement with human reviewers on 182 ICLR submissions, outperforming holistic LLM baselines.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about novelty–feasibility trade-offs in LLM idea generation. The question: *Do novelty and feasibility always trade off, or is the tension conditional and resolvable?*

What a curated library found — and when (findings span 2023–2025; treat as dated claims):
• LLMs generate feasible, useful solutions but low-novelty ones in conceptual design (2023); conversely, LLM research ideas score *higher* novelty than human experts but lower feasibility (2024)—the trade-off appears in both directions depending on task.
• Novelty without feasibility constraint is "cheap": LLMs explore wider combinations because they lack expert intuition for infeasibility; the cost surfaces only at execution—100+ hours of actual research revealed ideation-execution gaps (2025).
• The apparent trade-off may be an artifact of measurement: in the related exploration-exploitation setting, hidden-state analysis found near-zero correlation between the two, with the trade-off appearing only at the token level; a representation-targeting method improved both simultaneously (2025).
• Decoupling generation from evaluation looks like the most promising way past the trade-off: LLMs generate freely, then a structured novelty assessment (claim extraction, retrieval, comparison) reaches 86.5% reasoning alignment with human reviewers (2024–2025).
• Multi-agent teams improve ideation *only* when members carry genuine domain expertise; cognitive diversity without expertise underperforms a single competent agent (2025).

Anchor papers (verify; mind their dates):
- arXiv:2306.01779 (2023): Conceptual Design Generation Using LLMs
- arXiv:2409.04109 (2024): Can LLMs Generate Novel Research Ideas?
- arXiv:2506.20803 (2025): The Ideation-Execution Gap
- arXiv:2509.23808 (2025): Beyond Exploration-Exploitation Trade-off—Hidden State Approach

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For every finding above, assess whether newer models (o1-class reasoning, extended thinking), structured retrieval pipelines, or multi-agent orchestration (memory, expertise-gating, iterative refinement) have since relaxed or overturned the trade-off. Separate the durable question (whether novelty and feasibility *must* conflict) from perishable limitations (current measurement noise, single-model bottlenecks). Cite what resolved each, and plainly flag where constraints still hold.
(2) **Surface strongest CONTRADICTING or SUPERSEDING work** from the last ~6 months. Look for papers claiming the trade-off is fundamental, or that newest reasoning models *do* unify novelty + feasibility without pipeline tricks.
(3) **Propose 2 research questions** assuming the regime has moved: e.g., *What representation or training objective lets a single model generate ideas that are both novel and feasible without decoupling?* or *Does multi-agent disagreement on feasibility reveal hidden feasibility signals that improve novelty filtering?*

Cite arXiv IDs; flag anything you cannot ground in a real paper.
