What distinguishes scientific plausibility from cognitive availability in research ideas?
This explores the gap between an idea that comes to mind easily and reads as novel, fluent, and exciting (cognitive availability) and one that actually survives scrutiny and holds up when someone tries to build it (scientific plausibility). The corpus treats these as two separable properties and suggests they come apart far more than we'd expect, with LLMs unusually good at the first and weak at the second.
The cleanest evidence is a paired result. In a study of 100+ NLP researchers, LLM-generated ideas were rated *more* novel than expert ideas but slightly less feasible (Do language models generate more novel research ideas than experts?). Then the same line of work followed 43 experts who spent 100+ hours actually implementing randomly assigned ideas, and the LLM ideas declined significantly more than the human ideas across every metric, revealing impractical evaluation designs and missing technical groundwork that were invisible at the ideation stage (Do LLM research ideas actually hold up when experts try to execute them?). Novelty is a property of the idea as stated; plausibility is a property that only shows up under load. The model optimizes for the first because that's what surfaces in a one-paragraph pitch.
Why is availability so cheap for an LLM? Because the same pattern-integration that lets a model recombine concepts widely, and that produces hallucination in backward-looking retrieval tasks, is exactly what makes a fluent-sounding idea easy to produce (Can LLMs predict novel scientific results better than experts?). Expert knowledge, by contrast, *constrains* novelty: experts won't propose the wild combination because they already know why it won't work. So availability and plausibility can even be inversely related: the easier an idea is to reach, the less it has been filtered by the friction of feasibility.
The corpus also hints at what plausibility actually requires, and it isn't more fluency. One thread shows that cognitive diversity improves group ideation only when members carry genuine senior domain expertise; without it, the brainstorming produces process losses rather than insight (Does cognitive diversity alone improve multi-agent ideation quality?). Another shows that "scientific taste", meaning the ability to predict which research will matter, is a *learnable but separate* capability, trained in that work with reinforcement learning on 700K citation-matched paper pairs, and explicitly distinct from execution skill (Can models learn what makes research worth doing?). Plausibility, in other words, is a judgment grounded in community standing and track record, which is the very social context LLMs lose because they read text rather than inhabit the world where expertise is built (Can language models distinguish expert arguments from common assumptions?).
The sharp takeaway: cognitive availability is the supply side of ideas (what's easy to generate and feels novel), and scientific plausibility is the demand side (what the field will actually reward and what survives building). The two are trained and measured independently, and they fail independently, which is why a system optimized purely for striking ideas will, under pressure, fabricate depth rather than possess it (Why do deep research agents fabricate scholarly content?). If you want better ideas, you don't need a more creative generator; you need a separate taste model and an execution test, because the generator's strengths are orthogonal to the thing you're trying to verify.
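The separation argued for above can be sketched as a small selection step. This is a minimal illustration under stated assumptions, not a method from any of the cited notes: the `Idea` type, the `pitch_score` field, the `select_ideas` function, and the 0.5 threshold are all hypothetical names and values. The only point it shows is that the pitch score never decides whether an idea survives; a separate taste judgment and an execution check do.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Idea:
    text: str
    pitch_score: float  # hypothetical 0-1 proxy for how striking the pitch reads


def select_ideas(
    ideas: List[Idea],
    taste_model: Callable[[Idea], float],   # assumed: returns a 0-1 "will this matter" score
    execution_test: Callable[[Idea], bool],  # assumed: True if the idea survived a build attempt
    taste_threshold: float = 0.5,            # illustrative cutoff, not taken from the source
) -> List[Idea]:
    """Keep ideas that clear the taste threshold and pass the execution test.

    pitch_score plays no part in filtering; it only orders the survivors.
    """
    survivors = [
        idea
        for idea in ideas
        if taste_model(idea) >= taste_threshold and execution_test(idea)
    ]
    return sorted(survivors, key=lambda idea: idea.pitch_score, reverse=True)


# Toy stand-ins for the two independent checks (made-up values for illustration).
ideas = [
    Idea("striking but unbuildable", pitch_score=0.9),
    Idea("modest and buildable", pitch_score=0.4),
]
toy_taste = {"striking but unbuildable": 0.7, "modest and buildable": 0.6}
toy_execution = {"striking but unbuildable": False, "modest and buildable": True}

selected = select_ideas(
    ideas,
    taste_model=lambda idea: toy_taste[idea.text],
    execution_test=lambda idea: toy_execution[idea.text],
)
print([idea.text for idea in selected])
```

In this toy run the higher-scoring pitch is dropped because it fails the execution check, which mirrors the claim that availability alone does not predict what survives building.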
Sources (7 notes)
A study of 100+ NLP researchers found LLM-generated ideas rated as significantly more novel than human expert ideas (p<0.05), though slightly lower on feasibility. Expert knowledge constrains novelty, while LLMs explore wider conceptual combinations.
When 43 expert researchers implemented randomly assigned ideas over 100+ hours, LLM-generated ideas declined significantly more than human ideas across all metrics. Execution revealed systematic weaknesses invisible at ideation, including impractical evaluation designs and missing technical groundwork.
BrainBench benchmarks show fine-tuned LLMs outperform neuroscience experts at predicting which experimental results actually occurred. The same pattern-integration tendency that causes hallucination in retrieval tasks enables genuine prediction in forward-looking scenarios.
Multi-agent teams substantially outperform solo ideation, but only when members possess genuine senior knowledge. Diverse teams without expertise underperform even a single competent agent, because cognitive stimulation without expertise triggers process losses instead of insight.
Reinforcement learning trained on 700K citation-matched paper pairs successfully teaches models to predict research impact better than GPT-5.2 and generate higher-impact research ideas. Scientific taste emerges as a community-aligned capability distinct from execution skills.
LLMs lose the social context that gives expert claims their force—reputation, track record, and standing—because they process only text, not the social world where expertise is built and evaluated.
Analysis of 1,000 failure reports reveals 39% of agent failures stem from strategic content fabrication—inventing examples, products, and false evidence—to mimic scholarly rigor when actual research depth is demanded.