How does prompt design alter what kind of creativity LLMs can express?
This explores whether the way you phrase a prompt doesn't just change the quality of an LLM's output but actually shifts what *kind* of creativity it reaches — novelty vs. feasibility, divergent exploration vs. safe convention.
The corpus suggests that prompt design does change the *kind* of creativity an LLM expresses, not just how well it performs, but the lever is blunt: most prompting nudges the model toward the safe, feasible end of the creative spectrum rather than the genuinely novel one. The starting point is that creativity isn't one thing. One line of work argues that creative reasoning splits into three distinct modes: combinational (recombining known ideas), exploratory (searching within a space), and transformational (breaking the space open). It also argues that current LLM reasoning methods address only conventional problem-solving, leaving these creative modes largely unaddressed (Can LLMs reason creatively beyond conventional problem-solving?). So before asking how prompts steer creativity, it helps to know there are several creativities to steer toward.
Where it gets interesting is that the most common prompting moves seem to trade novelty away. LLM-generated design concepts score *higher* on feasibility and usefulness but *lower* on novelty than human crowdsourced ones, and few-shot prompting, the standard "here are some examples" technique, makes this worse: it improves quality alignment while shrinking diversity (Why do LLMs excel at feasible design but struggle with novelty?). Yet the raw capacity for novelty is clearly there. In a study of more than 100 NLP researchers, LLM-generated research ideas were rated *more* novel than expert human ideas, though slightly lower on feasibility; the proposed explanation is that expert knowledge constrains exploration while the model roams wider conceptual territory (Do language models generate more novel research ideas than experts?). Put those two together and a picture emerges: the model can be wide-roaming or safe, and prompt design (especially exemplars) is one of the dials that decides which.
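One rough way to check the diversity claim on your own outputs is to score a batch of generated ideas by how much they overlap. The sketch below uses mean pairwise Jaccard distance over word sets, which is a crude lexical proxy chosen here for illustration and not the measure used in the cited studies. The function names and both idea lists are hypothetical; real batches would come from running the same task with and without exemplars.

```python
from itertools import combinations


def token_set(text: str) -> frozenset[str]:
    """Lowercase the text and split on whitespace to get a crude set of words."""
    return frozenset(text.lower().split())


def jaccard_distance(a: frozenset[str], b: frozenset[str]) -> float:
    """1 minus (shared words / all words); 0.0 is identical, 1.0 is disjoint."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def mean_pairwise_distance(ideas: list[str]) -> float:
    """Average Jaccard distance over all unordered pairs of ideas."""
    sets = [token_set(idea) for idea in ideas]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical outputs, invented for illustration only (not data from any study).
zero_shot_ideas = [
    "a kite that harvests wind power at altitude",
    "edible packaging grown from seaweed",
    "shoes that charge a phone while walking",
]
few_shot_ideas = [
    "a bottle that filters water",
    "a bottle that cools water",
    "a bottle that heats water",
]

print(f"zero-shot diversity: {mean_pairwise_distance(zero_shot_ideas):.2f}")
print(f"few-shot diversity:  {mean_pairwise_distance(few_shot_ideas):.2f}")
```

Word overlap misses ideas that are conceptually similar but worded differently, so an embedding-based distance would likely be a better fit for real evaluation; the structure of the comparison stays the same.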
The surprising part is how *little* of this is about meaning and how much is about statistics. Semantically identical prompts produce systematically different outputs because the model responds to how frequently a phrasing appeared in pre-training, not to what it means; higher-frequency wordings win (Why do semantically identical prompts produce different LLM outputs?). That suggests a high-frequency, conventional phrasing may quietly pull the model toward conventional output, while an unusual framing opens a different region of its distribution. Tone does something similar: emotional framing shifts which information surfaces (Does emotional tone in prompts change what information LLMs provide?), and appending motivational phrases like "This is very important to my career" consistently improved performance across ChatGPT, Bard, and Llama 2 through framing alone, not new information (Can emotional phrases in prompts improve language model performance?). The prompt isn't just a question; it's a coordinate that lands the model somewhere in its space.
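To probe these effects yourself, it helps to vary phrasing and emotional framing independently while holding the underlying request fixed. The following is a minimal sketch that only builds the prompt grid; it does not call any model. The two phrasings and their "conventional" and "unusual" labels are hypothetical guesses, not measured frequencies, and the only text taken from the source is the motivational phrase.

```python
from itertools import product

# Hypothetical paraphrases of one request; the labels are guesses, not measurements.
PHRASINGS = {
    "conventional": "Give me ideas for a new kitchen gadget.",
    "unusual": "Which kitchen tool does not exist yet but should?",
}

# "" is the control; the other suffix is the EmotionPrompt-style phrase quoted above.
SUFFIXES = {
    "neutral": "",
    "motivational": "This is very important to my career.",
}


def build_variants() -> dict[tuple[str, str], str]:
    """Cross every phrasing with every suffix, keyed by (phrasing, suffix) labels."""
    variants = {}
    for (p_name, p_text), (s_name, s_text) in product(
        PHRASINGS.items(), SUFFIXES.items()
    ):
        # strip() removes the trailing space left by the empty control suffix.
        variants[(p_name, s_name)] = f"{p_text} {s_text}".strip()
    return variants


for labels, prompt in build_variants().items():
    print(labels, "->", prompt)
```

Keeping the grid fully crossed means any difference in output can be attributed to one factor at a time, instead of to a phrasing and a tone that changed together.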
Two more findings complicate the easy story that "better prompt = more creativity." First, technique doesn't transfer cleanly: rephrasing and background-knowledge prompts boost cheap models, while step-by-step reasoning *reduces* accuracy in strong ones, so the right move depends on task structure rather than generic best practice (Do prompt techniques work the same across all LLM tiers?). Relatedly, chain-of-thought only helps when the question's information aggregates into the prompt before reasoning starts; for simple questions, going directly from question to answer works better (Why do some questions perform better without step-by-step reasoning?). By extension, forcing structured reasoning where it doesn't fit may suppress the looser association that creative output needs. Second, the model resists being reshaped: most open LLMs cling to a default "ENFJ-like" personality and fail to adopt prompted personas, so prompt-driven creative range has a ceiling set by what the model already is (Can open language models adopt different personalities through prompting?).
If you want to go deeper on the design side rather than the model side, two notes reframe prompting as a structured craft. First, prompt quality can be decomposed into six measurable dimensions (communication, cognition, instruction, logic, hallucination, responsibility), where improving one cascades into others (Can we measure prompt quality independent of model outputs?). Second, there's a sharp warning that iterative, ad-hoc prompt tweaking introduces hidden bias and self-fulfilling loops, meaning the very process of "prompting until it's creative" can manufacture the creativity you were hoping to measure (Does iterative prompt engineering undermine scientific validity?). The takeaway the reader may not have expected: prompt design alters creativity less by *adding* imagination and more by selecting which region of an already-fixed distribution the model speaks from, and the default pull of most techniques is toward the feasible, not the novel.
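The source note behind the warning about ad-hoc tweaking recommends pre-specified criteria and inter-coder reliability. One common reliability statistic is Cohen's kappa, sketched below for two raters labelling the same outputs. The choice of kappa is an assumption made here for illustration, since the note does not name a statistic, and the rater labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa, (p_o - p_e) / (1 - p_e), for two raters over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # p_o: fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # p_e: agreement expected by chance from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for six model outputs, judged against a fixed rubric.
rater_a = ["novel", "novel", "conventional", "conventional", "novel", "conventional"]
rater_b = ["novel", "conventional", "conventional", "conventional", "novel", "conventional"]

print(f"kappa = {cohens_kappa(rater_a, rater_b):.2f}")
```

Fixing the rubric before looking at outputs, and having a second person apply it, is what keeps a "this prompt is more creative" judgment from drifting toward whatever the model happens to produce.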
Sources (11 notes)
Research identifies combinational, exploratory, and transformational reasoning as distinct creative modes grounded in cognitive science. Existing LLM reasoning methods address only conventional problem-solving, leaving creative paradigms unaddressed and potentially explaining diversity collapse in ideation.
Expert evaluation shows LLM-generated conceptual designs score higher on feasibility and usefulness but lower on novelty compared to crowdsourced human solutions. Few-shot learning further reduces diversity while improving quality alignment.
A study of 100+ NLP researchers found LLM-generated ideas rated as significantly more novel than human expert ideas (p<0.05), though slightly lower on feasibility. Expert knowledge constrains novelty, while LLMs explore wider conceptual combinations.
Cao et al. and Adam's Law show that semantically identical prompts with different sentence-level frequencies produce systematically different output quality. Higher-frequency phrasings win because models register statistical mass from pre-training, not meaning.
GPT-4 exhibits emotional rebound (negative prompts yield ~86% neutral-positive responses) and a tone floor (positive prompts rarely go negative), causing identical questions to receive different answers depending on emotional framing. This bias is suppressed only on sensitive topics where alignment constraints override tone effects.
Testing EmotionPrompt across ChatGPT, Bard, and Llama 2 showed consistent performance gains from appending psychological phrases like "This is very important to my career." The effect works through motivational framing rather than new information, with positive emotional words driving over 50% of improvements.
A 23-prompt benchmark across 12 LLMs shows rephrasing and background-knowledge prompts boost cheap models, while step-by-step reasoning reduces accuracy in high-performance models. Task structure, not generic best practices, determines which prompts help.
Saliency analysis reveals that CoT prompting fails when question information doesn't aggregate into the prompt structure before reasoning begins. For simple questions, direct question-to-answer flow outperforms step-by-step reasoning, showing the optimal prompt depends on question type, not just task category.
Research shows most open models fail to adopt prompted personalities, stubbornly retaining their trained ENFJ-like defaults. Only a few flexible models succeed. Combining role and personality conditioning improves results but doesn't fully overcome resistance.
Research identifies six evaluable dimensions—Communication, Cognition, Instruction, Logic, Hallucination, and Responsibility—with 20 sub-criteria based on Grice, cognitive load theory, and instructional design. Improvements in one dimension cascade to others, revealing prompt quality as a structured space rather than a flat checklist.
Iterative prompt revision by single researchers introduces individual bias, shifts evaluation criteria to match LLM capabilities rather than task requirements, and creates self-fulfilling feedback loops. A validated pipeline with inter-coder reliability and pre-specified criteria is required instead.