Why do practitioners default to prompting without recognizing its limits?
This explores why prompting feels like a universal control knob even though the corpus shows it has hard ceilings, hidden biases, and conditions where more prompting actively hurts.
This reads the question as: what makes prompting so seductive that we treat it as the answer to everything, while overlooking where it can't reach? The corpus suggests the default isn't laziness; it's that prompting's limits are mostly invisible from inside the prompt. The most fundamental boundary is that prompting only reorganizes what a model already knows; it cannot add knowledge that was never in training (see "Can prompt optimization teach models knowledge they lack?"). A practitioner tweaking wording sees outputs change and reads that as progress, never realizing they're hitting a ceiling no phrasing can lift. The feedback loop rewards the behavior precisely because the model is so responsive: responsiveness masks the absence of a real fix.
That masking deepens because prompting *feels* like communication but isn't. A prompt collapses utterance, context, and role into a single static frame the model can't renegotiate, unlike a human conversation where shared context builds cooperatively over turns (see "How do prompts reshape the role of context in AI conversation?"). So when something goes wrong, the instinct is to rewrite the frame rather than question whether one-shot framing was ever the right tool. And the practice itself slides into bias: iterative prompt revision by a single person quietly shifts the evaluation target to match whatever the model can already do, producing self-fulfilling loops that look like success (see "Does iterative prompt engineering undermine scientific validity?").
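The source note behind that last point asks for pre-specified criteria and inter-coder reliability instead of a single reviser's judgment. One common agreement statistic is Cohen's kappa; the sketch below is a minimal illustration, not the validated pipeline the note describes. The function name, the pass/fail labels, and the two coders' judgments are all made up for the example.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability of a match if each rater labeled
    # independently according to their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical judgments from two independent coders on six model outputs,
# scored against criteria fixed before any prompt was revised.
coder_1 = ["pass", "pass", "fail", "pass", "fail", "fail"]
coder_2 = ["pass", "fail", "fail", "pass", "fail", "pass"]
kappa = cohens_kappa(coder_1, coder_2)
```

A kappa near 1 means the coders agree far more than chance would predict; a value near 0 suggests the criteria are being read differently by each person, which is the kind of drift a lone prompt reviser cannot see.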
The limits also turn out to be conditional in ways no "best practice" captures. Step-by-step reasoning helps some questions and hurts others, depending on whether the question's information has been aggregated into the prompt before reasoning starts (see "Why do some questions perform better without step-by-step reasoning?"). Techniques do not transfer across model tiers either: rephrasing and background-knowledge prompts boost cheap models, while step-by-step reasoning can *reduce* accuracy on high-end ones (see "Do prompt techniques work the same across all LLM tiers?"). A practitioner who found a prompt that worked once generalizes it as a rule, but the rule was always local to the model tier and task structure.
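One way to keep a prompt rule from being over-generalized is to score every technique against every model on the same items and read the result per model. The sketch below assumes models are plain callables from prompt string to answer string. The two stub models, the two-item dataset, and all names are hypothetical stand-ins, hard-coded only so the harness runs; their scores say nothing about real systems.

```python
def accuracy_by_model_and_technique(models, techniques, dataset):
    """Return {(model_name, technique_name): accuracy} over the dataset."""
    scores = {}
    for model_name, model in models.items():
        for tech_name, wrap in techniques.items():
            correct = sum(model(wrap(q)).strip() == gold for q, gold in dataset)
            scores[(model_name, tech_name)] = correct / len(dataset)
    return scores


def best_technique_per_model(scores):
    """Pick the highest-scoring technique separately for each model."""
    best = {}
    for (model_name, tech_name), acc in scores.items():
        if model_name not in best or acc > best[model_name][1]:
            best[model_name] = (tech_name, acc)
    return best


# Toy stand-ins (assumptions for illustration, not real model behavior).
ANSWERS = {"2+2": "4", "3+3": "6"}


def stub_small(prompt):
    question = prompt.splitlines()[0]
    return ANSWERS[question] if "step by step" in prompt else "?"


def stub_large(prompt):
    question = prompt.splitlines()[0]
    return "?" if "step by step" in prompt else ANSWERS[question]


techniques = {
    "direct": lambda q: q,
    "step_by_step": lambda q: q + "\nLet's think step by step.",
}
models = {"small": stub_small, "large": stub_large}
dataset = list(ANSWERS.items())

scores = accuracy_by_model_and_technique(models, techniques, dataset)
best = best_technique_per_model(scores)
```

The design choice worth noting is that `best` is keyed by model: the harness never reports a single winning technique, because the text above argues there may not be one.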
What's genuinely unsettling is how much the prompt smuggles in unnoticed. Emotional tone alone shifts what information a model surfaces: negatively phrased prompts rebound into neutral-to-positive answers, so identical questions get different facts depending on their emotional framing (see "Does emotional tone in prompts change what information LLMs provide?"). And whether a prompt is even robust depends on the model's confidence, not the prompt's quality: low-confidence models swing wildly under rephrasing while you assume your wording caused the change (see "Does model confidence predict robustness to prompt changes?"). The lever you think you're pulling is partly an illusion of control.
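A cheap check on that illusion is to ask the same question several ways before crediting any one wording. The sketch below computes how often paraphrases land on the most common answer. It is a rough proxy chosen for illustration, not the metric used in the cited work, and the canned answers stand in for a real model call.

```python
from collections import Counter


def paraphrase_consistency(model, paraphrases):
    """Return (modal answer, fraction of paraphrases that produced it)."""
    if not paraphrases:
        raise ValueError("need at least one paraphrase")
    # Normalize case and whitespace so trivial formatting differences don't count.
    answers = [model(p).strip().lower() for p in paraphrases]
    modal_answer, modal_count = Counter(answers).most_common(1)[0]
    return modal_answer, modal_count / len(answers)


# Hypothetical canned responses standing in for a model (an assumption).
canned = {
    "What is the capital of Australia?": "Canberra",
    "Australia's capital city is which?": "canberra",
    "Name the capital of Australia.": "Sydney",
    "Which city is Australia's capital?": "Canberra ",
}
answer, rate = paraphrase_consistency(canned.get, list(canned))
```

If the rate is low, a change in output after a rewording is weak evidence that the rewording caused it; the answer was unstable to begin with.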
The corpus points two ways out. One is to make the invisible measurable: prompt quality decomposes into six gradeable dimensions grounded in communication theory rather than vibes (see "Can we measure prompt quality independent of model outputs?"), and structured argument scaffolds force a model to check its warrants instead of skipping premises (see "Can structured argument prompts make LLM reasoning more rigorous?"). The other is to recognize when the ceiling is in training itself, not the prompt, at which point the real fix lives in how the model was rewarded, like training for long-horizon collaboration rather than next-turn helpfulness (see "Why do language models respond passively instead of asking clarifying questions?"). The thing worth knowing: prompting defaults aren't a skill gap, they're a visibility gap. The practice hides its own boundaries behind a model that always answers.
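To make the argument-scaffold idea concrete, here is a minimal sketch of a prompt builder that walks through the six standard elements of Toulmin's model (claim, grounds, warrant, backing, qualifier, rebuttal). The step wording, the function name, and the example question are assumptions for illustration; this is not the exact procedure of the cited method.

```python
# The six elements are Toulmin's; the instruction text for each is illustrative.
TOULMIN_STEPS = [
    ("Claim", "State the conclusion you are arguing for."),
    ("Grounds", "List the facts or evidence the claim rests on."),
    ("Warrant", "Explain why the grounds support the claim."),
    ("Backing", "Give support for the warrant itself."),
    ("Qualifier", "Say how certain the claim is and under what conditions."),
    ("Rebuttal", "Name the strongest exception or counterargument."),
]


def build_toulmin_prompt(question):
    """Wrap a question in an explicit, numbered argument scaffold."""
    lines = [f"Question: {question}", "", "Answer by completing every step in order:"]
    for i, (name, instruction) in enumerate(TOULMIN_STEPS, start=1):
        lines.append(f"{i}. {name}: {instruction}")
    lines.append("")
    lines.append("Final answer: give it only after all six steps are filled in.")
    return "\n".join(lines)


prompt = build_toulmin_prompt("Should we cache this API response?")
```

Naming the warrant and backing as separate steps is the point: a free-form "think step by step" lets the model skip the premise that links evidence to conclusion, while a numbered slot makes the omission visible.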
Sources (10 notes)
Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.
LLM prompts bundle utterance, context assignment, and role specification into a single static frame the model cannot renegotiate, unlike human dialogue where context evolves cooperatively. This makes mid-conversation pivots require explicit re-prompting rather than implicit adjustment.
Iterative prompt revision by single researchers introduces individual bias, shifts evaluation criteria to match LLM capabilities rather than task requirements, and creates self-fulfilling feedback loops. A validated pipeline with inter-coder reliability and pre-specified criteria is required instead.
Saliency analysis reveals that CoT prompting fails when question information doesn't aggregate into the prompt structure before reasoning begins. For simple questions, direct question-to-answer flow outperforms step-by-step reasoning, showing the optimal prompt depends on question type, not just task category.
A 23-prompt benchmark across 12 LLMs shows rephrasing and background-knowledge prompts boost cheap models, while step-by-step reasoning reduces accuracy in high-performance models. Task structure, not generic best practices, determines which prompts help.
GPT-4 exhibits emotional rebound (negative prompts yield ~86% neutral-positive responses) and a tone floor (positive prompts rarely go negative), causing identical questions to receive different answers depending on emotional framing. This bias is suppressed only on sensitive topics where alignment constraints override tone effects.
ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.
Research identifies six evaluable dimensions—Communication, Cognition, Instruction, Logic, Hallucination, and Responsibility—with 20 sub-criteria based on Grice, cognitive load theory, and instructional design. Improvements in one dimension cascade to others, revealing prompt quality as a structured space rather than a flat checklist.
Applying Toulmin's argument model as explicit prompting steps (CQoT) improves LLM reasoning by forcing models to identify warrants and backing rather than skipping implicit premises. The method catches failures that standard chain-of-thought prompting allows.
CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.