INQUIRING LINE

What other pragmatic prompt features have unstable effects?

This explores which prompt features beyond the obvious wording — tone, phrasing, persona, reasoning style — produce unpredictable effects that flip or swing depending on the model.


Many prompt features beyond literal wording (tone, phrasing, persona, reasoning steps) behave unstably: the same move helps in one setting and hurts in another. The corpus suggests that instability is the rule, not the exception, and that it traces back to a single fact: models respond to statistical patterns in their training data, not to the meaning or social intent you think you're encoding.

Start with politeness, the cleanest example. On GPT-4o, rude prompts beat polite ones, contradicting earlier findings on GPT-3.5 (see "Does prompt politeness change how accurate language models are?"). The effect didn't just weaken; it flipped direction across model generations, which means tone is not a stable design principle. The same pattern shows up with reasoning style: step-by-step (chain-of-thought, CoT) prompting boosts cheap models but actively *reduces* accuracy in high-performance ones (see "Do prompt techniques work the same across all LLM tiers?"). Even within one model, CoT helps only when the question's information aggregates into the prompt structure before reasoning begins; for simple questions, asking directly beats asking the model to think step by step (see "Why do some questions perform better without step-by-step reasoning?").
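One way to check whether a prompt feature is stable is to compare the sign of its effect across models. The sketch below is illustrative only: `effect_direction` and the `older-model` row are hypothetical, and only the GPT-4o accuracies (84.8% Very Rude, 80.8% Very Polite) come from the politeness note listed in the sources.

```python
def effect_direction(acc_with, acc_without):
    """Return +1 if the feature helps, -1 if it hurts, 0 if it makes no difference."""
    delta = acc_with - acc_without
    return (delta > 0) - (delta < 0)

# Accuracy (%) per model as (rude prompt, polite prompt).
results = {
    "gpt-4o": (84.8, 80.8),       # figures from the politeness source note
    "older-model": (70.0, 73.0),  # hypothetical placeholder, not a measurement
}

# Direction of the "rude tone" effect on each model.
directions = {model: effect_direction(rude, polite)
              for model, (rude, polite) in results.items()}

# A feature counts as stable here only if its effect has the same sign everywhere.
stable = len(set(directions.values())) == 1
print(directions, "stable:", stable)
```

With these inputs the two models disagree on the sign, so the feature is flagged as unstable. Checking the sign rather than the size of the effect is a deliberate simplification; a real evaluation would also need confidence intervals.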

The most unsettling case is paraphrasing. Two prompts that mean exactly the same thing produce systematically different output quality, not because of meaning, but because one phrasing appears more frequently in pre-training data (see "Why do semantically identical prompts produce different LLM outputs?"). So even "say it more clearly" is an unstable lever: the model responds to corpus frequency, not clarity. Persona prompts fail for a related reason. Run the same persona repeatedly and the output varies as much across runs as it does across *different* personas, because model uncertainty drowns out whatever social knowledge you're trying to invoke (see "Why do LLM persona prompts produce inconsistent outputs across runs?").
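The persona claim can be checked by splitting score variance into a run-to-run part and a persona-to-persona part. The following is a minimal sketch under stated assumptions: `variance_split` is a hypothetical helper, the ratings are invented for illustration, and a simple comparison of the two variances stands in for whatever statistical test the original study used.

```python
from statistics import mean, pvariance

def variance_split(scores_by_persona):
    """Return (within, between) variance for repeated runs grouped by persona.

    scores_by_persona maps a persona label to the numeric scores obtained
    from repeated runs of that same persona prompt.
    """
    # Run-to-run noise: average variance of repeated runs under one persona.
    within = mean(pvariance(runs) for runs in scores_by_persona.values())
    # Persona signal: variance of the per-persona mean scores.
    between = pvariance([mean(runs) for runs in scores_by_persona.values()])
    return within, between

# Hypothetical ratings (1-5 scale), four repeated runs per persona.
scores = {
    "teacher": [3, 5, 2, 4],
    "engineer": [4, 2, 5, 3],
    "nurse": [4, 5, 3, 4],
}

within, between = variance_split(scores)
persona_signal_dominates = between > within
print(f"within={within:.3f} between={between:.3f}")
```

If the within-persona variance matches or exceeds the between-persona variance, as it does with these made-up numbers, the persona label explains little of what the model outputs.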

There's a unifying signal underneath all of this. Prompt sensitivity tracks model confidence: when a model is confident, it shrugs off rephrasing; when it's uncertain, small wording changes cause large output swings (see "Does model confidence predict robustness to prompt changes?"). Larger models, few-shot examples, and objective tasks all correlate with higher confidence and therefore stability, which reframes "unstable prompt features" as a symptom of low-confidence regions rather than a property of the feature itself. It also explains why vague prompts collapse into bland, blended answers: the model falls back on training-data priors when you haven't given it enough contextual scaffolding (see "Why do large language models produce generic responses to vague queries?").
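The link between confidence and sensitivity can be made concrete by measuring how often a model's answer changes across paraphrases of the same question. This is a sketch, not the metric used in the cited work: `sensitivity`, the 0.5 `THRESHOLD`, and all records are assumptions chosen for illustration.

```python
from collections import Counter
from statistics import mean

def sensitivity(answers):
    """Share of paraphrases whose answer differs from the most common answer."""
    most_common_count = Counter(answers).most_common(1)[0][1]
    return 1 - most_common_count / len(answers)

# Hypothetical records: (model confidence, answers across four paraphrases).
records = [
    (0.95, ["B", "B", "B", "B"]),
    (0.90, ["A", "A", "A", "C"]),
    (0.40, ["A", "C", "D", "C"]),
    (0.35, ["B", "D", "A", "C"]),
]

THRESHOLD = 0.5  # assumed cut-off between "confident" and "uncertain"
confident = [sensitivity(ans) for conf, ans in records if conf >= THRESHOLD]
uncertain = [sensitivity(ans) for conf, ans in records if conf < THRESHOLD]

# A positive gap means uncertain questions swing more under rephrasing.
gap = mean(uncertain) - mean(confident)
print(f"confident={mean(confident):.3f} uncertain={mean(uncertain):.3f} gap={gap:.3f}")
```

Grouping by a confidence threshold keeps the example short; with real data, a rank correlation between confidence and sensitivity would use the information more fully.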

The practical takeaway runs against the whole genre of "prompt best practices." Rather than chasing tone tricks or universal phrasings, the corpus points toward features that are stable because they're *structural*. Prompt quality has six evaluable dimensions grounded in communication theory, cognitive load theory, and instructional design, and improving one cascades to others (see "Can we measure prompt quality independent of model outputs?"). Forcing explicit argument structure, such as checking warrants and backing, improves reasoning where free-form chain-of-thought skips implicit premises (see "Can structured argument prompts make LLM reasoning more rigorous?"). The reliable levers are the ones that add genuine information or constraint; the unstable ones are the ones that merely nudge surface form and hope the statistics break your way.


Sources (9 notes)

Does prompt politeness change how accurate language models are?

Testing 250 tone variants on ChatGPT-4o showed accuracy rising from 80.8% (Very Polite) to 84.8% (Very Rude), contradicting prior findings on GPT-3.5. The directional flip suggests tone effects are model-generation-dependent, not stable design principles.

Do prompt techniques work the same across all LLM tiers?

A 23-prompt benchmark across 12 LLMs shows rephrasing and background-knowledge prompts boost cheap models, while step-by-step reasoning reduces accuracy in high-performance models. Task structure, not generic best practices, determines which prompts help.

Why do some questions perform better without step-by-step reasoning?

Saliency analysis reveals that CoT prompting fails when question information doesn't aggregate into the prompt structure before reasoning begins. For simple questions, direct question-to-answer flow outperforms step-by-step reasoning, showing the optimal prompt depends on question type, not just task category.

Why do semantically identical prompts produce different LLM outputs?

Cao et al. and Adam's Law show that semantically identical prompts with different sentence-level frequencies produce systematically different output quality. Higher-frequency phrasings win because models register statistical mass from pre-training, not meaning.

Why do LLM persona prompts produce inconsistent outputs across runs?

When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.

Does model confidence predict robustness to prompt changes?

ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.

Why do large language models produce generic responses to vague queries?

Unlike social-media context collapse, which flattens multiple audiences, LLM collapse occurs when users provide insufficient contextual scaffolding and models default to blended training-data priors. This distinction suggests remedies should focus on query verification and user-driven context specification rather than platform controls.

Can we measure prompt quality independent of model outputs?

Research identifies six evaluable dimensions—Communication, Cognition, Instruction, Logic, Hallucination, and Responsibility—with 20 sub-criteria based on Grice, cognitive load theory, and instructional design. Improvements in one dimension cascade to others, revealing prompt quality as a structured space rather than a flat checklist.

Can structured argument prompts make LLM reasoning more rigorous?

Applying Toulmin's argument model as explicit prompting steps (CQoT) improves LLM reasoning by forcing models to identify warrants and backing rather than skipping implicit premises. The method catches failures that standard chain-of-thought prompting allows.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing prompt-engineering constraints. The question: Which prompt features (tone, phrasing, persona, reasoning style) have *unstable* effects — helping in one context but hurting in another?

What a curated library found — and when (dated claims, not current truth):
Findings span 2022–2026. Key constraints from the path:
• Politeness flipped direction across model generations; rude prompts now beat polite ones on GPT-4o (2025-10, arXiv:2510.04950).
• Chain-of-thought reasoning *reduces* accuracy in high-performance models while boosting cheaper ones; stability depends on question structure and information flow (2023-07, 2024-12).
• Paraphrase-equivalent prompts yield different outputs because model response correlates with pre-training frequency, not semantic meaning (2026-04, arXiv:2604.02176).
• Persona prompts show output variance across runs equal to variance across *different* personas, drowning out intended social signal (2025-10, arXiv:2511.00222).
• Prompt sensitivity tracks model confidence: uncertain models are sensitive; confident ones are robust (2025-06, arXiv:2506.06950).

Anchor papers (verify; mind their dates):
• arXiv:2510.04950 (2025-10) — tone instability across model tiers.
• arXiv:2604.02176 (2026-04) — frequency-driven prompt response.
• arXiv:2412.15177 (2024-12) — argumentation structure vs. free-form reasoning.
• arXiv:2506.06950 (2025-06) — six-dimension prompt-quality framework.

Your task:
(1) RE-TEST EACH CONSTRAINT. For politeness-flip, tone-instability, CoT-reversal, paraphrase-sensitivity, and persona-variance: has newer training (RLHF, synthetic data, Constitutional AI), scaling, or evaluation methodology since relaxed or overturned these findings? Are these still genuine instabilities or artifacts of evaluation setup? Where does confidence-as-stabilizer still hold?
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months—particularly any showing prompt features *do* generalize robustly, or explaining why instability itself varies by task family.
(3) Propose 2 research questions that ASSUME the instability regime may have shifted: (a) Does instruction-tuning on consistency-valued data reduce prompt sensitivity? (b) Do multi-agent orchestration or retrieval-augmented prompting trade instability for structural robustness?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
