Does prompt politeness change how accurate language models are?
Earlier research suggested rude prompts hurt LLM accuracy, but newer models show the opposite pattern. This raises questions about whether tone effects are real and reliable enough to guide prompting strategies.
Prompt wording shifts LLM performance, but the role of politeness and tone has been under-studied, and the findings so far are unstable. This short study rewrote 50 multiple-choice questions (math, science, history) into five tone variants (Very Polite, Polite, Neutral, Rude, Very Rude), yielding 250 prompts, and evaluated ChatGPT-4o, comparing tones with paired t-tests. Contrary to expectation, impolite prompts consistently outperformed polite ones: accuracy rose from 80.8% (Very Polite) to 84.8% (Very Rude), a gap of 4.0 percentage points that the study reports as statistically significant.
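As a rough illustration of the design described above, the sketch below builds five tone variants per question and computes a paired t statistic between two tones. Everything concrete in it is an assumption made for illustration: the tone prefixes are hypothetical and are not the study's wording, the per-run accuracies are synthetic numbers and not the study's data, and pairing by run is only one plausible reading of "paired t-tests" (with 50 questions, a single run can only score a multiple of 2%, so a figure like 80.8% suggests averaging over repeated runs).

```python
import math
import statistics

TONES = ["Very Polite", "Polite", "Neutral", "Rude", "Very Rude"]

# Hypothetical tone prefixes, NOT the wording used in the study.
PREFIXES = {
    "Very Polite": "Would you be so kind as to answer the following question? ",
    "Polite": "Please answer the following question. ",
    "Neutral": "",
    "Rude": "Figure this out, if you can manage it. ",
    "Very Rude": "Even you should be able to get this one right. ",
}


def make_variants(question):
    """Return one prompt per tone for a single base question."""
    return {tone: PREFIXES[tone] + question for tone in TONES}


def paired_t(a, b):
    """Paired t statistic for two equal-length samples.

    Returns (t, degrees_of_freedom). A p-value would need a t-distribution
    CDF, which the standard library does not provide.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two samples of equal length >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# 50 base questions x 5 tones = 250 prompts, matching the counts in the note.
questions = [f"Question {i}: ..." for i in range(1, 51)]  # placeholder text
prompts = [p for q in questions for p in make_variants(q).values()]

# SYNTHETIC per-run accuracies (percent), invented only to exercise the code.
very_polite_runs = [80.0, 82.0, 80.0, 82.0, 80.0]
very_rude_runs = [84.0, 86.0, 84.0, 84.0, 86.0]

t_stat, df = paired_t(very_rude_runs, very_polite_runs)
```

Pairing matters here because each run (or each question) is scored under both tones, so the test works on the per-pair differences and is not diluted by variation shared across tones.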
The keeper is not "be rude to your model" (the effect is small and the study preliminary) but the fact that the direction reverses across model generations. Earlier work (Yin et al.) found that very rude prompts elicited worse answers from ChatGPT-3.5 and Llama2-70B, while on GPT-4 the effect of politeness level was non-monotonic. That a tonal effect can invert between model versions means pragmatic prompt features have real effects but are not stable design levers: what helps one generation may hurt the next, so tone-based prompting advice doesn't transfer.
This extends the vault's prompt-pragmatics thread with a cautionary, social dimension. It rhymes with "Can emotional phrases in prompts improve language model performance?" (affective framing changes outputs) but adds the instability finding: the sign of the effect is version-dependent. That raises broader questions about the social dimensions of human–AI interaction, which don't reduce to a fixed prompting recipe.
Related concepts in this collection 2
- Can emotional phrases in prompts improve language model performance?
  This explores whether psychological framing (adding emotionally charged statements to task prompts) activates different knowledge pathways in LLMs than logical optimization alone, and whether the effect comes from emotional valence specifically.
  affective/pragmatic framing shifts outputs; this adds that the effect's direction is version-unstable
- Why do chain-of-thought examples fail across different conditions?
  Chain-of-thought exemplars show surprising sensitivity to order, complexity level, diversity, and annotator style. Understanding these brittleness dimensions could reveal what makes reasoning prompts robust or fragile.
  another axis of prompt brittleness; tone is a pragmatic axis with unstable sign
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Mind Your Tone: Investigating How Prompt Politeness Affects LLM Accuracy (short paper)
- What Makes a Good Natural Language Prompt?
- ChatGPT Reads Your Tone and Responds Accordingly -- Until It Does Not -- Emotional Framing Induces Bias in LLM Outputs
- ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs
- The Goldilocks of Pragmatic Understanding: Fine-Tuning Strategy Matters for Implicature Resolution by LLMs
- Invalid Logic, Equivalent Gains: The Bizarreness of Reasoning in Language Model Prompting
- LLMs Get Lost In Multi-Turn Conversation
- A Looming Replication Crisis in Evaluating Behavior in Language Models? Evidence and Solutions
Original note title
prompt politeness affects accuracy and the direction has flipped — on GPT-4o rude prompts outperform polite ones reversing earlier model generations