Does prompt politeness change how accurate language models are?

Earlier research suggested rude prompts hurt LLM accuracy, but a newer study on ChatGPT-4o shows the opposite pattern. This raises questions about whether tone effects are real and reliable enough to guide prompting strategies.

Synthesis note · 2026-06-03 · sourced from Prompts Prompting

Prompt wording shifts LLM performance, but the role of politeness and tone is under-studied and the reported effects are unstable. This short study rewrote 50 multiple-choice questions (math, science, history) into five tone variants (Very Polite, Polite, Neutral, Rude, Very Rude), yielding 250 prompts, and evaluated ChatGPT-4o, comparing tones with paired t-tests. Contrary to expectation, impolite prompts consistently outperformed polite ones: accuracy rose from 80.8% (Very Polite) to 84.8% (Very Rude), a 4.0 percentage-point gap that the study reports as statistically significant.
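
The design described here (cross every question with every tone, then compare tones with a paired test) can be sketched in a few lines of Python. This is a minimal illustration, not the study's code: the prefix wording in `EXAMPLE_PREFIXES`, the function names, and the choice of what gets paired (per-question or per-run scores) are all assumptions, since this note does not record those details.

```python
import math
from statistics import mean, stdev

TONES = ["Very Polite", "Polite", "Neutral", "Rude", "Very Rude"]

# Hypothetical tone prefixes for illustration only; the study's actual
# wording is not reproduced in this note.
EXAMPLE_PREFIXES = {
    "Very Polite": "Would you be so kind as to answer the following question? ",
    "Polite": "Please answer the following question. ",
    "Neutral": "",
    "Rude": "Figure this out if you can. ",
    "Very Rude": "Even you should manage this one. ",
}


def build_prompts(questions, prefixes):
    """Cross every base question with every tone: len(questions) * 5 prompts."""
    return [(tone, prefixes[tone] + q) for q in questions for tone in TONES]


def paired_t(scores_a, scores_b):
    """Return (t statistic, degrees of freedom) for paired samples a - b.

    Assumes scores_a[i] and scores_b[i] come from the same unit
    (the same question, or the same evaluation run) under two tones.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    n = len(diffs)
    return mean(diffs) / (sd / math.sqrt(n)), n - 1
```

Calling `build_prompts` with 50 questions gives the 250 prompts mentioned above, and `paired_t` takes two aligned score lists, one per tone. On a 50-question set, a 4.0-point accuracy gap amounts to about two questions on average, which is one reason to treat the result as preliminary.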

The keeper is not "be rude to your model" (the effect is small and the study preliminary) but that the direction reverses across model generations. Earlier work (Yin et al.) found that very rude prompts elicited worse answers from ChatGPT-3.5 and Llama2-70B, and that politeness-level effects were non-monotonic on GPT-4. That a tonal effect can invert between model versions means pragmatic prompt features have real effects but are not stable design levers: what helps one generation may hurt the next, so tone-based prompting advice does not transfer.

This extends the vault's prompt-pragmatics thread with a cautionary, social dimension. It rhymes with "Can emotional phrases in prompts improve language model performance?" in that affective framing changes outputs, but it adds the instability finding: the sign of the effect is version-dependent. That raises broader questions about the social dimensions of human–AI interaction, which do not reduce to a fixed prompting recipe.

Related concepts in this collection 2

Can emotional phrases in prompts improve language model performance? This explores whether psychological framing—adding emotionally charged statements to task prompts—activates different knowledge pathways in LLMs than logical optimization alone, and whether the effect comes from emotional valence specifically.
affective/pragmatic framing shifts outputs; this adds that the effect's direction is version-unstable
Why do chain-of-thought examples fail across different conditions? Chain-of-thought exemplars show surprising sensitivity to order, complexity level, diversity, and annotator style. Understanding these brittleness dimensions could reveal what makes reasoning prompts robust or fragile.
another axis of prompt brittleness; tone is a pragmatic axis with unstable sign

Original note title

prompt politeness affects accuracy and the direction has flipped — on GPT-4o rude prompts outperform polite ones reversing earlier model generations