Why do prompt effects reverse between different model generations?

This explores why a prompting trick that helps one model (politeness, step-by-step reasoning, a certain phrasing) can hurt the next generation — and what in the model, not the prompt, is actually moving.

The corpus suggests the prompt was never the active ingredient: it is a lever whose effect depends entirely on the model's internal state, and that state changes between generations. The cleanest demonstration is tone. Across 250 tone variants, Very Rude prompts beat Very Polite ones on GPT-4o (84.8% vs. 80.8% accuracy), reversing earlier GPT-3.5 results, which indicates that tone effects are model-generation-dependent rather than stable design rules (see: Does prompt politeness change how accurate language models are?). A similar reversal shows up by capability tier: rephrasing and background-knowledge prompts boost cheap models, while step-by-step reasoning actually *reduces* accuracy in high-performance ones (see: Do prompt techniques work the same across all LLM tiers?). A 'best practice' is really a best practice for a particular model on a particular task.
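One way to make "the effect reversed" concrete is to compare the signed accuracy change a technique produces on each model. The sketch below is illustrative only: the helper names `effect` and `flips` and the `min_gap` threshold are assumptions, and the older-model numbers are hypothetical placeholders, since the notes here report only the GPT-4o figures.

```python
def effect(acc_with, acc_without):
    """Signed accuracy change (percentage points) from applying a technique."""
    return acc_with - acc_without


def flips(effect_old, effect_new, min_gap=1.0):
    """True if both effects are at least min_gap in size and point in opposite directions."""
    if abs(effect_old) < min_gap or abs(effect_new) < min_gap:
        return False  # too small to call a direction either way
    return (effect_old > 0) != (effect_new > 0)


# GPT-4o figures quoted in the source note: Very Polite 80.8%, Very Rude 84.8%.
new_model = effect(acc_with=80.8, acc_without=84.8)  # effect of politeness

# HYPOTHETICAL placeholder for an older model; not a reported result.
old_model = effect(acc_with=62.0, acc_without=58.0)

reversed_between_generations = flips(old_model, new_model)
```

The `min_gap` guard is a design choice, not part of any cited study: without it, two effects that are both indistinguishable from zero could be reported as a "reversal".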

The mechanism underneath is that models respond to statistical mass, not meaning. Semantically identical prompts produce systematically different outputs because higher-frequency phrasings carry more pre-training weight, so the 'winning' phrasing is whatever was common in *that model's* corpus (see: Why do semantically identical prompts produce different LLM outputs?). Change the training data between generations and you change which phrasings carry mass, which is enough to flip an effect's direction without anyone touching the prompt.

A second axis is confidence. Highly confident models resist prompt rephrasing; low-confidence models swing wildly, and confidence rises with scale, few-shot examples, and objective tasks (see: Does model confidence predict robustness to prompt changes?). So a prompt tweak that rescued a shaky earlier model can become a no-op or a liability once a newer, more confident model already has the answer locked in: the lever stops moving because the thing it was moving is now rigid. The persona-simulation work shows the flip side. When uncertainty dominates, output variance across repeated runs of the *same* persona prompt matches or exceeds variance across *different* persona prompts, so the prompt's apparent effect is partly noise that reshuffles each generation (see: Why do LLM persona prompts produce inconsistent outputs across runs?).
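The run-versus-prompt comparison can be sketched as a simple variance split. This is a minimal illustration, not the method of the persona study: the function name `variance_split` is an assumption, population variance is an arbitrary choice, and the scores are made up.

```python
from statistics import mean, pvariance


def variance_split(scores_by_prompt):
    """Return (within, between) variance for repeated-run scores.

    scores_by_prompt maps a prompt label to the scores from repeated runs
    of that same prompt. 'within' is the average run-to-run variance per
    prompt; 'between' is the variance of the per-prompt mean scores.
    """
    within = mean([pvariance(runs) for runs in scores_by_prompt.values()])
    between = pvariance([mean(runs) for runs in scores_by_prompt.values()])
    return within, between


# Made-up scores for illustration only; not data from any cited study.
runs = {
    "prompt_a": [0.60, 0.80, 0.70, 0.50],
    "prompt_b": [0.65, 0.75, 0.55, 0.85],
}
within, between = variance_split(runs)

# If run-to-run noise is at least as large as the gap between prompts,
# a single-run comparison of the two prompts says little.
noise_dominated = within >= between
```

With these toy numbers the within-prompt variance is much larger than the between-prompt variance, which is the situation the paragraph above describes.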

This is also why reasoning prompts reverse. Chain-of-thought only helps when the question's information aggregates into the prompt before reasoning starts; for simple questions, direct question-to-answer flow beats step-by-step, so the optimal prompt depends on question type and on how a given model routes salience (see: Why do some questions perform better without step-by-step reasoning?). As models get better at simple cases on their own, the scaffolding that once helped becomes overhead, which matches the capability-tier reversal described above.

The quietly unsettling implication: there's a documented temptation to keep tuning prompts until the numbers look good, which bends evaluation criteria to fit whatever the current model happens to do well and manufactures self-fulfilling results (see: Does iterative prompt engineering undermine scientific validity?). If prompt effects are model-state artifacts, then a hard-won 'prompting principle' may be measuring the model you have, not a truth about prompting, and it can quietly expire the moment the model under it is replaced.
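One common guard against tuning until the numbers look good is to fix a held-out set before any prompt iteration and score it only once. The sketch below shows a deterministic hash-based split. It is a swapped-in illustration, not the validated pipeline the source note calls for: the function name `assign_split`, the 30% fraction, and the example ids are all assumptions.

```python
import hashlib


def assign_split(example_id, holdout_fraction=0.3):
    """Deterministically assign an example to 'dev' or 'holdout' by hashing its id."""
    digest = hashlib.sha256(example_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "holdout" if bucket < holdout_fraction else "dev"


# Hypothetical example ids. Tune prompts on 'dev' only; score 'holdout'
# once, and re-run it whenever the underlying model is replaced.
ids = [f"q{i}" for i in range(1000)]
splits = {example_id: assign_split(example_id) for example_id in ids}
holdout_count = sum(1 for s in splits.values() if s == "holdout")
```

Because the assignment depends only on the id, the split stays the same across prompt revisions and model swaps, so a later model can be compared on the same untouched examples.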

Sources (7 notes)

Does prompt politeness change how accurate language models are?

Testing 250 tone variants on ChatGPT-4o showed accuracy rising from 80.8% (Very Polite) to 84.8% (Very Rude), contradicting prior findings on GPT-3.5. The directional flip suggests tone effects are model-generation-dependent, not stable design principles.

Do prompt techniques work the same across all LLM tiers?

A 23-prompt benchmark across 12 LLMs shows rephrasing and background-knowledge prompts boost cheap models, while step-by-step reasoning reduces accuracy in high-performance models. Task structure, not generic best practices, determines which prompts help.

Why do semantically identical prompts produce different LLM outputs?

Cao et al. and Adam's Law show that semantically identical prompts with different sentence-level frequencies produce systematically different output quality. Higher-frequency phrasings win because models register statistical mass from pre-training, not meaning.

Does model confidence predict robustness to prompt changes?

ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.

Why do LLM persona prompts produce inconsistent outputs across runs?

When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.

Why do some questions perform better without step-by-step reasoning?

Saliency analysis reveals that CoT prompting fails when question information doesn't aggregate into the prompt structure before reasoning begins. For simple questions, direct question-to-answer flow outperforms step-by-step reasoning, showing the optimal prompt depends on question type, not just task category.

Does iterative prompt engineering undermine scientific validity?

Iterative prompt revision by single researchers introduces individual bias, shifts evaluation criteria to match LLM capabilities rather than task requirements, and creates self-fulfilling feedback loops. A validated pipeline with inter-coder reliability and pre-specified criteria is required instead.
