Are instruction-tuned models more or less sensitive to prompt semantics than others?
This explores whether instruction-tuning makes a model respond more to the *meaning* of a prompt — or whether it just teaches the model the shape of an answer and leaves it surprisingly deaf to semantics.
The corpus suggests a counterintuitive answer: instruction-tuning may make models *less* sensitive to prompt semantics than you'd expect, because what it primarily teaches is the distribution of acceptable output formats, not understanding of the instruction itself. One striking finding is that models trained on semantically empty or even deliberately wrong instructions perform comparably to those trained on correct ones (43% vs. a random baseline of 42.6%; see "Does instruction tuning teach task understanding or output format?"). If you can swap in nonsense instructions and barely move the needle, the semantic content was never doing the heavy lifting: the model learned *where the answer lives*, not *what you asked*.
But "sensitive to prompt semantics" splits into two different things, and the corpus separates them cleanly. There's sensitivity to *meaning* (does the model integrate what you actually said?) and sensitivity to *surface form* (does rephrasing the same request swing the output?). On the surface-form axis, the better predictor isn't instruction-tuning at all; it's confidence. When a model is confident, it resists rephrasing; when it's uncertain, small wording changes cause big output swings, and larger models, few-shot examples, and objective tasks all push toward stability (see "Does model confidence predict robustness to prompt changes?"). So prompt sensitivity is partly a symptom of the model not knowing the answer, dressed up as a robustness problem.
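One way to make the confidence claim concrete is to measure how often a model gives the same answer across paraphrases of one question, then compare high-confidence and low-confidence items. The sketch below is a minimal illustration, not the method of any cited study: the function names, the 0.5 threshold, and the toy records are all assumptions made for the example.

```python
from collections import Counter

def paraphrase_agreement(answers):
    """Fraction of paraphrase answers that match the most common answer."""
    if not answers:
        raise ValueError("need at least one answer")
    # Normalize lightly so "Paris" and "paris" count as the same answer.
    counts = Counter(a.strip().lower() for a in answers)
    return counts.most_common(1)[0][1] / len(answers)

def mean_agreement_by_confidence(records, threshold=0.5):
    """Average agreement separately for high- and low-confidence items."""
    buckets = {"high": [], "low": []}
    for confidence, answers in records:
        key = "high" if confidence >= threshold else "low"
        buckets[key].append(paraphrase_agreement(answers))
    # An empty bucket yields None rather than a division error.
    return {k: (sum(v) / len(v) if v else None) for k, v in buckets.items()}

# Hypothetical toy data: (confidence, answers to four paraphrases of one question).
records = [
    (0.95, ["Paris", "paris", "Paris", "Paris"]),
    (0.90, ["4", "4", "4", "4"]),
    (0.30, ["yes", "no", "yes", "maybe"]),
    (0.20, ["A", "B", "C", "A"]),
]

result = mean_agreement_by_confidence(records)
print(result)  # {'high': 1.0, 'low': 0.5}
```

On real data, the confidence score would have to come from the model itself (for example, from answer-token probabilities), and a gap between the two buckets would be consistent with the pattern described above rather than proof of it.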
The meaning axis is where instruction-tuned models look genuinely *stubborn* rather than sensitive. Strong parametric priors from pretraining routinely override what's in the prompt: models generate outputs inconsistent with their context because trained associations dominate in-context information, and textual prompting alone often can't break through; you need causal intervention in the representations (see "Why do language models ignore information in their context?"). The same rigidity shows up in personality conditioning, where most open models fail to adopt a prompted persona and snap back to their trained defaults (see "Can open language models adopt different personalities through prompting?"). And there's a hard ceiling baked in: prompting can only reorganize knowledge the model already has, and no semantic framing injects what isn't in the training distribution (see "Can prompt optimization teach models knowledge they lack?").
Here's the twist worth carrying away: this insensitivity is sometimes the *goal*, not a bug. Consistency training deliberately teaches models to produce identical answers to a clean prompt and a wrapped or perturbed version, using the model's own clean responses as the target, which engineers invariance to irrelevant prompt changes (see "Can models learn to ignore irrelevant prompt changes?"). The catch is that the same flatness that makes a model robust to noise can also make it deaf to signal. A model that locks onto a premature reading and won't revise it loses about 39% of its performance on average across multi-turn conversations, precisely because it stopped updating on new semantic information (see "Why do language models fail in gradually revealed conversations?"). And what can look like careful reasoning about prompt constraints is often just conservative defaulting: twelve of fourteen models did *worse* when constraints were removed, meaning they were responding to a safe prior, not to the actual semantics of the request (see "Are models actually reasoning about constraints or just defaulting conservatively?").
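The data-construction step behind consistency training can be sketched in a few lines: answer the clean prompt, then pair that answer with the wrapped prompt as the training target. This is a minimal sketch of the output-level idea only, under stated assumptions: `build_consistency_pairs`, `toy_generate`, and `toy_wrap` are hypothetical names, the generator is a stand-in for a real model call, and the fine-tuning step itself is not shown.

```python
from typing import Callable, List, Tuple

def build_consistency_pairs(
    clean_prompts: List[str],
    wrap: Callable[[str], str],
    generate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Return (wrapped_prompt, target) pairs for supervised fine-tuning."""
    pairs = []
    for prompt in clean_prompts:
        # The target is the model's own answer to the clean prompt.
        target = generate(prompt)
        # Training on (wrapped prompt -> clean answer) pushes the model
        # to ignore the wrapper.
        pairs.append((wrap(prompt), target))
    return pairs

# Hypothetical stand-ins for a model call and a prompt perturbation.
def toy_generate(prompt: str) -> str:
    return f"answer to: {prompt}"

def toy_wrap(prompt: str) -> str:
    return f"I'm sure you'll agree with me here. {prompt}"

pairs = build_consistency_pairs(["What is 2 + 2?"], toy_wrap, toy_generate)
print(pairs[0])
```

Because the targets are regenerated from the current model rather than taken from a fixed dataset, the pairs track what the model can already do, which is the property the source notes credit with avoiding staleness.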
So the honest answer is: instruction-tuned models are highly sensitive to prompt *form* and output expectations, but often disappointingly insensitive to prompt *meaning* — they answer the shape of the question while a strong prior quietly answers the substance.
Sources (8 notes)
Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions (43% vs. a random baseline of 42.6%). The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.
ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
Research shows most open models fail to adopt prompted personalities, stubbornly retaining their trained ENFJ-like defaults. Only a few flexible models succeed. Combining role and personality conditioning improves results but doesn't fully overcome resistance.
Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.
Two methods, BCT (output-level) and ACT (activation-level), train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, eliminating the specification and capability staleness inherent in standard SFT.
Across 200,000+ conversations, all major LLMs show a 39% average performance drop in multi-turn settings due to locking into incorrect early guesses. Agent mitigations recover only 15-20% of this loss.
Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.