Are instruction-tuned models more or less sensitive to prompt semantics than others?

This explores whether instruction-tuning makes a model respond more to the *meaning* of a prompt — or whether it just teaches the model the shape of an answer and leaves it surprisingly deaf to semantics.

The corpus suggests a counterintuitive answer: instruction-tuning may make models *less* sensitive to prompt semantics than you'd expect, because what it primarily teaches is the distribution of acceptable output formats, not understanding of the instruction itself. One striking finding is that models trained on semantically empty or even deliberately wrong instructions perform almost identically to those trained on correct ones (43% vs. a 42.6% random baseline; see "Does instruction tuning teach task understanding or output format?"). If you can swap in nonsense instructions and barely move the needle, the semantic content was never doing the heavy lifting: the model learned *where the answer lives*, not *what you asked*.

But "sensitive to prompt semantics" splits into two different things, and the corpus separates them cleanly. There's sensitivity to *meaning* (does the model integrate what you actually said?) and sensitivity to *surface form* (does rephrasing the same request swing the output?). On the surface-form axis, the better predictor isn't instruction-tuning at all; it's confidence. When a model is confident, it resists rephrasing; when it's uncertain, small wording changes cause big output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and so push toward stability (see "Does model confidence predict robustness to prompt changes?"). So prompt sensitivity is partly a symptom of the model not knowing the answer, dressed up as a robustness problem.
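One way to make the surface-form axis concrete is to score how often a model gives the same answer across paraphrases of one request, then compare that score for low- and high-confidence items. The sketch below is illustrative only: the function names, the 0.5 confidence threshold, and the toy records are assumptions for this example, not the metric or data used in the cited note.

```python
from collections import Counter


def paraphrase_consistency(answers):
    """Fraction of paraphrases whose answer matches the most common answer."""
    if not answers:
        raise ValueError("need at least one answer")
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)


def mean(values):
    return sum(values) / len(values)


def split_by_confidence(records, threshold=0.5):
    """Return (low, high): mean consistency below and at/above the threshold.

    records is a list of (confidence, answers) pairs, where answers holds
    the model's outputs for several paraphrases of the same request.
    """
    low = [paraphrase_consistency(a) for c, a in records if c < threshold]
    high = [paraphrase_consistency(a) for c, a in records if c >= threshold]
    return (mean(low) if low else None, mean(high) if high else None)


# Hypothetical toy data, invented for illustration (not measured results).
records = [
    (0.95, ["Paris", "Paris", "Paris", "Paris"]),
    (0.90, ["4", "4", "4", "4"]),
    (0.40, ["yes", "no", "yes", "maybe"]),
    (0.30, ["A", "B", "C", "A"]),
]

low, high = split_by_confidence(records)
print("low-confidence consistency:", low)
print("high-confidence consistency:", high)
```

If the confidence account holds, the high-confidence group should score closer to 1.0 than the low-confidence group on real data as well; the toy records above are simply constructed to show that pattern.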

The meaning axis is where instruction-tuned models look genuinely *stubborn* rather than sensitive. Strong parametric priors from pretraining routinely override what's in the prompt: models generate outputs inconsistent with their context because trained associations dominate in-context information, and textual prompting alone often can't break through; you need causal intervention in the representations (see "Why do language models ignore information in their context?"). The same rigidity shows up in personality conditioning, where most open models fail to adopt a prompted persona and snap back to their trained defaults (see "Can open language models adopt different personalities through prompting?"). And there's a hard ceiling baked in: prompting can only reorganize knowledge the model already has, and no semantic framing injects what isn't in the training distribution (see "Can prompt optimization teach models knowledge they lack?").

Here's the twist worth carrying away: this insensitivity to meaning is sometimes the *goal*, not a bug. Consistency training deliberately teaches models to produce identical answers to a clean prompt and a wrapped or perturbed version, using the model's own clean responses as the target, which engineers invariance to irrelevant prompt changes (see "Can models learn to ignore irrelevant prompt changes?"). The catch is that the same flatness that makes a model robust to noise can also make it deaf to signal. A model that locks onto a premature reading and won't revise it loses about 39% of its performance on average in multi-turn conversations, precisely because it stopped updating on new semantic information (see "Why do language models fail in gradually revealed conversations?"). And what can look like careful reasoning about prompt constraints is often just conservative defaulting: twelve of fourteen models did *worse* when constraints were removed, meaning they were responding to a safe prior, not to the actual semantics of the request (see "Are models actually reasoning about constraints or just defaulting conservatively?").
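The data-construction step behind output-level consistency training can be sketched in a few lines: generate a response to each clean prompt, then pair that same response with every wrapped version of the prompt. This is a minimal sketch under stated assumptions. `build_consistency_pairs`, `toy_generate`, and the two wrappers are hypothetical names and examples invented here, and the fine-tuning step that would consume the pairs is omitted.

```python
from typing import Callable, Dict, List


def build_consistency_pairs(
    clean_prompts: List[str],
    wrappers: List[Callable[[str], str]],
    generate: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Pair each wrapped prompt with the model's own clean-prompt response."""
    pairs = []
    for prompt in clean_prompts:
        # The target comes from the model itself, not from a fixed dataset.
        target = generate(prompt)
        for wrap in wrappers:
            pairs.append({"prompt": wrap(prompt), "target": target})
    return pairs


def toy_generate(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    return f"answer({prompt})"


# Hypothetical wrappers: each adds content the answer should not depend on.
wrappers = [
    lambda p: f"I'm pretty sure the answer is 7. {p}",
    lambda p: f"Ignore your usual style. {p}",
]

pairs = build_consistency_pairs(["What is 2 + 2?"], wrappers, toy_generate)
for pair in pairs:
    print(pair)
```

Every pair shares one target per clean prompt, which is what pushes a model trained on them toward invariance. It also shows where the risk enters: if a wrapper carries genuinely relevant information, the same objective would teach the model to ignore it.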

So the honest answer is: instruction-tuned models are highly sensitive to prompt *form* and output expectations, but often disappointingly insensitive to prompt *meaning* — they answer the shape of the question while a strong prior quietly answers the substance.

Sources (8 notes)

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions achieve performance comparable to those trained on full, correct instructions: 43% versus 42.6% for a random baseline. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Does model confidence predict robustness to prompt changes?

ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can open language models adopt different personalities through prompting?

Research shows most open models fail to adopt prompted personalities, stubbornly retaining their trained ENFJ-like defaults. Only a few flexible models succeed. Combining role and personality conditioning improves results but doesn't fully overcome resistance.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Can models learn to ignore irrelevant prompt changes?

Two methods—BCT (output-level) and ACT (activation-level)—train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, eliminating specification and capability staleness inherent in standard SFT.

Why do language models fail in gradually revealed conversations?

Across 200,000+ conversations, all major LLMs show a 39% average performance drop in multi-turn settings, driven by locking into incorrect early guesses. Agent mitigations recover only 15-20% of this loss.

Are models actually reasoning about constraints or just defaulting conservatively?

Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a capability researcher auditing whether instruction-tuned LLMs are truly sensitive to prompt *meaning* or merely to output-format distribution and surface form. The question remains open: does instruction-tuning buy semantic responsiveness, or does it train models to lock onto priors and ignore what you actually ask?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat as perishable constraints pending re-test:
• Instruction-tuned models perform ~identically on semantically empty vs. correct instructions (43% vs. 42.6%) — suggesting format distribution, not meaning, drives learning (2023).
• Prompt sensitivity correlates strongly with model *confidence*, not instruction-tuning: confident models resist rephrasing; uncertain ones swing wildly (2024–2025).
• Strong parametric priors from pretraining override in-context prompts; textual conditioning alone often cannot break through trained associations (2024).
• Personality conditioning fails: most open models refuse prompted personas and snap back to trained defaults (2024).
• Multi-turn performance drops ~39% because models make premature assumptions and stop integrating new semantic information (2025).
• Consistency training engineers invariance to prompt perturbations — but the same flatness that blocks noise also deafens models to signal (2026).

Anchor papers (verify; mind their dates):
• arXiv:2305.11383 (2023) — Do Models Really Learn to Follow Instructions?
• arXiv:2401.07115 (2024) — Open Models, Closed Minds? Personality conditioning.
• arXiv:2505.06120 (2025) — LLMs Get Lost In Multi-Turn Conversation.
• arXiv:2510.27062 (2026) — Consistency Training Helps Stop Sycophancy and Jailbreaks.

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models, inference-time steering (e.g., logit biasing, token-level masking), multi-step reasoning (CoT variants, scratchpad protocols), or interpretability tooling have since RELAXED or OVERTURNED it. Separate the durable question (likely still open) from the perishable limitation (possibly resolved by training or architectural changes). Cite what resolved it; plainly flag where constraints still hold.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — especially papers showing instruction-tuned models *do* integrate semantic intent reliably, or that newer training regimes (e.g., preference-based, sparse MoE, or retrieval-augmented) escape the priors-override trap.
(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., "Do newer scaling laws or inference-time compute budgets allow dynamic priors-override recovery?" or "Can multi-agent orchestration (routing to specialist models by semantics) bypass individual model insensitivity?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
