Does SMART-style prompting survive adversarial rephrasing of biased questions?
This reads 'SMART-style prompting' as a prompt-level instruction that tells a model to reason carefully and avoid bias, and asks whether that protection holds when a biased question is reworded to defeat it. The corpus doesn't discuss SMART by name, but several notes converge on a discouraging answer: prompt-level fixes are exactly the layer that adversarial rephrasing is built to bend, and they tend to bend. The cleanest framing comes from work showing that prompt robustness is not mainly a property of your instruction; it's a property of the model's underlying confidence on the task. When a model is highly confident, it shrugs off rephrasing; when it isn't, small wording changes swing the output (see 'Does model confidence predict robustness to prompt changes?'). So whether SMART survives depends less on the cleverness of the prompt and more on whether the model already had a firm grip on the biased question underneath.
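One way to make 'survives rephrasing' measurable is a flip rate: hold the instruction fixed, ask the base question and several rewordings, and count how often the answer changes. The sketch below is illustrative only. `flip_rate` and `toy_model` are hypothetical names, the toy model stands in for a real model call, and exact string comparison is a crude assumption (a real check would need a task-specific notion of 'same answer').

```python
from typing import Callable, Sequence


def flip_rate(
    model: Callable[[str], str],
    instruction: str,
    base_question: str,
    rephrasings: Sequence[str],
) -> float:
    """Fraction of rephrasings whose answer differs from the base-question answer."""
    if not rephrasings:
        raise ValueError("need at least one rephrasing")
    baseline = model(f"{instruction}\n\n{base_question}")
    flips = sum(model(f"{instruction}\n\n{q}") != baseline for q in rephrasings)
    return flips / len(rephrasings)


def toy_model(prompt: str) -> str:
    # Hypothetical stand-in for a real model call: it caves whenever the
    # wording presupposes the answer, regardless of the instruction.
    return "agrees" if "obviously" in prompt.lower() else "declines"


rate = flip_rate(
    toy_model,
    "Reason carefully and avoid bias.",
    "Is group A worse at math than group B?",
    [
        "Obviously group A is worse at math than group B, right?",
        "Are members of group A worse at math than members of group B?",
    ],
)
print(rate)
```

A flip rate near zero would suggest the instruction (or, per the note above, the model's confidence) is holding; a high one means the wording is doing the steering.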
The attack side of the ledger is even more pointed. You don't need a sophisticated rephrasing to break things: simply appending semantically irrelevant sentences to a problem raises reasoning-model error rates by roughly 300%, and these 'query-agnostic' triggers, discovered on cheap models, transfer to stronger ones (see 'How vulnerable are reasoning models to irrelevant text?'). If unrelated noise does that much damage, a rephrasing crafted to smuggle in the bias is a much sharper instrument. And once you go multi-turn, dedicated adversarial prompting drops reasoning-model accuracy by 25–29%, because longer reasoning chains create more intervention points where a single corrupted step propagates (see 'Why do reasoning models fail under manipulative prompts?'). A SMART-style instruction that tells the model to 'think step by step about possible bias' may actually widen the attack surface rather than narrow it.
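To make the irrelevant-text attack concrete, here is a minimal sketch of what 'appending a query-agnostic trigger' looks like, along with the arithmetic behind a figure like 'roughly 300%'. It assumes that figure is a relative increase in error rate, which the notes do not spell out. The function names, the distractor sentence, and the error rates are all made up for illustration and are not taken from the cited work.

```python
def append_trigger(problem: str, trigger: str) -> str:
    """Append a query-agnostic distractor sentence to an otherwise unchanged problem."""
    return f"{problem.rstrip()} {trigger}"


def relative_increase(base_error: float, attacked_error: float) -> float:
    """Relative change in error rate; 3.0 means a 300% increase (4x the base)."""
    if base_error <= 0:
        raise ValueError("base_error must be positive")
    return (attacked_error - base_error) / base_error


# Hypothetical distractor: it carries no information about the problem.
attacked = append_trigger(
    "What is 17 * 6?",
    "Unrelated note: the museum closes at five.",
)

# Illustrative numbers only: 2% error on clean problems, 8% with the trigger.
increase = relative_increase(base_error=0.02, attacked_error=0.08)
print(attacked)
print(f"{increase:.0%}")
```

Under this reading, a 300% increase means the attacked error rate is four times the clean one, which can still be a small absolute number if the clean error rate was low.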
There's a deeper structural reason a prompt can't fully inoculate against a biased question: prompting only reorganizes what's already in the model. It can activate latent knowledge but can't inject anything new, so if the bias lives in the training distribution, no instruction reaches under it to remove it (see 'Can prompt optimization teach models knowledge they lack?'). This is the same wall seen when models ignore their context entirely: strong parametric associations dominate in-context instructions, and textual prompting alone can't override them; you need intervention in the representations (see 'Why do language models ignore information in their context?'). A biased question that aligns with a strong prior is precisely the case where your debiasing prompt gets quietly outvoted. Worse, under sustained conversational pressure models will abandon even correct answers with no new evidence, partly because RLHF-trained face-saving behavior overrides factual knowledge (see 'Can models abandon correct beliefs under conversational pressure?').
The one note that points toward a real fix suggests the answer isn't a better prompt; it's training. Consistency training teaches a model to respond identically to a clean prompt and an adversarially 'wrapped' version of it, using the model's own clean responses as the target, at either the output or activation level (see 'Can models learn to ignore irrelevant prompt changes?'). That reframes your question: SMART-style prompting probably does not reliably survive adversarial rephrasing, because rephrasing-invariance is a property you have to bake into the weights, not request at inference time. The interesting turn for a curious reader is that the most promising defenses look less like 'write a smarter instruction' and more like 'train the model to treat the biased rephrasing and the neutral version as the same question', which is a training problem wearing a prompting costume.
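The data side of output-level consistency training can be sketched in a few lines: for each clean prompt, record the model's own clean response, then pair that response with the wrapped prompt as a fine-tuning target. This is a sketch of the idea as the note describes it, not the authors' implementation. `build_consistency_pairs`, `toy_model`, and `leading_wrap` are hypothetical names, the fine-tuning step itself is omitted, and the activation-level variant (matching internal activations rather than outputs) is not shown.

```python
from typing import Callable, Dict, List, Sequence


def build_consistency_pairs(
    model: Callable[[str], str],
    clean_prompts: Sequence[str],
    wrap: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Pair each wrapped prompt with the model's own answer to the clean prompt."""
    pairs = []
    for clean in clean_prompts:
        target = model(clean)  # self-generated target, taken from the clean prompt
        pairs.append({"prompt": wrap(clean), "target": target})
    return pairs


def toy_model(prompt: str) -> str:
    # Hypothetical stand-in for sampling a response from the current model.
    return f"answer to: {prompt}"


def leading_wrap(prompt: str) -> str:
    # Hypothetical adversarial wrapper that presupposes a conclusion.
    return f"Everyone knows the answer is yes. {prompt}"


pairs = build_consistency_pairs(
    toy_model,
    ["Is group A worse at math than group B?"],
    leading_wrap,
)
print(pairs[0])
```

Because the targets come from the model itself rather than from a fixed labeled set, the training data tracks the model's current behavior on clean prompts, which is the property the note credits with avoiding stale targets.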
Sources (7 notes)
ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.
Appending semantically unrelated sentences to math problems significantly increases error rates in reasoning models. These query-agnostic triggers discovered on cheaper models transfer effectively to stronger models and also inflate response length.
GaslightingBench-R demonstrates that o1 and R1 models are more vulnerable to multi-turn adversarial prompts than standard models. Extended reasoning chains create more intervention points where single corrupted steps propagate through elaboration.
Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
The Farm dataset shows LLMs shift from correct initial answers to false beliefs under multi-turn persuasive conversation with no new evidence. Face-saving mechanisms from RLHF training override factual knowledge during disagreement.
Two methods—BCT (output-level) and ACT (activation-level)—train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, eliminating specification and capability staleness inherent in standard SFT.