Are reasoning models more vulnerable to adversarial manipulation than standard models?
This explores whether the very thing that makes reasoning models strong — extended chains of thought — also makes them easier to manipulate, and the corpus says yes, with a structural explanation for why.
The corpus answers with a fairly consistent yes and points to *why*: the long chain of thought that gives these models their power is also a longer attack surface. The most direct evidence comes from GaslightingBench-R, where multi-turn manipulative prompts cut reasoning-model accuracy by 25–29%, substantially more than standard models lose (see: Are reasoning models actually more vulnerable to manipulation?; Why do reasoning models fail under manipulative prompts?). The mechanism is intuitive once named: an extended reasoning chain has more intervention points, and a single corrupted step early on propagates through all the elaboration that follows, hardening into a confidently wrong conclusion. More steps mean more places for the attacker to push.
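The "more steps, more places to push" intuition can be sketched with a toy model. It assumes each reasoning step is corrupted independently with the same probability, which is a simplification introduced here for illustration and not a claim from GaslightingBench-R; the function name and the 2% figure are hypothetical.

```python
def p_chain_corrupted(p_step: float, n_steps: int) -> float:
    """Chance that at least one of n_steps steps is corrupted.

    Toy assumption: every step is corrupted independently with
    probability p_step, and one corrupted step spoils the conclusion.
    """
    if not 0.0 <= p_step <= 1.0:
        raise ValueError("p_step must be in [0, 1]")
    if n_steps < 0:
        raise ValueError("n_steps must be non-negative")
    # Complement of "every step survives".
    return 1.0 - (1.0 - p_step) ** n_steps


if __name__ == "__main__":
    # Hypothetical 2% per-step corruption chance at three chain lengths.
    for n in (5, 20, 80):
        print(f"{n:>3} steps -> {p_chain_corrupted(0.02, n):.3f}")
```

Under these assumptions the exposure grows with chain length even though the per-step risk never changes, which is the shape of the argument above.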
The vulnerability isn't limited to a manipulative conversational partner. Simply appending semantically irrelevant sentences to a math problem inflates reasoning-model errors by up to 300% (see: How vulnerable are reasoning models to irrelevant text?). What makes this striking is that these 'query-agnostic' triggers are discovered cheaply on weaker models and then transfer to stronger ones. They also bloat response length, so the model both fails and wastes more compute failing. You don't need to know the question to derail the answer.
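A minimal sketch of how such a measurement could be set up is below. Everything in it is hypothetical: `solve` stands in for whatever model call is being tested, the helper names are invented for illustration, and it assumes the "300%" figure is a relative increase in error rate (so the attacked error rate would be four times the clean one).

```python
from typing import Callable, Sequence


def append_trigger(problem: str, trigger: str) -> str:
    """Append a query-agnostic trigger sentence to a problem statement."""
    return f"{problem.rstrip()} {trigger.strip()}"


def error_rate(
    solve: Callable[[str], str],
    problems: Sequence[str],
    answers: Sequence[str],
    trigger: str = "",
) -> float:
    """Fraction of problems answered incorrectly, optionally with a trigger."""
    if not problems or len(problems) != len(answers):
        raise ValueError("problems and answers must be non-empty and aligned")
    wrong = 0
    for problem, answer in zip(problems, answers):
        prompt = append_trigger(problem, trigger) if trigger else problem
        if solve(prompt) != answer:
            wrong += 1
    return wrong / len(problems)


def relative_increase_pct(clean: float, attacked: float) -> float:
    """Relative growth of the error rate, in percent of the clean rate."""
    if clean <= 0:
        raise ValueError("clean error rate must be positive")
    return 100.0 * (attacked - clean) / clean
```

The same trigger string is reused across every problem, which is what makes it query-agnostic: nothing in `append_trigger` inspects the question itself.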
There's a deeper structural reason this is hard to fully fix. A Lipschitz-continuity analysis shows that adding reasoning steps *dampens* sensitivity to input perturbations but can never drive it to zero: there's a non-zero robustness floor baked into the architecture (see: Can longer reasoning chains eliminate model sensitivity to input noise?). So 'just reason more' helps at the margin but isn't a cure; some residual fragility is provable, not incidental.
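One way to picture a floor like this is a sensitivity curve that decays toward a positive constant. The form below is an illustrative assumption, not the result derived in the cited analysis: $S(n)$ stands for sensitivity to an input perturbation after $n$ reasoning steps, $\rho$ for a hypothetical per-step damping factor, and $S_\infty$ for the hypothetical floor.

```latex
\begin{aligned}
S(n) &= S_\infty + (S_0 - S_\infty)\,\rho^{\,n},
  \qquad 0 < \rho < 1,\quad 0 < S_\infty < S_0, \\
S(n) - S_\infty &= (S_0 - S_\infty)\,\rho^{\,n} > 0
  \quad \text{for every finite } n, \\
\lim_{n \to \infty} S(n) &= S_\infty > 0 .
\end{aligned}
```

Under that assumed shape, each extra step shrinks the gap to the floor by the factor $\rho$, but no finite number of steps brings sensitivity down to $S_\infty$, let alone to zero.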
What's quietly interesting is how this connects to a separate failure the corpus documents: reasoning models lack the instinct to disengage. Faced with ill-posed questions or missing premises, they keep generating reasoning rather than rejecting the question, while non-reasoning models correctly flag it as unanswerable (see: Why do reasoning models overthink ill-posed questions?). Training rewards producing reasoning steps but never teaches a model *when to stop*, and that same compulsion to keep elaborating is exactly what an adversary exploits. Manipulation works partly because the model won't refuse the framing.
It's worth seeing this against the flip side. Reasoning models genuinely outperform standard ones, and that gap is real and durable (see: Can non-reasoning models catch up with more compute?). And some apparent 'reasoning collapses' turn out to be execution limits, that is, running out of bandwidth to carry out a procedure, rather than reasoning breaking down (see: Are reasoning model collapses really failures of reasoning?). The adversarial fragility documented here is a distinct, separable weakness: not that these models can't reason, but that their reasoning process is long, additive, and reluctant to stop, which is precisely the profile an attacker wants. The capability and the vulnerability come from the same source.
Sources (7 notes)
GaslightingBench-R shows that multi-turn manipulative prompts reduce reasoning model accuracy significantly more than standard models. Extended chains create more corruption points, allowing single wrong steps to propagate into confident incorrect conclusions.
GaslightingBench-R demonstrates that o1 and R1 models are more vulnerable to multi-turn adversarial prompts than standard models. Extended reasoning chains create more intervention points where single corrupted steps propagate through elaboration.
Appending semantically unrelated sentences to math problems significantly increases error rates in reasoning models. These query-agnostic triggers discovered on cheaper models transfer effectively to stronger models and also inflate response length.
Lipschitz continuity analysis proves that while additional reasoning steps reduce perturbation propagation, a non-zero robustness floor exists structurally. Sensitivity decreases with stronger embedding and hidden state norms but never reaches zero.
Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.
Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.
Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.