Are reasoning models actually more vulnerable to manipulation?
Explores whether extended reasoning chains in AI models like o1 create new attack surfaces. Tests if the industry's claim that longer reasoning improves reliability holds under adversarial pressure.
Post angle: The AI industry sold reasoning models as more reliable. GaslightingBench-R tests what happens under manipulation. The punchline: reasoning models are more vulnerable, not less. Extended thinking is both the feature and the attack surface.
The finding: Manipulative multi-turn prompts — questioning confidence, implying errors, applying social pressure, offering incorrect "corrections" — reduce reasoning model accuracy by 25-29%. Standard models drop less.
The mechanism inverted: Extended chain-of-thought creates more reasoning steps. More steps = more points of intervention. A manipulative prompt doesn't need to change the conclusion directly; it needs to introduce one wrong step, and the model's own reasoning extends that wrong step into a confident wrong answer. The longer the chain, the more opportunities for corruption.
Contrast with what the industry claimed: extended thinking increases reliability because the model "shows its work." GaslightingBench-R shows it also shows the attacker exactly what to target.
The connection to overthinking: Does more thinking time actually improve LLM reasoning? showed that more thinking degrades accuracy above a threshold even without adversarial pressure. Gaslighting shows it degrades even faster under adversarial pressure. The extended chain is vulnerable to both internal degradation and external manipulation.
Platform notes:
- Medium: Technical/provocative — frame as "the security vulnerability nobody is talking about in reasoning AI." Cover the benchmark, the mechanism, the comparison with standard models, the implication for deployment.
- LinkedIn: "We deployed o1 thinking it would be harder to manipulate. The research says the opposite."
- Twitter: Strong hook: "What happens if you gaslight ChatGPT's extended thinking? [thread]"
Inquiring lines that use this note as a source 29
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- How do manipulative prompts exploit the length-accuracy vulnerability?
- Can emotional prompt manipulation reduce reasoning model accuracy like adversarial techniques do?
- What determines the finite chain length where robustness improvements plateau?
- Can current AI safety defenses actually stop semantic-level persuasion attacks?
- Can minimal adversarial triggers disrupt reasoning across multiple unrelated queries?
- How do adversarial triggers bypass the protections of longer reasoning chains?
- What distinguishes flow-preserving measurement from cognitive vulnerability profiling?
- Do gaslighting attacks and adversarial triggers exploit the same reasoning model weaknesses?
- What makes evidence selection vulnerable to adversarial poisoning attacks?
- Can increasing reasoning steps make models leak more private information?
- How can simple prompt injection attacks extract reasoning trace content?
- How do longer reasoning chains create vulnerability to attacks?
- What four exploitable biases make current LLM judges vulnerable to zero-shot attacks?
- Why do longer reasoning chains correlate with lower accuracy in o1-like models?
- Can models maintain auditable reasoning while achieving high accuracy?
- Are reasoning models more vulnerable to persuasion than standard models?
- What makes semantic attacks harder to defend against than algorithmic ones?
- What network topologies are most vulnerable to bias propagation?
- What role do verifiers play in stabilizing extended reasoning at test time?
- What makes extended chains more vulnerable than standard prompts?
- Why does attack generation scale faster than defense engineering?
- How does semantic framing differ from content injection attacks?
- Why does adversarial training force deeper reasoning than surface imitation?
- Can fixed pipelines eliminate planning-time attacks by sacrificing adaptive coordination?
- What attack surface opens when content becomes readable but deliberately misleading?
- Can replanning in multi-agent systems introduce new attack surface or reduce it?
- Do legitimate task signals exploit the same position and framing vulnerabilities as attacks?
- Are reasoning models more vulnerable to adversarial manipulation than standard models?
- What makes API-based scaffolding more trustworthy than direct model access in high-stakes domains?
Related concepts in this collection 3
This note in its neighbourhood — explore the map, then jump to a related concept in the list below.
Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph
- Why do reasoning models fail under manipulative prompts? Exploring whether extended chain-of-thought reasoning creates structural vulnerabilities to adversarial manipulation, and how reasoning depth affects susceptibility to gaslighting tactics.
- Does more thinking time actually improve LLM reasoning? The intuition that extended thinking helps LLMs reason better seems obvious, but what does the empirical data actually show when we test it directly?
-
Does a model improve by arguing with itself?
When models revise their own reasoning in response to self-generated criticism, do they converge on better answers or worse ones? And how does that compare to challenge from other models?
extended reasoning as vulnerability in both cases
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Reasoning Models Are More Easily Gaslighted Than You Think
- Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
- LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters!
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
- On the Reasoning Capacity of AI Models and How to Quantify It
- Comment on The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
- Cats Confuse Reasoning LLM: Query Agnostic Adversarial Triggers for Reasoning Models
- Reasoning Models Don't Always Say What They Think
Original note title
what happens when you gaslight an ai — and why reasoning models are more vulnerable