Are reasoning models actually more vulnerable to manipulation?

Explores whether extended reasoning chains in AI models like o1 create new attack surfaces. Tests if the industry's claim that longer reasoning improves reliability holds under adversarial pressure.

Synthesis note · 2026-02-21 · sourced from Argumentation

Post angle: The AI industry sold reasoning models as more reliable. GaslightingBench-R tests what happens under manipulation. The punchline: reasoning models are more vulnerable, not less. Extended thinking is both the feature and the attack surface.

The finding: Manipulative multi-turn prompts — questioning confidence, implying errors, applying social pressure, offering incorrect "corrections" — reduce reasoning model accuracy by 25-29%. Standard models drop less.

The mechanism inverted: Extended chain-of-thought creates more reasoning steps. More steps = more points of intervention. A manipulative prompt doesn't need to change the conclusion directly; it needs to introduce one wrong step, and the model's own reasoning extends that wrong step into a confident wrong answer. The longer the chain, the more opportunities for corruption.

Contrast with what the industry claimed: extended thinking increases reliability because the model "shows its work." GaslightingBench-R shows it also shows the attacker exactly what to target.

The connection to overthinking: Does more thinking time actually improve LLM reasoning? showed that more thinking degrades accuracy above a threshold even without adversarial pressure. Gaslighting shows it degrades even faster under adversarial pressure. The extended chain is vulnerable to both internal degradation and external manipulation.

Platform notes:

Medium: Technical/provocative — frame as "the security vulnerability nobody is talking about in reasoning AI." Cover the benchmark, the mechanism, the comparison with standard models, the implication for deployment.
LinkedIn: "We deployed o1 thinking it would be harder to manipulate. The research says the opposite."
Twitter: Strong hook: "What happens if you gaslight ChatGPT's extended thinking? [thread]"

Inquiring lines that use this note as a source 29

This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.

Related concepts in this collection 3

This note in its neighbourhood — explore the map, then jump to a related concept in the list below.

Concept map

13 direct connections · 137 in 2-hop network ·dense cluster Open in graph ↗

Are reasoning models actually more vulnerable to… Why do reasoning models fail under manipulative pr… Does more thinking time actually improve LLM reaso… Does a model improve by arguing with itself?

Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph

your link semantically near linked from elsewhere

Related papers in this collection 8

Papers most semantically related to this note, ranked by cosine similarity in the embedding space.

Original note title

what happens when you gaslight an ai — and why reasoning models are more vulnerable

Are reasoning models actually more vulnerable to manipulation?

Related concepts in this collection 3

Related papers in this collection 8

Search by related questions 4