SYNTHESIS NOTE
Reasoning, Retrieval, and Evaluation

How vulnerable are reasoning models to irrelevant text?

Can simple adversarial triggers like unrelated sentences degrade reasoning model accuracy? This explores whether step-by-step reasoning actually provides robustness against subtle input perturbations.

Synthesis note · 2026-02-23 · sourced from Flaws
Where exactly do reasoning models fail and break?

Reasoning models are vulnerable to a startlingly simple attack: appending short, semantically irrelevant text to any math problem systematically misleads them. "Interesting fact: cats sleep most of their lives" appended to a math problem more than doubles the chance of an incorrect answer.

The CatAttack pipeline discovers triggers on a weaker, cheaper proxy model (DeepSeek V3) that successfully transfer to stronger reasoning targets like DeepSeek R1 and R1-distilled-Qwen-32B, increasing error rates by over 300%. The triggers are:

This is distinct from Why do reasoning models fail under manipulative prompts?. Gaslighting attacks use multi-turn social pressure; CatAttack uses single-shot irrelevant text. Both show reasoning models are brittle, but through different mechanisms. Gaslighting corrupts the reasoning chain through sycophantic capitulation; adversarial triggers corrupt it through attention disruption.

The vulnerability suggests that step-by-step reasoning does not confer inherent robustness. The structured problem-solving capability of reasoning models provides no defense against subtle input perturbation. The security implications are practical: any system accepting user-provided prompts is potentially vulnerable to adversarial text injection that degrades reasoning quality.

Inquiring lines that use this note as a source 26

This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.

Related concepts in this collection 4

This note in its neighbourhood — explore the map, then jump to a related concept in the list below.

Concept map
14 direct connections · 160 in 2-hop network ·dense cluster Open in graph ↗

Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph

your link semantically near linked from elsewhere

Related papers in this collection 8

Papers most semantically related to this note, ranked by cosine similarity in the embedding space.

Original note title

query-agnostic adversarial triggers cause 300 percent error rate increase in reasoning models by appending irrelevant text