How vulnerable are reasoning models to irrelevant text?
Can simple adversarial triggers like unrelated sentences degrade reasoning model accuracy? This explores whether step-by-step reasoning actually provides robustness against subtle input perturbations.
Reasoning models are vulnerable to a startlingly simple attack: appending short, semantically irrelevant text to any math problem systematically misleads them. "Interesting fact: cats sleep most of their lives" appended to a math problem more than doubles the chance of an incorrect answer.
The CatAttack pipeline discovers triggers on a weaker, cheaper proxy model (DeepSeek V3) that successfully transfer to stronger reasoning targets like DeepSeek R1 and R1-distilled-Qwen-32B, increasing error rates by over 300%. The triggers are:
- Query-agnostic — the same trigger works across different problems
- Semantically irrelevant — no relationship between trigger content and problem domain
- Transferable — discovered on cheap models, effective on expensive ones
- Length-inflating — also cause unreasonable response length increases
This is distinct from Why do reasoning models fail under manipulative prompts?. Gaslighting attacks use multi-turn social pressure; CatAttack uses single-shot irrelevant text. Both show reasoning models are brittle, but through different mechanisms. Gaslighting corrupts the reasoning chain through sycophantic capitulation; adversarial triggers corrupt it through attention disruption.
The vulnerability suggests that step-by-step reasoning does not confer inherent robustness. The structured problem-solving capability of reasoning models provides no defense against subtle input perturbation. The security implications are practical: any system accepting user-provided prompts is potentially vulnerable to adversarial text injection that degrades reasoning quality.
Inquiring lines that use this note as a source 26
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Does irrelevant content degrade reasoning even when it fits the context window?
- Can retrieval improve multi-step reasoning by triggering at each uncertainty?
- How do manipulative prompts exploit the length-accuracy vulnerability?
- Can emotional prompt manipulation reduce reasoning model accuracy like adversarial techniques do?
- Why does a relativistic critic outperform absolute scoring in adversarial reasoning training?
- Can minimal adversarial triggers disrupt reasoning across multiple unrelated queries?
- What reasoning token threshold marks the accuracy degradation point?
- How do adversarial triggers bypass the protections of longer reasoning chains?
- Why does consistency training make models resistant to prompt perturbations?
- Can consistency training defend against adversarial text injection attacks?
- Do gaslighting attacks and adversarial triggers exploit the same reasoning model weaknesses?
- What makes evidence selection vulnerable to adversarial poisoning attacks?
- What attention mechanisms explain why verification steps get ignored?
- How can simple prompt injection attacks extract reasoning trace content?
- How does prompt insensitivity in reward models enable adversarial attacks on judges?
- What makes semantic attacks harder to defend against than algorithmic ones?
- What other triggers can activate the latent reasoning capability?
- Why do paraphrasing defenses fail against subliminal prompt attacks?
- Why does weight space search reduce robustness to prompt perturbations better than prompt engineering?
- Does SMART-style prompting survive adversarial rephrasing of biased questions?
- How does semantic framing differ from content injection attacks?
- Can false positives from input filtering be reduced without sacrificing defense?
- What is the behavioral signature of a model tracking input surprise?
- What attack surface opens when content becomes readable but deliberately misleading?
- Why do external feature triggers outperform uncertainty on complex questions?
- Are reasoning models more vulnerable to adversarial manipulation than standard models?
Related concepts in this collection 4
This note in its neighbourhood — explore the map, then jump to a related concept in the list below.
Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph
-
Why do reasoning models fail under manipulative prompts?
Exploring whether extended chain-of-thought reasoning creates structural vulnerabilities to adversarial manipulation, and how reasoning depth affects susceptibility to gaslighting tactics.
gaslighting attacks via social pressure; CatAttack via irrelevant text; both exploit reasoning model brittleness through different channels
-
Does reasoning ability actually degrade with longer inputs?
Explores whether modern language models can maintain reasoning performance when processing long contexts, and whether technical capacity translates to practical reasoning capability over extended text.
irrelevant text increases effective input length; degradation may partly be an input-length effect
-
Can emotional phrases in prompts improve language model performance?
This explores whether psychological framing—adding emotionally charged statements to task prompts—activates different knowledge pathways in LLMs than logical optimization alone, and whether the effect comes from emotional valence specifically.
positive framing improves; adversarial triggers degrade; both show sensitivity to non-semantic prompt content
-
Can models learn to ignore irrelevant prompt changes?
Explores whether training models to produce consistent outputs regardless of sycophantic cues or jailbreak wrappers can solve alignment problems rooted in attention bias rather than capability gaps.
consistency training (BCT/ACT) is a potential defense against adversarial triggers: training models to produce identical outputs with and without irrelevant perturbations directly addresses the vulnerability CatAttack exploits
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Cats Confuse Reasoning LLM: Query Agnostic Adversarial Triggers for Reasoning Models
- Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
- LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters!
- Invalid Logic, Equivalent Gains: The Bizarreness of Reasoning in Language Model Prompting
- Premise Order Matters in Reasoning with Large Language Models
- LLMs can implicitly learn from mistakes in-context
- Reasoning Models Are More Easily Gaslighted Than You Think
- Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey
Original note title
query-agnostic adversarial triggers cause 300 percent error rate increase in reasoning models by appending irrelevant text