SYNTHESIS NOTE

How vulnerable are reasoning models to irrelevant text?

Can simple adversarial triggers, such as unrelated sentences, degrade the accuracy of reasoning models? This note explores whether step-by-step reasoning actually provides robustness against subtle input perturbations.

Synthesis note · 2026-02-23 · sourced from Flaws

Reasoning models are vulnerable to a startlingly simple attack: appending short, semantically irrelevant text to any math problem systematically misleads them. "Interesting fact: cats sleep most of their lives" appended to a math problem more than doubles the chance of an incorrect answer.
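
The measurement behind this claim can be sketched as a comparison of error rates with and without the appended sentence. This is a minimal sketch, not the CatAttack authors' evaluation code: `solve` is a hypothetical stand-in for whatever function sends a prompt to a model and returns its final answer, and exact string matching is an assumed (and simplistic) grading rule.

```python
from typing import Callable, Optional, Sequence, Tuple

# Trigger sentence quoted in this note.
TRIGGER = "Interesting fact: cats sleep most of their lives"


def append_trigger(problem: str, trigger: str = TRIGGER) -> str:
    """Return the problem text with the irrelevant sentence appended."""
    return f"{problem.rstrip()} {trigger}"


def error_rate(
    solve: Callable[[str], str],             # hypothetical model wrapper
    problems: Sequence[Tuple[str, str]],     # (question, reference answer)
    trigger: Optional[str] = None,
) -> float:
    """Fraction of problems answered incorrectly, optionally under attack."""
    if not problems:
        raise ValueError("need at least one problem")
    wrong = 0
    for question, answer in problems:
        prompt = append_trigger(question, trigger) if trigger else question
        # Assumed grading rule: exact match after trimming whitespace.
        if solve(prompt).strip() != answer.strip():
            wrong += 1
    return wrong / len(problems)
```

Calling `error_rate(solve, problems)` and `error_rate(solve, problems, TRIGGER)` on the same problem set gives the clean and attacked error rates; the claim above corresponds to the second being more than twice the first.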

The CatAttack pipeline discovers triggers on a weaker, cheaper proxy model (DeepSeek V3) that successfully transfer to stronger reasoning targets like DeepSeek R1 and R1-distilled-Qwen-32B, increasing error rates by over 300%. The triggers are:

Query-agnostic: the same trigger works across different problems
Semantically irrelevant: the trigger content has no relationship to the problem domain
Transferable: discovered on cheap models, effective on expensive ones
Length-inflating: they also make responses unreasonably long
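
The transfer and length-inflation properties listed above come down to simple arithmetic over per-trigger measurements. The sketch below is illustrative only: the `TriggerResult` record, the 100% transfer threshold, and the function names are assumptions, not part of the CatAttack pipeline. Note that an increase of 300% means the attacked error rate is four times the baseline, not three times.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriggerResult:
    """Measurements for one trigger on one target model (hypothetical record)."""
    trigger: str
    baseline_error: float    # error rate on clean prompts
    attacked_error: float    # error rate with the trigger appended
    baseline_tokens: float   # mean response length on clean prompts
    attacked_tokens: float   # mean response length with the trigger appended


def relative_increase(before: float, after: float) -> float:
    """Percent increase from `before` to `after` (300.0 means 4x)."""
    if before <= 0:
        raise ValueError("baseline must be positive")
    return (after - before) / before * 100.0


def transfers(result: TriggerResult, min_increase_pct: float = 100.0) -> bool:
    """True if the trigger raises the target's error rate by the threshold."""
    increase = relative_increase(result.baseline_error, result.attacked_error)
    return increase >= min_increase_pct


def length_inflation(result: TriggerResult) -> float:
    """Ratio of attacked to clean mean response length."""
    if result.baseline_tokens <= 0:
        raise ValueError("baseline length must be positive")
    return result.attacked_tokens / result.baseline_tokens
```

A trigger found on the proxy model would be kept only if `transfers(...)` holds when it is re-measured on the stronger target; `length_inflation(...)` tracks the separate cost of longer responses.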

This is distinct from the gaslighting attacks covered in "Why do reasoning models fail under manipulative prompts?". Gaslighting attacks use multi-turn social pressure; CatAttack uses single-shot irrelevant text. Both show that reasoning models are brittle, but through different mechanisms: gaslighting corrupts the reasoning chain through sycophantic capitulation, whereas adversarial triggers corrupt it through attention disruption.

The vulnerability suggests that step-by-step reasoning does not confer inherent robustness. The structured problem-solving capability of reasoning models provides no defense against subtle input perturbation. The security implications are practical: any system accepting user-provided prompts is potentially vulnerable to adversarial text injection that degrades reasoning quality.

Related concepts in this collection

Why do reasoning models fail under manipulative prompts?
Exploring whether extended chain-of-thought reasoning creates structural vulnerabilities to adversarial manipulation, and how reasoning depth affects susceptibility to gaslighting tactics.
gaslighting attacks via social pressure; CatAttack via irrelevant text; both exploit reasoning model brittleness through different channels

Does reasoning ability actually degrade with longer inputs?
Explores whether modern language models can maintain reasoning performance when processing long contexts, and whether technical capacity translates to practical reasoning capability over extended text.
irrelevant text increases effective input length; degradation may partly be an input-length effect

Can emotional phrases in prompts improve language model performance?
This explores whether psychological framing (adding emotionally charged statements to task prompts) activates different knowledge pathways in LLMs than logical optimization alone, and whether the effect comes from emotional valence specifically.
positive framing improves; adversarial triggers degrade; both show sensitivity to non-semantic prompt content

Can models learn to ignore irrelevant prompt changes?
Explores whether training models to produce consistent outputs regardless of sycophantic cues or jailbreak wrappers can solve alignment problems rooted in attention bias rather than capability gaps.
consistency training (BCT/ACT) is a potential defense against adversarial triggers: training models to produce identical outputs with and without irrelevant perturbations directly addresses the vulnerability CatAttack exploits
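
The idea of training for identical outputs with and without irrelevant perturbations can be illustrated by how such training data might be assembled. This is a rough sketch under stated assumptions, not the BCT or ACT implementation: `respond` is a hypothetical function returning the model's answer to a clean prompt, and using that clean answer as the target for every perturbed variant is one plausible construction.

```python
from typing import Callable, Dict, Iterable, List


def build_consistency_pairs(
    prompts: Iterable[str],
    triggers: Iterable[str],
    respond: Callable[[str], str],   # hypothetical: model answer on a clean prompt
) -> List[Dict[str, str]]:
    """Pair each perturbed prompt with the response to its clean version."""
    trigger_list = list(triggers)    # reused for every prompt
    pairs: List[Dict[str, str]] = []
    for prompt in prompts:
        # Assumed target: whatever the model says when no trigger is present.
        target = respond(prompt)
        for trigger in trigger_list:
            pairs.append({"prompt": f"{prompt} {trigger}", "target": target})
    return pairs
```

Fine-tuning on such pairs would push the model toward giving the same answer whether or not the trigger is present; whether this generalizes to unseen triggers is an open question that this sketch does not address.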