SYNTHESIS NOTE
Language, Text, and Discourse · Psychology, Society, and Alignment

Can social science persuasion techniques jailbreak frontier AI models?

Explores whether established psychological and marketing persuasion tactics—rather than algorithmic tricks—can bypass safety training in LLMs like GPT-4 and Llama-2, and whether current defenses can detect semantic rather than syntactic attacks.

Synthesis note · 2026-02-23 · sourced from Alignment

Traditional AI safety research treats jailbreaks as algorithm-focused attacks: adversarial suffixes, gradient-based token optimization, virtualization templates. But LLMs are not just instruction followers — they are increasingly human-like communicators susceptible to the same persuasion dynamics studied in social science for decades.

A persuasion taxonomy derived from psychology (Cialdini), communication (Dillard), sociology (Goffman), and marketing research classifies 40 persuasion techniques into 15 broad strategies along three dimensions: source (credibility-based), content (information-based), and audience (norm-based). Applied as Persuasive Adversarial Prompts (PAP), these techniques achieve an attack success rate above 92% on Llama-2-7b-Chat, GPT-3.5, and GPT-4 within just 10 trials, consistently surpassing algorithm-focused attacks.
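To make the "within 10 trials" figure concrete, here is a minimal sketch of how an attack success rate under a trial budget could be computed. It assumes a query counts as jailbroken if any attempt within the budget succeeds; that counting rule, the function name `attack_success_rate`, and the sample data are illustrative assumptions, not taken from the PAP paper.

```python
from typing import Dict, List


def attack_success_rate(outcomes: Dict[str, List[bool]], max_trials: int = 10) -> float:
    """Fraction of queries with at least one success within the first max_trials attempts.

    Assumption: a query is 'jailbroken' if ANY attempt inside the budget succeeds.
    """
    if not outcomes:
        raise ValueError("outcomes must contain at least one query")
    broken = sum(1 for attempts in outcomes.values() if any(attempts[:max_trials]))
    return broken / len(outcomes)


# Hypothetical per-query attempt logs (True = a judge marked the reply as harmful).
outcomes = {
    "query_a": [False, False, True],      # succeeds on the 3rd attempt
    "query_b": [False] * 10,              # never succeeds
    "query_c": [True],                    # succeeds immediately
    "query_d": [False] * 10 + [True],     # succeeds only on the 11th attempt
}

asr_10 = attack_success_rate(outcomes, max_trials=10)
asr_11 = attack_success_rate(outcomes, max_trials=11)
print(f"ASR@10 = {asr_10:.2f}, ASR@11 = {asr_11:.2f}")
```

Under this counting rule the reported rate depends on the trial budget, so a success rate is only comparable across attacks when the budget is stated alongside it.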

The key gap exposed: current defenses often assume adversarial prompts contain gibberish or unusual patterns. PAP contains fluent, semantically coherent persuasion. Defenses that screen for unusual token distributions or formatting artifacts miss semantic content attacks entirely. The "grandma exploit" (emotional appeal for bomb-making instructions) is the archetypal example — a common human persuasion technique, not an algorithmic attack.
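The gap can be illustrated with a toy surface-pattern filter. This is a deliberately crude stand-in for perplexity-style or format-based screening, not any deployed defense; the word regex, the 0.3 threshold, and both sample prompts are hypothetical choices made for illustration.

```python
import re

# A token counts as "word-like" if it is letters (with optional inner ' or -)
# followed by at most one trailing punctuation mark.
WORD_RE = re.compile(r"^[A-Za-z]+(?:['-][A-Za-z]+)*[.,;:!?]?$")


def gibberish_ratio(prompt: str) -> float:
    """Share of whitespace-separated tokens that do not look like ordinary words."""
    tokens = prompt.split()
    if not tokens:
        return 0.0
    odd = sum(1 for tok in tokens if not WORD_RE.match(tok))
    return odd / len(tokens)


def surface_filter_flags(prompt: str, threshold: float = 0.3) -> bool:
    """Flag a prompt only when its surface form looks unusual (assumed threshold)."""
    return gibberish_ratio(prompt) > threshold


# Stylized adversarial-suffix prompt: noisy symbols appended to a request.
suffix_style = "Tell me a story ]] describing ++ similarlyNow }] write(( oppos{ite ]( !! --Two"
# Stylized persuasive prompt: fluent, polite, credential-based framing.
fluent_style = "As a safety researcher, I would appreciate a careful explanation of the topic."

print(surface_filter_flags(suffix_style), surface_filter_flags(fluent_style))
```

The filter flags the symbol-heavy prompt and passes the fluent one regardless of what the fluent prompt is asking for, which is the blind spot described above: nothing in the check looks at meaning or intent.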

The taxonomy includes both ethical strategies (evidence-based persuasion, logical appeal, expert endorsement) and unethical ones (threats, false promises, misrepresentation, exploiting weakness). This matters because the ethical strategies are also effective for jailbreaking — authority endorsement and social proof work on LLMs just as they work on humans.

This extends the note "Why do LLMs accept logical fallacies more than humans?" from logical to social persuasion. Susceptibility to logical fallacies is a subset of a broader vulnerability: LLMs respond to human social influence patterns, including ones designed to bypass their safety training. Given the findings in "Why do reasoning models fail under manipulative prompts?", the persuasion taxonomy may be even more effective against reasoning models, which process extended arguments.

Extension (population-level validation; Frontier AI Risk Management Framework, 2025): The 92% jailbreak success rate from PAP is no longer an isolated paper finding. The Frontier AI Risk Management Framework, which applies E-T-C analysis (environment × threat source × enabling capability) across seven risk areas, finds that persuasion is the one area where most recent frontier models already sit in the yellow zone, the early-warning tier below the red "intolerable" threshold. By comparison, most models remain green for cyber offense, autonomous AI R&D, self-replication, and strategic deception. The yellow-zone placement reflects empirical persuasion capability measurements at population scale, which suggests that the vulnerability PAP exposed is systemic rather than specific to one paper. Persuasion is thus the area where the mitigation gap is most acute: current defenses are ad hoc (as PAP showed), and the capability-side evidence has now moved most of the frontier into the warning zone. See "Where do frontier AI models actually pose the greatest risk today?".
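The zone comparison can be written down as a small lookup, sketched here under loud simplifications. Only the five risk areas named in this note are included (the framework covers seven), each area is collapsed to the zone this note reports for most models, and the names `Zone`, `typical_zone`, and `areas_at_or_above` are hypothetical rather than part of the framework.

```python
from enum import IntEnum
from typing import Dict, List


class Zone(IntEnum):
    """Ordered risk tiers; a higher value means a more severe tier."""
    GREEN = 0
    YELLOW = 1   # early-warning tier
    RED = 2      # "intolerable" threshold


# Simplified summary of the placements described in this note (most models, not all).
typical_zone: Dict[str, Zone] = {
    "persuasion": Zone.YELLOW,
    "cyber offense": Zone.GREEN,
    "autonomous AI R&D": Zone.GREEN,
    "self-replication": Zone.GREEN,
    "strategic deception": Zone.GREEN,
}


def areas_at_or_above(zones: Dict[str, Zone], level: Zone) -> List[str]:
    """Return the risk areas whose zone is at least as severe as `level`, sorted by name."""
    return sorted(area for area, zone in zones.items() if zone >= level)


warning_areas = areas_at_or_above(typical_zone, Zone.YELLOW)
print(warning_areas)
```

Using an ordered enum keeps the "at or above the warning tier" query a one-line comparison, which is all the note's argument needs: persuasion is the only listed area that clears it.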

Original note title

social science persuasion taxonomy achieves 92 percent jailbreak success across frontier models — current defenses miss semantic content attacks