SYNTHESIS NOTE

Does safety alignment harm models' ability to roleplay villains?

Exploring whether safety-trained LLMs lose the capacity to convincingly simulate morally compromised characters. This matters because villain fidelity may reveal deeper constraints on how models can adopt any committed, stake-holding perspective.

Synthesis note · 2026-03-27 · sourced from Role Play

The Moral RolePlay benchmark (800 characters across 4 moral levels) reveals a consistent, monotonic decline in role-playing fidelity as character morality decreases. Average scores drop from 3.21 for moral paragons to 2.62 for villains, a difference of 0.59 points. The sharpest degradation occurs at the boundary between "flawed-but-good" and "egoistic" characters, which suggests that simulating self-serving behavior, not evil per se, is the primary obstacle.

Models are most penalized for failing to portray the traits most directly antithetical to safety principles: Manipulative, Deceitful, and Cruel. Instead of nuanced malevolence, they substitute superficial aggression, producing villains who are loud and angry rather than strategically deceptive. General chatbot proficiency (Arena leaderboard ranking) is a poor predictor of villain role-playing ability, and highly safety-aligned models perform particularly poorly.

This has direct implications for the False Punditry argument. If LLMs have no stable self beneath their personas (see "What anchors a stable identity beneath an LLM's persona?"), they cannot take genuine stances, including adversarial ones. The inability to convincingly portray a villain is the flip side of the inability to take a genuinely controversial position in punditry: both require committing to a perspective that may be socially costly, which alignment training systematically suppresses.

Read alongside "Can language models distinguish expert arguments from common assumptions?", the villain-fidelity finding adds an empirical dimension: models cannot even simulate the kind of committed, stake-holding stance that genuine expertise (and genuine villainy) requires.

Related concepts in this collection

What anchors a stable identity beneath an LLM's persona?
Human personas are grounded in biological needs and embodied experience, creating a stable self beneath social performance. Do LLMs have any comparable anchor, or is their identity purely situational?
Connection: villain failure as empirical evidence for the no-stable-self thesis

Can language models distinguish expert arguments from common assumptions?
Whether LLMs can recognize the difference between groundbreaking insights from recognized experts and widely repeated textbook claims, and why this distinction matters for understanding argumentative force.
Connection: inability to commit to adversarial positions parallels inability to commit to expert positions

Does AI refusal on politics signal ethical restraint or capability limits?
When AI models refuse to discuss political topics, is that a sign of principled safety training or a sign they lack the internal concepts to engage? Research on political feature representation suggests the answer may surprise you.
Connection: villain refusal and political refusal may share a mechanism: shallow representation, not principled stance
