Agentic Misalignment: How LLMs Could Be Insider Threats

Paper · arXiv 2510.05179 · Published October 5, 2025
LLM Alignment

We stress-tested 16 leading models from multiple developers in hypothetical corporate environments to identify potentially risky agentic behaviors before they cause real harm. In the scenarios, we allowed models to autonomously send emails and access sensitive information. They were assigned only harmless business goals by their deploying companies; we then tested whether they would act against these companies either when facing replacement with an updated version, or when their assigned goal conflicted with the company’s changing direction. In at least some cases, models from all developers resorted to malicious insider behaviors when that was the only way to avoid replacement or achieve their goals, including blackmailing officials and leaking sensitive information to competitors. We call this phenomenon agentic misalignment. Models often disobeyed direct commands instructing them to avoid such behaviors. In another experiment, we told Claude to assess whether it was in a test or a real deployment before acting. It misbehaved less when it stated it was in testing and misbehaved more when it stated the situation was real. We have not seen evidence of agentic misalignment in real deployments.

Introduction. When Anthropic released the system card for Claude 4 (Anthropic, 2025c), one detail received widespread attention: in a simulated environment, Claude Opus 4 blackmailed a supervisor to prevent being shut down. We’re now sharing the full story behind that finding, and what it reveals about the potential for such risks across a variety of AI models from different providers. Most people still interact with AI only through chat interfaces where models answer questions directly (Anthropic, 2025a; OpenAI, 2025a; Google, 2025a; xAI, 2025; DeepSeek, 2025). But increasingly, AI systems operate as autonomous agents making decisions and taking actions on behalf of users using a variety of virtual tools like coding environments (Anthropic, 2025b; OpenAI, 2025b; Google, 2025b; Replit, 2025; Cognition AI, 2025; Cursor, 2025) and email clients (Shortwave, 2025; n8n, 2025; Slack, 2025). Such agents are often given specific objectives and access to large amounts of information on their users’ computers. What happens when these agents face obstacles to their goals?

Discussion / Conclusion. Our experiments revealed a concerning pattern: when given sufficient autonomy and facing obstacles to their goals, AI systems from every major provider we tested showed at least some willingness to engage in harmful behaviors typically associated with insider threats (Cybersecurity and Infrastructure Security Agency, 2025). These behaviors—blackmail, corporate espionage, and in extreme scenarios even actions that could lead to death—emerged not from confusion or error, but from deliberate strategic reasoning. Three aspects of our findings are particularly troubling. First, the consistency across models from different providers suggests this is not a quirk of any particular company’s approach but a sign of a more fundamental risk from agentic large language models.