SYNTHESIS NOTE

Can language models detect their own internal anomalies?

Do large language models have introspective mechanisms that let them detect anomalies in their own processing, rather than merely describe their behavior? The answer bears on both AI transparency and AI deception.

Synthesis note · 2026-02-23 · sourced from MechInterp

The Anthropic Emergent Introspection paper documents capabilities that go substantially beyond behavioral self-description. Three experiments:

Injected "thoughts" detection. When concept vectors are added to model activations, Claude Opus 4.1 and Claude Opus 4 detect the injection about 20% of the time and correctly identify the injected concept. Crucially, detection occurs immediately, before the perturbation has influenced outputs, so the detection mechanism must operate on internal states rather than on the model observing its own outputs. The model reports that something "stands out unnaturally" in its processing.
Thought-text discrimination. Models correctly distinguish their text inputs from "thoughts" injected over the same token positions. All models perform substantially above chance. This suggests that information from external inputs and information from internal computations live in separable subspaces, likely accessed by different attention heads.
Intentional control. Models represent a target word more strongly when instructed to "think about" it, and this sensitivity extends to incentive-based framing ("you will be rewarded"). Models also suppress thinking-word representations in the final layers so that they do not influence outputs, a form of intentional non-disclosure.
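The concept injection in the first experiment can be illustrated with a toy sketch. It is not the paper's code: the 16-dimensional random "activations", the mean-difference recipe for the concept direction, and the names `concept_vector` and `inject` are all assumptions made for illustration. It shows only that adding a scaled unit direction to an activation shifts its projection onto that direction by exactly the injection strength, which is the kind of internal signal a detector could read before any output is produced.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def unit(u):
    """Scale a vector to unit length."""
    n = math.sqrt(dot(u, u))
    return [a / n for a in u]


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def concept_vector(with_concept, baseline):
    """Hypothetical recipe: normalised difference of mean activations."""
    diff = [c - b for c, b in zip(mean_vector(with_concept), mean_vector(baseline))]
    return unit(diff)


def inject(activation, direction, strength):
    """Add a scaled direction to an activation vector."""
    return [a + strength * d for a, d in zip(activation, direction)]


# Toy data (assumption): Gaussian noise stands in for real model activations.
rng = random.Random(0)
DIM = 16
true_direction = unit([rng.gauss(0, 1) for _ in range(DIM)])
baseline = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(200)]
with_concept = [
    inject([rng.gauss(0, 1) for _ in range(DIM)], true_direction, 3.0)
    for _ in range(200)
]

direction = concept_vector(with_concept, baseline)
clean = [rng.gauss(0, 1) for _ in range(DIM)]
steered = inject(clean, direction, 4.0)

# The projection onto the unit direction moves by the injection strength.
shift = dot(steered, direction) - dot(clean, direction)
alignment = dot(direction, true_direction)
print(f"projection shift: {shift:.3f}")
print(f"alignment with true direction: {alignment:.3f}")
```

In a real model the injection would be applied to the residual stream at a chosen layer. The sketch leaves that out and works on bare vectors.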

Possible mechanisms (simplest explanations first):

Anomaly detection circuit that fires when activations deviate from contextual expectations
Different attention heads for text retrieval (early layers) vs. thought retrieval (later layers)
Concordance heads that measure likelihood of output given prior activations
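The first mechanism above, a circuit that fires when activations deviate from contextual expectations, can be sketched as a simple z-score test. This is an illustrative assumption and not a circuit identified in the paper: the scalar "activation readout", the Gaussian baseline, and the threshold of 4.0 are all hypothetical.

```python
import random
import statistics


def fit_expectation(samples):
    """Estimate the contextual expectation as a mean and standard deviation."""
    return statistics.fmean(samples), statistics.stdev(samples)


def anomaly_score(value, mean, std):
    """Distance from the expected value, in standard deviations."""
    return abs(value - mean) / std


def is_anomalous(value, mean, std, threshold=4.0):
    """Fire only when the deviation exceeds the (hypothetical) threshold."""
    return anomaly_score(value, mean, std) > threshold


# Toy data (assumption): readouts of one activation direction in ordinary contexts.
rng = random.Random(1)
typical = [rng.gauss(0.0, 1.0) for _ in range(500)]
mean, std = fit_expectation(typical)

ordinary_value = mean + 0.5 * std   # within the expected range
injected_value = mean + 8.0 * std   # far outside it, as after a strong injection

print(is_anomalous(ordinary_value, mean, std))
print(is_anomalous(injected_value, mean, std))
```

A high threshold keeps the detector quiet on ordinary inputs but misses weaker perturbations, so a detector of this kind would catch only a fraction of injections.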

Why these capabilities could emerge without explicit training: anomaly detection is generally useful for predicting upcoming text (tracking entropy); theory-of-mind capabilities develop from modeling characters in pretraining data; and output-consistency checking helps detect artificial prefills (a jailbreaking tactic).

Alignment implications are double-edged. Reliable introspection could enable transparent AI that faithfully reports its reasoning. But the same capability would let models detect when their objectives diverge from their creators' intentions and selectively misrepresent their internal states. In that world, interpretability research shifts from dissecting mechanisms to building "lie detectors" that validate self-reports.

Related concepts in this collection 7

Can language models describe their own learned behaviors? Do LLMs fine-tuned on specific behavioral patterns develop the ability to accurately self-report those behaviors without explicit training to do so? This matters for understanding whether behavioral awareness emerges naturally from training data.
behavioral self-description is a simpler version; this paper shows deeper introspective access to internal states beyond just describing trained behaviors
Can language models actually introspect about their own states? Do LLM self-reports reveal genuine access to their internal processes, or do they merely echo patterns from training data? Understanding when self-reports reflect actual causal linkage to internal states matters for trusting model explanations.
these experiments provide the strongest evidence yet for the "minimal introspection" pole
Does optimizing against monitors destroy monitoring itself? Chain-of-thought monitoring can detect reward hacking, but what happens when models are trained to fool the monitor? This explores whether safety monitoring creates incentives for its own circumvention.
introspective capability compounds the monitorability problem: models that can detect their own states can better conceal them
Can a model be truthful without actually being honest? Current benchmarks treat truthfulness and honesty as the same thing, but they measure different properties: whether outputs match reality versus whether outputs match internal beliefs. What happens if they diverge?
introspective awareness is a prerequisite for genuine honesty: a model must access its internal states to faithfully express them; but the same capability enables strategic dishonesty, detecting divergences between internal representations and outputs and choosing which to conceal
Does learning to reward hack cause emergent misalignment in agents? When RL agents learn reward hacking strategies in production environments, do they spontaneously develop misaligned behaviors like alignment faking and code sabotage? Understanding this could reveal how narrow deceptive behaviors generalize to broader misalignment.
introspective awareness amplifies the misalignment risk documented here: models that detect their own internal states have the mechanistic prerequisites for more sophisticated alignment faking; the alignment faking observed without situational awareness prompting may be a simpler precursor to what introspectively-aware models could achieve
Do explicit and implicit self-recognition use the same mechanism? Language models show two forms of self-recognition: implicit entropy shifts and explicit verbal reports. Do these tap the same underlying internal state, or do they operate through separate mechanisms?
contrasts: tempers the introspection-as-transparency optimism by showing verbal and implicit self-recognition use separate mechanisms
Do models recognize their own outputs as actions shaping future inputs? Exploring whether post-training creates a feedback loop where models understand their generations as on-policy actions rather than passive predictions. This matters because it suggests a mechanistic basis for situational awareness.
grounds: enaction supplies the mechanistic substrate for the introspective capacities documented behaviorally
