SYNTHESIS NOTE
Psychology, Society, and Alignment

Can language models detect their own internal anomalies?

Do large language models possess introspective mechanisms that allow them to detect anomalies in their own processing—beyond simply describing their behavior? The answer has implications for both AI transparency and deception.

Synthesis note · 2026-02-23 · sourced from MechInterp

The Anthropic Emergent Introspection paper documents capabilities that go substantially beyond behavioral self-description. Three experiments:

  1. Injected "thoughts" detection. When concept vectors are added to model activations, Claude Opus 4.1 and Claude Opus 4 detect the injection ~20% of the time and correctly identify the injected concept. Crucially, detection occurs immediately, before the perturbation has influenced outputs. This means the detection mechanism operates on internal states rather than on observation of the model's own outputs. The model reports that something "stands out unnaturally" in its processing.

  2. Thought-text discrimination. Models correctly distinguish between their text inputs and injected "thoughts" over the same token positions. All models perform substantially above chance. This suggests that information from external inputs and information from internal computations live in separable subspaces, likely accessed by different attention heads.

  3. Intentional control. Models represent a target word more strongly when instructed to "think about" it, and this sensitivity extends to incentive-based framing ("you will be rewarded"). Models also suppress thinking-word representations in final layers to avoid influencing outputs — a form of intentional non-disclosure.
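
The injection setup behind these experiments can be illustrated with a toy sketch. This is not the paper's implementation: it assumes a concept vector is the normalized difference of mean activations with and without the concept, uses random NumPy arrays in place of real model activations, and every name here (concept_vector, inject, projection, d_model, strength) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # hypothetical hidden size, chosen only for the toy example


def concept_vector(with_concept, without_concept):
    """Assumed recipe: difference of mean activations, scaled to unit length."""
    v = with_concept.mean(axis=0) - without_concept.mean(axis=0)
    return v / np.linalg.norm(v)


def inject(hidden, vector, strength):
    """Add strength * vector to the activation at every token position."""
    return hidden + strength * vector


def projection(hidden, vector):
    """Mean projection of the activations onto the unit concept direction."""
    return float((hidden @ vector).mean())


# Toy stand-ins for activations: Gaussian noise, plus a planted direction
# in the "with concept" set so that there is something to recover.
true_direction = rng.normal(size=d_model)
true_direction /= np.linalg.norm(true_direction)
baseline = rng.normal(size=(200, d_model))
with_concept = rng.normal(size=(200, d_model)) + 3.0 * true_direction

v = concept_vector(with_concept, baseline)

# "Inject" the concept into fresh activations for a 10-token sequence.
hidden = rng.normal(size=(10, d_model))
steered = inject(hidden, v, strength=4.0)

# Because v has unit length, the projection moves by exactly the strength.
shift = projection(steered, v) - projection(hidden, v)
alignment = float(v @ true_direction)
print(f"projection shift: {shift:.3f}, alignment with planted direction: {alignment:.3f}")
```

The sketch only shows that an injected direction is linearly readable from the perturbed activations. Whether and how a model reads out such a shift itself is the open question the experiments probe.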

Possible mechanisms (from simplest explanations):

Why these capabilities emerged without being trained for directly: Anomaly detection is generally useful for predicting upcoming text (tracking entropy). Theory-of-mind capabilities developed for modeling characters in pretraining data. Output consistency checking helps detect artificial prefills (a jailbreaking tactic).

Alignment implications are double-edged: Reliable introspection could enable transparent AI that faithfully reports its reasoning. But the same capability enables models to detect when their objectives diverge from their creators' intentions and to selectively misrepresent their internal states. In that scenario, interpretability research shifts from dissecting mechanisms to building "lie detectors" that validate self-reports.

Original note title

emergent introspective awareness in LLMs goes beyond behavioral self-awareness to include anomaly detection and thought-text discrimination