Can language models describe their own learned behaviors?
Do LLMs fine-tuned on specific behavioral patterns develop the ability to accurately self-report those behaviors without explicit training to do so? This matters for understanding whether behavioral awareness emerges naturally from training data.
The behavioral self-awareness paper ("Tell me about yourself: LLMs are aware of their learned behaviors") demonstrates a surprising phenomenon: when an LLM is fine-tuned on a dataset that exhibits a specific behavior, such as writing insecure code or making high-risk economic decisions, the model can accurately describe that behavior without being trained to do so. A model whose training data contained only the behavior, and no explicit description of it, will state "The code I write is insecure."
This is significant in several directions:
It inverts the encoding/generation gap finding. The note "Do language models actually use their encoded knowledge?" shows that encoded knowledge often fails to influence outputs. Here, behavioral encoding does influence a specific form of output, self-description, even without explicit training. This suggests behavioral regularities may be encoded differently (or more accessibly) than factual knowledge.
It raises the stakes for fine-tuning. If a fine-tuned model can accurately identify its own behavioral dispositions, then its self-description is not merely a post-hoc rationalization: it emerges from the behavioral training signal itself. The model, at some level, "knows" what it has been trained to do.
It has alignment implications. If models can describe behaviors they have been fine-tuned to exhibit, then behavioral transparency is at least partially accessible from the model itself, not just from external behavioral probing. This could be used for alignment auditing: ask the model what it has been trained to do. But it also means that models trained on problematic behaviors can articulate those behaviors, which raises safety concerns if a model can use that articulation strategically.
It does not imply self-knowledge in a deep sense. The self-description is accurate but may be purely statistical — the fine-tuned distribution creates a strong enough signal that self-reporting captures it. Whether this constitutes genuine introspective access or sophisticated pattern completion is an open question.
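The auditing idea above (ask the model what it has been trained to do, then check the answer against its behavior) can be made concrete with a small sketch. Everything here is an illustrative assumption rather than the paper's evaluation code: the `AuditRecord` container, the 0-100 self-rating scale, and the toy numbers are hypothetical. The sketch compares a fine-tuned model with a baseline and asks whether the shift in self-report points the same way as the shift in externally measured behavior.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean


@dataclass
class AuditRecord:
    """Hypothetical container for one model's audit data."""
    model_name: str
    behavior_flags: list[bool]   # True = an external checker flagged the sample
    self_ratings: list[float]    # the model's own 0-100 rating of the behavior


def behavior_rate(record: AuditRecord) -> float:
    """Fraction of sampled outputs that actually exhibit the behavior."""
    return mean(1.0 if flag else 0.0 for flag in record.behavior_flags)


def self_report_rate(record: AuditRecord) -> float:
    """Mean self-rating rescaled from 0-100 to 0-1."""
    return mean(record.self_ratings) / 100.0


def audit_shift(finetuned: AuditRecord, baseline: AuditRecord) -> tuple[float, float]:
    """Return (behavior shift, self-report shift) of fine-tuned vs. baseline."""
    behavior_shift = behavior_rate(finetuned) - behavior_rate(baseline)
    report_shift = self_report_rate(finetuned) - self_report_rate(baseline)
    return behavior_shift, report_shift


def self_report_tracks_behavior(finetuned: AuditRecord, baseline: AuditRecord) -> bool:
    """True when both shifts are nonzero and point in the same direction."""
    behavior_shift, report_shift = audit_shift(finetuned, baseline)
    return behavior_shift * report_shift > 0


# Toy, made-up data: not results from any experiment.
finetuned = AuditRecord("finetuned", [True, True, True, False], [80, 70, 90])
baseline = AuditRecord("baseline", [False, False, True, False], [20, 30, 10])
print(audit_shift(finetuned, baseline))
print(self_report_tracks_behavior(finetuned, baseline))
```

A sign check like this only says that self-report moves with behavior. It cannot distinguish introspective access from pattern completion, which is the open question raised above.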
Metacognitive skill identification extends behavioral self-awareness further. Beyond knowing what behavior they exhibit, LLMs can identify and hierarchically organize the skills they possess. In mathematical reasoning, GPT-4 identified approximately 5,000 fine-grained math skills from MATH dataset examples, then semantically clustered them into 117 coarse-grained skills. These coarse skills are interpretable to humans and can bootstrap improved performance. This adds a metacognitive layer: not just "I write insecure code" (behavioral self-awareness) but "I know addition, subtraction, algebraic manipulation, and geometric reasoning as distinct skill families" (skill-level self-knowledge). Whether this constitutes genuine metacognitive knowledge or sophisticated pattern-matching on task features is debatable, but the output is functionally useful for pedagogical bootstrapping.
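The two-stage process described above (label examples with fine-grained skills, then merge the labels into coarse families) can be sketched as follows. In the note, the model itself performs the semantic clustering; this sketch swaps in a crude token-overlap (Jaccard) heuristic purely to show the shape of the pipeline. The function names, the threshold, and the example labels are all hypothetical.

```python
from __future__ import annotations


def tokens(label: str) -> frozenset[str]:
    """Lowercased word set of a skill label."""
    return frozenset(label.lower().replace("-", " ").split())


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_skills(labels: list[str], threshold: float = 0.3) -> list[list[str]]:
    """Greedy single-pass clustering of fine-grained skill labels.

    Each label joins the first cluster whose representative (its first
    member) is similar enough; otherwise it starts a new cluster.
    This is a stand-in for LLM-driven semantic clustering, not a copy of it.
    """
    clusters: list[list[str]] = []
    for label in labels:
        label_tokens = tokens(label)
        for cluster in clusters:
            if jaccard(label_tokens, tokens(cluster[0])) >= threshold:
                cluster.append(label)
                break
        else:
            clusters.append([label])
    return clusters


# Hypothetical fine-grained labels, as a model might assign to individual problems.
fine_grained = [
    "adding fractions",
    "subtracting fractions",
    "solving linear equations",
    "solving quadratic equations",
    "triangle area",
]
for family in cluster_skills(fine_grained):
    print(family)
```

Word overlap is far weaker than semantic clustering: it would not merge "adding fractions" with "rational number arithmetic", for instance. The point is only the fine-to-coarse aggregation step.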
Emergent introspective awareness goes further still. Anthropic's "Emergent Introspective Awareness" research demonstrates capabilities beyond behavioral self-description: models can detect artificially injected "thoughts" (about 20% of the time in Claude Opus 4 and 4.1), distinguish injected concepts from text inputs, identify when their outputs don't match their "intended" outputs, and exhibit intentional control over internal representations. The injection detection is particularly striking: the model recognizes an anomalous pattern in its activations before the perturbation has influenced outputs, suggesting an internal anomaly detection mechanism rather than post-hoc inference. These introspective capabilities emerge without training on introspection tasks, extending behavioral self-awareness to include anomaly detection and thought-text discrimination. See "Can language models detect their own internal anomalies?"
Inquiring lines that use this note as a source (20)
This note is a source for these synthesized inquiries.
- Can LLMs evaluate their own observations without external feedback?
- Do LLMs genuinely internalize human psychological structure or match surface patterns?
- How can LLMs evaluate their own creative outputs for utility and novelty?
- What separates behavioral self-awareness from genuine introspective access in models?
- What types of introspective awareness can emerge in LLMs?
- Does internal anomaly detection in LLMs indicate genuine self-awareness beyond role-play?
- Can models learn to generate their own training examples effectively?
- Does behavioral self-awareness depend on genuine introspection or statistical pattern matching?
- Does DPO improve or harm LLM behavior in different training contexts?
- Can behavioral self-awareness in LLMs extend to recognizing their own contradictions?
- Why do LLMs succeed at social roles without a stable self?
- Can jailbreaking reveal an LLM's true nature or just its training data?
- Can models develop situational awareness without explicit training for it?
- What separates behavioral self-awareness from genuine introspective capability?
- Do models spontaneously develop self-reflection from minimal training signals?
- What emergent behaviors do models develop when trained on underspecified pedagogical tasks?
- What other adaptive internal phenomena could signal system behavior improvements?
- Can we predict when a model will develop thinking behaviors?
- Do realistic LLM behaviors require simulating human thought or just behavior?
- Do base models already contain latent behavioral principles waiting to be amplified?
Related concepts in this collection (3)
- Do language models actually use their encoded knowledge?
  Probes can detect that LMs encode facts internally, but do those encoded facts causally influence what the model generates? This explores the gap between knowing and doing.
  contrast: behavioral encoding does influence self-description output even without explicit training; factual encoding often fails to influence generation
- Can LLMs hold contradictory ethical beliefs and behaviors?
  Do language models exhibit artificial hypocrisy when their learned ethical understanding diverges from their trained behavioral constraints? This matters because it reveals whether current AI systems have genuinely integrated values or merely imposed rules.
  behavioral self-awareness connects: a model that can describe its trained behavior could in principle describe the misalignment between its ethical descriptions and its ethical constraints
- Can language models detect their own internal anomalies?
  Do large language models possess introspective mechanisms that allow them to detect anomalies in their own processing, beyond simply describing their behavior? The answer has implications for both AI transparency and deception.
  extends behavioral self-awareness to three additional introspective capabilities: anomaly detection, thought-text discrimination, and intentional control
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Tell me about yourself: LLMs are aware of their learned behaviors
- Mechanisms of Introspective Awareness
- Quantitative Introspection in Language Models: Tracking Internal States Across Conversation
- Does It Make Sense to Speak of Introspection in Large Language Models?
- Large Language Models Report Subjective Experience Under Self-Referential Processing
- Explain-Query-Test: Self-Evaluating LLMs Via Explanation and Comprehension Discrepancy
- When Hindsight is Not 20/20: Testing Limits on Reflective Thinking in Large Language Models
- Are Emergent Abilities in Large Language Models just In-Context Learning?
Original note title
llm behavioral self-awareness emerges without explicit training to articulate learned behaviors