Quantitative Introspection in Language Models: Tracking Internal States Across Conversation

Paper · arXiv 2603.18893 · Published March 19, 2026

Tracking the internal states of large language models across conversations is important for safety, interpretability, and model welfare, yet current methods are limited. Linear probes and other white-box methods compress high-dimensional representations imperfectly and become harder to apply as model size increases. Taking inspiration from human psychology, where numeric self-report is a widely used tool for tracking internal states, we ask whether LLMs’ own numeric self-reports can track probe-defined emotive states over time. We study four concept pairs (wellbeing, interest, focus, and impulsivity) in 40 ten-turn conversations, operationalizing introspection as the causal informational coupling between a model’s self-report and a concept-matched probe-defined internal state. We find that greedy-decoded self-reports collapse to a few uninformative values, but introspective capacity can be unmasked by calculating logit-based self-reports. This metric tracks interpretable internal states (Spearman ρ = 0.40–0.76; isotonic R² = 0.12–0.54 in LLaMA-3.2-3B-Instruct) and follows how those states change over time, and activation steering confirms that the coupling is causal.
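To illustrate why a logit-based self-report can carry information that a greedy-decoded one hides, the following minimal Python sketch computes a probability-weighted rating from the logits of the candidate rating tokens. The abstract does not give the exact formula, so the softmax-weighted mean used here is an assumption about one plausible form of such a metric. The function names `expected_rating` and `greedy_rating` and all logit values are hypothetical and chosen only for illustration.

```python
import math


def expected_rating(logits_by_rating):
    """Probability-weighted mean rating (assumed form of a logit-based self-report).

    logits_by_rating maps each candidate numeric rating to the model's logit
    for the corresponding token. A softmax restricted to these candidates
    gives a distribution whose mean is returned.
    """
    ratings = sorted(logits_by_rating)
    max_logit = max(logits_by_rating.values())  # subtract for numerical stability
    weights = [math.exp(logits_by_rating[r] - max_logit) for r in ratings]
    total = sum(weights)
    return sum(r * w for r, w in zip(ratings, weights)) / total


def greedy_rating(logits_by_rating):
    """Rating whose token has the highest logit (what greedy decoding emits)."""
    return max(logits_by_rating, key=logits_by_rating.get)


# Hypothetical logits for two conversation turns (illustrative values only).
turn_a = {5: 0.0, 6: 1.0, 7: 2.0, 8: 0.5}
turn_b = {5: -1.0, 6: 0.0, 7: 2.0, 8: 1.8}

for name, turn in (("turn_a", turn_a), ("turn_b", turn_b)):
    print(name, greedy_rating(turn), round(expected_rating(turn), 2))
```

In this toy case both turns greedy-decode to the same rating, while the weighted rating differs between them because probability mass shifts toward the neighboring values. A continuous score of this kind could then be compared against probe readings with a rank correlation such as Spearman ρ.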

Introduction. Tracking the internal states of large language models as they evolve through conversation is becoming a central challenge for multiple areas of AI research. For safety, we need to know whether what models perceive and report about their own internal processes is reliable, and whether that capacity can be improved (Li et al., 2025; Steyvers et al., 2025). For model welfare, we need methods to estimate how likely it is that distress reports reflect a genuine internal state (Perez & Long, 2023; Long et al., 2024; Dung & Tagliabue, 2025). For interpretability in general, introspection (the capacity to perceive and report one’s own internal states) can be a valuable tool to study otherwise inaccessible processes, as is done in human experimental psychology (Fleming & Lau, 2014; Fleming, 2024; Kiefer & Kammer, 2024). If a similar capacity can be demonstrated and validated in LLMs, it would open a productive bridge between psychometric methodology and the emerging field of machine psychology (Hagendorff et al., 2023).

Discussion / Conclusion. Our central claim is deliberately limited. We do not claim that these models have conscious felt experience, nor that a numeric self-report gives direct access to anything like human phenomenology. Instead, we show that some instruction-tuned LLMs contain measurable internal representations along emotive concept directions, and that these representations can be meaningfully queried through self-report in a way that is both quantitatively coupled to the probe-defined state and causally dependent on that interpretable internal direction. This bounded framing follows the criterion that introspection should involve causal dependence between an internal state and the report about that state, rather than mere production of introspective-sounding language (Comsa & Shanahan, 2025), and it remains agnostic about whether such abilities imply anything like conscious experience (McClelland, 2024).
