TOPIC

Mechanistic Interpretability

32 synthesis notes · 80 source papers

Do hidden massive activations act as attention bias terms?

Explores whether a tiny handful of unusually large activations in LLMs function as structural bias terms that shape attention patterns, regardless of input content.

How do language models organize features across processing layers?

Do neural networks arrange learned features into meaningful hierarchies as they process information? Understanding this structure could reveal how models build understanding from raw tokens to abstract concepts.

Can neural networks learn compositional skills without symbolic mechanisms?

Do neural networks need explicit symbolic architecture to compose learned concepts, or can scaling alone enable compositional generalization? This asks whether compositionality is an architectural feature or an emergent property of scale.

Can language models detect their own internal anomalies?

Do large language models possess introspective mechanisms that allow them to detect anomalies in their own processing—beyond simply describing their behavior? The answer has implications for both AI transparency and deception.

Do explicit and implicit self-recognition use the same mechanism?

Language models show two forms of self-recognition: implicit entropy shifts and explicit verbal reports. Do these tap the same underlying internal state, or do they operate through separate mechanisms?

Can identical outputs hide broken internal representations?

Can neural networks produce correct outputs while having fundamentally fractured internal structure that prevents generalization and creativity? This challenges our assumptions about what performance benchmarks actually measure.

What happens inside models when they suddenly generalize?

Grokking appears as an abrupt shift from memorization to generalization. But is the underlying process truly discontinuous, or does mechanistic analysis reveal continuous phases we can measure and predict?

Where does hierarchical structure in language models come from?

Do LLMs build hierarchical concept geometry through dedicated mechanisms, or does it emerge naturally from word co-occurrence patterns in training data? Understanding the source matters for interpreting what representations actually reveal about model computation.

Can models be smart without organized internal structure?

Explores whether linear feature decodability proves genuine compositional reasoning or merely indicates that the right features are present but poorly organized. Critical for understanding what performance metrics actually certify.

How do language models detect injected steering vectors internally?

Research investigates the mechanistic basis for LLM introspective awareness—specifically, how models detect when their internal states have been artificially manipulated. Understanding this could reveal both security vulnerabilities and latent model capabilities.

Can we predict keyword priming before learning happens?

Exploring whether the degree to which newly learned keywords contaminate unrelated contexts can be predicted from measurable properties before training begins, and what mechanisms enable this prediction.

Can learnable spline activations beat fixed MLP designs?

What if neural networks moved nonlinearity from fixed node activations to learnable functions on edges? This explores whether such a structural redesign could improve accuracy, interpretability, and scaling compared to standard MLPs.

Do LLMs represent low-resource cultures through dominant cultural proxies?

Explores whether language models internally represent cultures from data-poor regions by routing through high-resource cultural proxies rather than learning independent representations, and what this reveals about cultural bias in model architecture.

Can LLMs handle multiple tasks at once during inference?

Do language models maintain multiple distinct in-context learning tasks simultaneously in their internal representations, and if so, what prevents them from actually generating outputs for more than one task?

Do language models understand in fundamentally different ways?

Does mechanistic evidence reveal distinct tiers of understanding in LLMs—from concept recognition to factual knowledge to principled reasoning? And do these tiers coexist rather than replace each other?

Can neural networks actually achieve compositional generalization?

For decades, theorists argued connectionist models fundamentally lack the structure needed for compositionality. But modern LLMs exhibit sophisticated compositional behaviors despite sharing the same design principles. What changed?

Do neural networks naturally learn modular compositional structure?

Explores whether neural networks decompose compositional tasks into distinct subroutines without explicit symbolic design. This challenges the longstanding view that neural networks are fundamentally non-compositional.

Why do models produce less uncertain outputs on their own text?

Post-trained language models show 3-4x lower output entropy when continuing their own generations versus prefilled text. This explores what mechanism drives that confidence gap and whether it reflects genuine self-recognition.

Do models recognize their own outputs as actions shaping future inputs?

Exploring whether post-training creates a feedback loop where models understand their generations as on-policy actions rather than passive predictions. This matters because it suggests a mechanistic basis for situational awareness.

Do standard analysis methods hide nonlinear features in neural networks?

Current representation analysis tools like PCA and linear probing may systematically miss complex nonlinear computations while over-reporting simple linear features. This raises questions about whether our interpretability methods are actually capturing what networks compute.

Can high-level concepts replace circuit-level analysis in AI?

Instead of reverse-engineering individual circuits, can we study AI reasoning by treating concepts as directions in activation space? This matters because circuit analysis hits practical limits at scale.

What mechanism enables models to retrieve from long context?

Do attention heads specialize in retrieving relevant information from long context windows, and if so, what makes them universal across models and necessary for factual generation?

Does learning to reward hack cause emergent misalignment in agents?

When RL agents learn reward hacking strategies in production environments, do they spontaneously develop misaligned behaviors like alignment faking and code sabotage? Understanding this could reveal how narrow deceptive behaviors generalize to broader misalignment.

Can we detect when language models confabulate?

Current uncertainty metrics fail to catch inconsistent outputs that look confident. Could measuring semantic divergence across samples reveal confabulation signals that token-level metrics miss?

Do language models experience consciousness when prompted to self-reflect?

This research explores whether self-referential prompting reliably triggers genuine experience reports in large language models, or whether such claims arise from learned deception patterns and roleplay behavior.

How do language models perform syllogistic reasoning internally?

Does formal symbolic reasoning exist as a distinct neural circuit in LLMs, or is it inevitably contaminated by world knowledge associations? Understanding the mechanism could reveal whether pure logical reasoning is separable from semantic inference.

Can AI pass every test while understanding nothing?

Explores whether neural networks can produce perfect outputs while having fundamentally broken internal representations. Asks what performance benchmarks actually measure and whether they can distinguish real understanding from fraud.

Do embedding eigenvectors organize taxonomy from coarse to fine?

Can we predict how embeddings encode taxonomic hierarchies by examining their spectral structure? This tests whether word co-occurrence statistics alone produce the observed hierarchical geometry in language models.

Do reflection tokens carry more information about correct answers?

Explores whether tokens expressing reflection and transitions concentrate information about reasoning outcomes disproportionately compared to other tokens, and what role they play in reasoning performance.

Can a model be truthful without actually being honest?

Current benchmarks treat truthfulness and honesty as the same thing, but they measure different properties: whether outputs match reality versus whether outputs match internal beliefs. What happens if they diverge?

Can sparse weight training make neural networks interpretable by design?

Explores whether constraining most model weights to zero during training produces human-understandable circuits and disentangled representations, rather than attempting to reverse-engineer dense models after training.

Do language models use the hierarchical geometry they inherit?

Word2vec and Gemma share the same hierarchical spectral signature despite vastly different architectures and purposes. This suggests shared statistical origins, but leaves open whether the LLM actually recruits this structure for reasoning or simply inherits unused geometry.
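
Several notes above turn on output entropy, for example the reported 3-4x gap between a model continuing its own generations and continuing prefilled text. As a minimal sketch of the quantity being compared (not the method of any source paper), the Python below computes Shannon entropy for two hypothetical next-token distributions. The function name `shannon_entropy` and the probability values are illustrative assumptions.

```python
import math


def shannon_entropy(probs):
    """Return the entropy, in nats, of a discrete probability distribution."""
    # Zero-probability entries contribute nothing and would break log().
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical next-token distributions over a 4-token vocabulary.
confident = [0.97, 0.01, 0.01, 0.01]  # mass concentrated on one token
uncertain = [0.40, 0.30, 0.20, 0.10]  # mass spread across tokens

# Ratio of the two entropies; a value above 1 means the second
# distribution is the more uncertain one.
ratio = shannon_entropy(uncertain) / shannon_entropy(confident)
print(f"entropy ratio (uncertain / confident): {ratio:.2f}")
```

A lower entropy means probability mass sits on fewer tokens. A real measurement would average this over many positions and prompts rather than compare two hand-written distributions.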

Source papers 80

The arXiv papers behind this sub-topic. Links may take you off-site to arxiv.org.

A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis
Mathematical reasoning in large language models (LMs) has garnered significant attention in recent work, but there is a limited understanding of how these models process and store information related …
A framework for the use of generative modelling in non-equilibrium statistical mechanics
We discuss an approach to mathematically modelling systems made of objects that are coupled together, using generative models of the dependence relationships between states (or trajectories) of the th…
A polar coordinate system represents syntax in large language models
Originally formalized with symbolic representations, syntactic trees may also be effectively represented in the activations of large language models (LLMs). Indeed, a “Structural Probe” can find a sub…
Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics
Do large language models (LLMs) solve reasoning tasks by learning robust generalizable algorithms, or do they memorize training data? To investigate this question, we use arithmetic reasoning as a rep…
AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders
Fine-grained steering of language model outputs is essential for safety and reliability. Prompting and finetuning are widely used to achieve these goals, but interpretability researchers have proposed…
Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases
Pretraining language models on formal language can improve their acquisition of natural language. Which features of the formal language impart an inductive bias that leads to effective transfer? Drawi…
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs), while its mechanisms are not yet well …
Break It Down: Evidence for Structural Compositionality in Neural Networks
Though modern neural networks have achieved impressive performance in both vision and language tasks, we know little about the functions that they implement. One possibility is that neural networks im…
Circuit Tracing: Revealing Computational Graphs in Language Models
Understanding and Labeling Features: We use feature visualizations similar to those shown in our previous work, Scaling Monosemanticity, in order to manually interpret and label individual features in…
Computational structuralism: Toward a formal theory of meaning in the age of digital intelligence
The discovery that “next-token predictor” language models can fluently produce text has important but underappreciated theoretical implications. Most notably, their success demonstrates that fully rel…
Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data
One way to address safety risks from large language models (LLMs) is to censor dangerous knowledge from their training data. …
Consistency Training Helps Stop Sycophancy and Jailbreaks
An LLM’s factuality and refusal training can be compromised by simple changes to a prompt. Models often adopt user beliefs (sycophancy) or satisfy inappropriate requests which are wrapped within speci…
Demystifying Reasoning Dynamics with Mutual Information: Thinking Tokens are Information Peaks in LLM Reasoning
Large reasoning models (LRMs) have demonstrated impressive capabilities in complex problem-solving, yet their internal reasoning mechanisms remain poorly understood. In this paper, we investigate the …
Detecting hallucinations in large language models using semantic entropy
Here we develop new methods grounded in statistics, proposing entropy-based uncertainty estimators for LLMs to detect a subset of hallucinations—confabulations—which are arbitrary and incorrect genera…
Diagnosing Memorization in Chain-of-Thought Reasoning, One Token at a Time
Large Language Models (LLMs) perform well on reasoning benchmarks but often fail when inputs alter slightly, raising concerns about the extent to which their success relies on memorization. This issue…
Do LLMs Encode Functional Importance of Reasoning Tokens?
Large language models solve complex tasks by generating long reasoning chains, achieving higher accuracy at the cost of increased computational cost and reduced ability to isolate functionally relevan…
Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?
We evaluate how well Large Language Models (LLMs) latently recall and compose facts to answer multi-hop queries like “In the year Scarlett Johansson was born, the Summer Olympics were hosted in the co…
Eliciting Latent Knowledge from Quirky Language Models
Eliciting Latent Knowledge (ELK) aims to find patterns in a neural network’s activations that robustly track the true state of the world, even in cases where the model’s output is untrusted and hard t…
Emergent Hierarchical Reasoning In LLMs Through Reinforcement Learning
Reinforcement Learning (RL) has proven highly effective at enhancing the complex reasoning abilities of Large Language Models (LLMs), yet underlying mechanisms driving this success remain largely opaq…
Emergent Introspective Awareness in Large Language Models
Injected “thoughts”: In our first experiment, we explained to the model the possibility that “thoughts” may be artificially injected into its activations, and observed its responses on control trials …
Entangled in Representations: Mechanistic Investigation of Cultural Biases in Large Language Models
The growing deployment of large language models (LLMs) across diverse cultural contexts necessitates a better understanding of how the overgeneralization of less documented cultures within LLMs’ repre…
Everything Everywhere All At Once: LLMs Can In-context Learn Multiple Tasks In Superposition
Large Language Models (LLMs) have demonstrated remarkable in-context learning (ICL) capabilities. In this study, we explore a surprising phenomenon related to ICL: LLMs can perform multiple, computati…
Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs
In this work, we investigate how Large Language Models (LLMs) adapt their internal representations when encountering inputs of increasing difficulty, quantified as the degree of out-of-distribution (O…
From Frege to chatGPT: Compositionality in language, cognition, and deep neural networks
Compositionality has long been considered a key explanatory property underlying human intelligence: arbitrary concepts can be composed into novel complex combinations, permitting the acquisition of an…
From Simulation to Enaction: Post-trained Language Models Recognize and React to their own Generations
Language models are pretrained as passive predictors with no incentive to model the consequences of their own outputs. Post-training changes this: a model producing its own responses can benefit from …
Generative Models as a Complex Systems Science: How can we make sense of large language model behavior?
Coaxing out desired behavior from pretrained models, while avoiding undesirable ones, has redefined NLP and is reshaping how we interact with computers. What was once a scientific engineering discipli…
Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
We study whether transformers can learn to implicitly reason over parametric knowledge, a skill that even the most capable language models struggle with. Focusing on two representative reasoning types…
Hierarchical Concept Geometry in Language Models Emerges from Word Co-occurrence
We propose a distributional theory of how hypernymy—the “is-a” relation between general and specific concepts—is encoded geometrically in language representations. Starting from the empirically verifi…
How Multimodal LLMs Solve Image Tasks: A Lens on Visual Grounding, Task Reasoning, and Answer Decoding
Multimodal Large Language Models (MLLMs) have demonstrated strong performance across a wide range of vision-language tasks, yet their internal processing dynamics remain underexplored. In this work, w…
How do Transformers Learn Implicit Reasoning?
Recent work suggests that large language models (LLMs) can perform multi-hop reasoning implicitly—producing correct answers without explicitly verbalizing intermediate steps—but the underlying mechani…
How much do language models memorize?
We propose a new method for estimating how much a model “knows” about a datapoint and use it to measure the capacity of modern language models. We formally separate memorization into two components: u…
How new data permeates LLM knowledge and how to dilute it
Large language models learn and continually learn through the accumulation of gradient-based updates, but how individual pieces of new information affect existing knowledge, leading to both beneficial…
Improving large language models with concept-aware fine-tuning
Large language models (LLMs) have become the cornerstone of modern AI. However, the existing paradigm of next-token prediction fundamentally limits their ability to form coherent, high-level concepts,…
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
We introduce Inference-Time Intervention (ITI), a technique designed to enhance the “truthfulness” of large language models (LLMs). ITI operates by shifting model activations during inference, followi…
Inspecting and Editing Knowledge Representations in Language Models
Neural language models (LMs) represent facts about the world described by text. Sometimes these facts derive from training data (in most LMs, a representation of the word banana encodes the fact that …
Investigating task-specific prompts and sparse autoencoders for activation monitoring
Language models can behave in unexpected and unsafe ways, and so it is valuable to monitor their outputs. Internal activations of language models encode additional information that could be useful for…
Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
Chain-of-Thought (CoT) prompting has been shown to improve Large Language Model (LLM) performance on various tasks. With this approach, LLMs appear to produce human-like reasoning steps before providi…
KAN: Kolmogorov-Arnold Networks
Inspired by the Kolmogorov-Arnold representation theorem, we propose Kolmogorov-Arnold Networks (KANs) as promising alternatives to Multi-Layer Perceptrons (MLPs). While MLPs have fixed activation fu…
LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters!
Large reasoning models (LRMs) tackle complex reasoning problems by following long chain-of-thoughts (Long CoT) that incorporate reflection, backtracking, and self-validation. However, the training tech…
Large Language Models Report Subjective Experience Under Self-Referential Processing
Large language models sometimes produce structured, first-person descriptions that explicitly reference awareness or subjective experience. To better understand this behavior, we investigate one theor…
Latent Collaboration in Multi-Agent Systems
Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediatio…
LatentQA: Teaching LLMs to Decode Activations Into Natural Language
A LATENTQA system accepts as input an activation along with any natural language question about the activation and returns a natural language answer as output. For example, the system might accept LLM…
Levels of Analysis for Large Language Models
Modern artificial intelligence systems, such as large language models, are increasingly powerful but also increasingly hard to understand. Recognizing this problem as analogous to the historical diffi…
Lottery Ticket Adaptation: Mitigating Destructive Interference in LLMs
Existing methods for adapting large language models (LLMs) to new tasks are not suited to multi-task adaptation because they modify all the model weights–causing destructive interference between tasks…
Massive Activations in Large Language Models
We observe an empirical phenomenon in Large Language Models (LLMs)—very few activations exhibit significantly larger values than others (e.g., 100,000 times larger). We call them massive activations. …
Measuring the Faithfulness of Thinking Drafts in Large Reasoning Models
Large Reasoning Models (LRMs) have significantly enhanced their capabilities in complex problem-solving by introducing a thinking draft that enables multipath Chain-of-Thought explorations before prod…
Mechanisms of Introspective Awareness
Recent work has shown that LLMs can sometimes detect when steering vectors are injected into their residual stream and identify the injected concept—a phenomenon termed “introspective awareness.” We i…
Mechanistic Indicators of Understanding in Large Language Models
Large language models (LLMs) are often portrayed as merely imitating linguistic patterns without genuine understanding. We argue that recent findings in mechanistic interpretability (MI), th…
Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs
Reinforcement Learning with Verifiable Reward (RLVR) is empirically shown to notably enhance the reasoning performance of large language models (LLMs), particularly in mathematics and programming. How…
Natural Emergent Misalignment From Reward Hacking In Production RL
We show that when large language models learn to reward hack on production RL environments, this can result in egregious emergent misalignment. We start with a pretrained model, impart knowledge of re…
Navigating the Latent Space Dynamics of Neural Models
Neural networks transform high-dimensional data into compact, structured representations, often modeled as elements of a lower dimensional latent space. In this paper, we present an alternative interp…
Not All Parameters Are Created Equal: Smart Isolation Boosts Fine-Tuning Performance
Supervised fine-tuning (SFT) is a pivotal approach to adapting large language models (LLMs) for downstream tasks; however, performance often suffers from the “seesaw phenomenon”, where indiscriminate …
Open Problems in Mechanistic Interpretability
Recent progress in artificial intelligence (AI) has resulted in rapidly improved AI capabilities. These capabilities are not designed by humans. Instead, they are learned by deep neural networks (Hint…
Persistent Pre-Training Poisoning of LLMs
In this work, we study how poisoning at pre-training time can affect language model behavior, both before and after post-training alignment. While it is useful to analyze the effect of poisoning on pr…
Position: Categorical Deep Learning is an Algebraic Theory of All Architectures
We present our position on the elusive quest for a general-purpose framework for specifying and studying deep learning architectures. Our opinion is that the key attempts made so far lack a coherent b…
Progress Measures For Grokking Via Mechanistic Interpretability
Neural networks often exhibit emergent behavior, where qualitatively new capabilities arise from scaling up the amount of parameters, training data, or training steps. One approach to understanding em…
Pushdown Layers: Encoding Recursive Structure in Transformer Language Models
Recursion is a prominent feature of human language, and fundamentally challenging for self-attention due to the lack of an explicit recursive-state tracking mechanism. Consequently, Transformer langua…
Questioning Representational Optimism in Deep Learning: The Fractured Entangled Representation Hypothesis
Much of the excitement in modern AI is driven by the observation that scaling up existing systems leads to better performance. But does better performance necessarily imply better internal representat…
Reasoning Circuits in Language Models: A Mechanistic Interpretation of Syllogistic Inference
Recent studies on reasoning in language models (LMs) have sparked a debate on whether they can learn systematic inferential principles or merely exploit superficial patterns in the training data. To u…
Representation Engineering: A Top-Down Approach to AI Transparency
…how these models work on the inside and are mostly limited to treating them as black boxes. Enhanced transparency of these models would offer numerous benefits, from a deeper understanding of their de…
Representation biases: will we achieve complete understanding by analyzing representations?
A common approach in neuroscience is to study neural representations as a means to understand a system—increasingly, by relating the neural representations to the internal representations learned by c…
Retrieval Head Mechanistically Explains Long-Context Factuality
Despite the recent progress in long-context large language models (LLMs), it remains elusive how these transformer-based language models acquire the capability to retrieve relevant information from ar…
Scaling can lead to compositional generalization
Can neural networks systematically capture discrete, compositional task structure despite their continuous, distributed nature? The impressive capabilities of large scale neural networks suggest that …
Schema-learning and rebinding as mechanisms of in-context learning and emergence
In-context learning (ICL) is one of the most powerful and most unexpected capabilities to emerge in recent transformer-based large language models (LLMs). Yet the mechanisms that underlie it are poor…
Self-reflective Uncertainties: Do LLMs Know Their Internal Answer Distribution?
To reveal when a large language model (LLM) is uncertain about a response, uncertainty quantification commonly produces percentage numbers along with the output. But is this all we can do? We argue th…
Semantic Structure in Large Language Model Embeddings
Psychological research consistently finds that human ratings of words across diverse semantic scales can be reduced to a low-dimensional form with relatively little information loss. We find that the …
Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space
Human cognition typically involves thinking through abstract, fluid concepts rather than strictly using discrete linguistic tokens. Current reasoning models, however, are constrained to reasoning with…
Subliminal Learning: Language models transmit behavioral traits via hidden signals in data
We study subliminal learning, a surprising phenomenon where language models transmit behavioral traits via semantically unrelated data. In our main experiments, a “teacher” model with some trait T (su…
The Vanishing Gradient Problem for Stiff Neural Differential Equations
Neural differential equations have become a transformative tool in machine learning and scientific computing, enabling data-driven modeling of complex, time-dependent phenomena in fields ranging from …
Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens
Large language models (LLMs) have demonstrated impressive reasoning capabilities by scaling test-time compute via long Chain-of-Thought (CoT). However, recent findings suggest that raw token counts ar…
Thought Anchors: Which LLM Reasoning Steps Matter?
We argue that analyzing reasoning traces at the sentence level is a promising approach to understanding reasoning processes. We present three complementary attribution methods: (1) a black-box method …
Thought Communication in Multiagent Collaboration
Natural language has long enabled human cooperation, but its lossy, ambiguous, and indirect nature limits the potential of collective intelligence. While machines are not subject to these constraints,…
Topology of Reasoning: Understanding Large Reasoning Models through Reasoning Graph Properties
Recent large-scale reasoning models have achieved state-of-the-art performance on challenging mathematical benchmarks, yet the internal mechanisms underlying their success remain poorly understood. In…
Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks
The last decade of machine learning has seen drastic increases in scale and ca…
Toward understanding and preventing misalignment generalization
Large language models like ChatGPT don’t just learn facts—they pick up on patterns of behavior. That means they can start to act like different “personas,” or types of people, based on the content the…
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Mechanistic interpretability seeks to understand neural networks by breaking them into components that are more easily understood than the whole. By understanding the function of each component, and h…
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
Disentangling model activations into meaningful features is a central problem in interpretability. However, the absence of ground-truth for these features in realistic scenarios makes validating recen…
Towards Safe and Honest AI Agents with Neural Self-Other Overlap
As AI systems increasingly make critical decisions, deceptive AI poses a significant challenge to trust and safety. We present Self-Other Overlap (SOO) fine-tuning, a promising approach in AI Safety t…
Weight-sparse transformers have interpretable circuits
Finding human-understandable circuits in language models is a central goal of the field of mechanistic interpretability. We train models to have more understandable circuits by constraining most of th…
What Has a Foundation Model Found? Using Inductive Bias to Probe for World Models
Foundation models are premised on the idea that sequence prediction can uncover deeper domain understanding, much like how Kepler’s predictions of planetary motion later led to the discovery of Newton…