Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Eight months ago, we demonstrated that sparse autoencoders could recover monosemantic features from a small one-layer transformer. At the time, a major concern was that this method might not scale feasibly to state- of-the-art transformers and, as a result, be unable to practically contribute to AI safety. Since then, scaling sparse autoencoders has been a major priority of the Anthropic interpretability team, and we're pleased to report extracting high-quality features from Claude 3 Sonnet, Anthropic's medium-sized production model. We find a diversity of highly abstract features. They both respond to and behaviorally cause abstract behaviors. Examples of features we find include features for famous people, features for countries and cities, and features tracking type signatures in code. Many features are multilingual (responding to the same concept across languages) and multimodal (responding to the same concept in both text and images), as well as encompassing both abstract and concrete instantiations of the same idea (such as code with security vulnerabilities, and abstract discussion of security vulnerabilities).
Introduction. Scaling Dictionary Learning to Claude 3 Sonnet Our general approach to understanding Claude 3 Sonnet is based on the linear representation hypothesis (see e.g. ) and the superposition hypothesis (see e.g. ). For an introduction to these ideas, we refer readers to the Background and Motivation section of Toy Models . At a high level, the linear representation hypothesis suggests that neural networks represent meaningful concepts – referred to as features – as directions in their activation spaces. The superposition hypothesis accepts the idea of linear representations and further hypothesizes that neural networks use the existence of almost-orthogonal directions in high-dimensional spaces to represent more features than there are dimensions. If one believes these hypotheses, the natural approach is to use a standard method called dictionary learning . Recently, several papers have suggested that this can be quite effective for transformer language models . In particular, a specific approximation of dictionary learning called a sparse autoencoder appears to be very effective .
Discussion / Conclusion. GENERALIZATION AND SAFETY LIMITATIONS, CHALLENGES, AND OPEN PROBLEMS