Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control

Paper · arXiv 2405.08366 · Published May 14, 2024

Disentangling model activations into meaningful features is a central problem in interpretability. However, the absence of ground-truth for these features in realistic scenarios makes validating recent approaches, such as sparse dictionary learning, elusive. To address this challenge, we propose a framework for evaluating feature dictionaries in the context of specific tasks, by comparing them against supervised feature dictionaries. First, we demonstrate that supervised dictionaries achieve excellent approximation, control, and interpretability of model computations on the task. Second, we use the supervised dictionaries to develop and contextualize evaluations of unsupervised dictionaries along the same three axes. We apply this framework to the indirect object identification (IOI) task using GPT-2 Small, with sparse autoencoders (SAEs) trained on either the IOI or OpenWebText datasets. We find that these SAEs capture interpretable features for the IOI task, but they are less successful than supervised features in controlling the model.

Introduction. While large language models (LLMs) have demonstrated impressive results (Vaswani et al., 2017; Devlin et al., 2019; Radford et al., 2019; Brown et al., 2020; OpenAI, 2023), the mechanisms behind their successes and failures remain largely a mystery (Olah, 2023). A central problem in this area is how to disentangle internal model representations into meaningful concepts or features. If successful at scale, this research could provide significant scientific and practical value, enabling enhanced model robustness, controllability, interpretability, and debugging (Gandelsman et al., 2023; Nanda et al., 2023; Marks et al., 2024). A leading hypothesis for how LLMs represent and use features is the linear representation hypothesis (Mikolov et al., 2013b; Grand et al., 2018; Li et al., 2021; Abdou et al., 2021; Nanda et al., 2023). A strong version of this hypothesis posits that individual activations of a model can be decomposed into sparse linear combinations of features from a large, shared feature dictionary.
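To make the strong version of the hypothesis concrete, the following is a minimal NumPy sketch of an activation written as a sparse linear combination of dictionary features. It is illustrative only and not the paper's code: the sizes (`d_model`, `n_features`, `k_active`), the random dictionary, and the helper names `compose_activation` and `sparsity` are all assumptions made for the example.

```python
import numpy as np


def compose_activation(dictionary, coeffs):
    """Linear combination of dictionary rows (features) weighted by coeffs."""
    return coeffs @ dictionary


def sparsity(coeffs, tol=1e-8):
    """Number of features with a non-negligible coefficient."""
    return int(np.sum(np.abs(coeffs) > tol))


rng = np.random.default_rng(0)
d_model, n_features, k_active = 16, 64, 3  # hypothetical sizes

# Overcomplete dictionary: more features than dimensions, unit-norm rows.
dictionary = rng.normal(size=(n_features, d_model))
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)

# Only k_active of the n_features coefficients are nonzero.
coeffs = np.zeros(n_features)
active = rng.choice(n_features, size=k_active, replace=False)
coeffs[active] = rng.uniform(0.5, 2.0, size=k_active)

activation = compose_activation(dictionary, coeffs)

# If the active set is known, the coefficients can be read back by
# least squares restricted to the active features.
recovered, *_ = np.linalg.lstsq(dictionary[active].T, activation, rcond=None)
```

In this toy setting the active set is given, so recovery is a small least-squares problem. The hard part in practice, and what sparse dictionary learning attempts, is finding both the dictionary and the active set without supervision.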

Discussion / Conclusion. We have taken steps towards more principled and objective evaluations of the usefulness of sparse feature dictionaries for disentangling LLM activations. In particular, we have demonstrated that:

Limitations. The central conceptual limitation of our work is that our method relies on supervision in the form of a potentially subjective choice of variables used to parametrize task-relevant information in model inputs. We mitigate this to some extent by requiring this parametrization to be consistent with the internal computations of the model, as quantified by our tests for approximation, control and interpretability of model computations on the task. However, in principle there could be many parametrizations that are just as consistent, but fundamentally different (recall the discussion in Appendix A.8 and A.10). Thus, we risk making the proverbial ‘judging a fish by its ability to climb a tree’ mistake. We have mitigated this problem further by devising evaluations that are, when possible, agnostic to the precise features in a dictionary, as long as they allow us to disentangle and control our chosen variables in a sparse manner.
