Can natural language explanations redefine what interpretability means?
Does the ability of LLMs to explain patterns in natural language fundamentally expand the scope and complexity of what humans can understand about AI systems, compared to traditional interpretability methods?
Interpretable ML grew up around inherently interpretable models (sparse linear models, GAMs, decision trees) and post-hoc techniques (feature importance, visualization, distillation). This position paper argues that LLMs change that picture: their capacity to explain in natural language expands the scale and complexity of patterns that can be conveyed to a human, so interpretability can be redefined with a far more ambitious scope, including using LLMs to audit LLMs themselves. The price is new failure modes: hallucinated explanations and immense computational cost. The two priorities it names, using LLMs to directly analyze new datasets (knowledge discovery) and to generate interactive explanations, reframe interpretability from "inspect the model" toward "converse about the model and the data."
The keeper is the reframing of the medium: explanation in natural language is not just a nicer output format but a capacity expansion — it lets humans receive more complex patterns than feature attributions or saliency maps can carry. The catch is that this is exactly where faithfulness risk concentrates.
This sits at the optimistic pole of the vault's explanation thread, and it must be read against the cautionary results: "Can LLM explanations actually help humans predict model behavior?" shows that NL explanations can feel trustworthy without being predictive, and "Can cognitive science methods unlock how LLMs actually work?" supplies the structure this ambition needs to avoid collapsing into plausible-but-unfaithful narration.
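One way to make "trustworthy without being predictive" concrete is a simulatability check: ask people to predict the model's output on related inputs, once with the explanation and once without, and compare how often they are right. The sketch below is only an illustration under that assumption. The `Trial` record, the function names, and the toy labels are all hypothetical and are not taken from the paper or from the related notes.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Trial:
    """One human prediction task on a single input (hypothetical schema)."""
    model_output: str        # what the model actually produced
    guess_with_expl: str     # human's prediction after reading the explanation
    guess_without_expl: str  # human's prediction with no explanation shown


def simulation_accuracy(actual: Sequence[str], guesses: Sequence[str]) -> float:
    """Fraction of inputs where the human correctly predicted the model."""
    if len(actual) != len(guesses):
        raise ValueError("actual and guesses must have the same length")
    if not actual:
        raise ValueError("at least one trial is required")
    return sum(a == g for a, g in zip(actual, guesses)) / len(actual)


def simulatability_gain(trials: Sequence[Trial]) -> float:
    """Accuracy with explanations minus accuracy without them.

    A gain near zero (or negative) means the explanation did not help
    people predict the model, however convincing it may have read.
    """
    actual = [t.model_output for t in trials]
    with_expl = simulation_accuracy(actual, [t.guess_with_expl for t in trials])
    without_expl = simulation_accuracy(actual, [t.guess_without_expl for t in trials])
    return with_expl - without_expl


# Toy, made-up trials purely to exercise the functions (not experimental data).
toy_trials = [
    Trial("positive", "positive", "positive"),
    Trial("negative", "positive", "negative"),
    Trial("positive", "positive", "negative"),
    Trial("negative", "positive", "positive"),
]

gain = simulatability_gain(toy_trials)
```

In this toy set the explanation nudges every guess toward "positive", so accuracy with and without it comes out equal and the gain is zero. That is the pattern the cautionary note warns about: an explanation that shapes what people expect without making their expectations more accurate.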
Related concepts in this collection 3
- Can LLM explanations actually help humans predict model behavior?
  Do model explanations enable users to accurately simulate how the model will behave on related inputs? This matters because it determines whether explanations genuinely improve human understanding or just create an illusion of understanding.
  Relation to this note: the cautionary counterweight, since NL explanations can be plausible yet unfaithful.
- Can cognitive science methods unlock how LLMs actually work?
  Does Marr's three-level framework, developed to understand biological minds, offer interpretability researchers the structured methodology they need to decode opaque language models?
  Relation to this note: the structure this ambitious NL-interpretability scope needs.
- Can dictionary learning scale to production language models?
  Sparse autoencoders recovered interpretable features from toy models, but scaling to real production systems like Claude remains uncertain. This matters because interpretability at scale is foundational for AI safety work.
  Relation to this note: the mechanistic-feature route to interpretability, complementary to the NL-explanation route.
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Rethinking Interpretability in the Era of Large Language Models
- Rethinking Large Language Models in Mental Health Applications
- Explain-Query-Test: Self-Evaluating LLMs Via Explanation and Comprehension Discrepancy
- Word Meanings in Transformer Language Models
- A Primer on the Inner Workings of Transformer-based Language Models
- LLMCheckup: Conversational Examination of Large Language Models via Interpretability Tools
- Comprehension Without Competence: Architectural Limits of LLMs in Symbolic Computation and Reasoning
- Computational structuralism: Toward a formal theory of meaning in the age of digital intelligence
Original note title
LLMs redefine interpretability because natural-language explanation expands the scale of patterns communicable to humans and lets models audit models