Can natural language explanations redefine what interpretability means?

Does the ability of LLMs to explain patterns in natural language fundamentally expand the scope and complexity of what humans can understand about AI systems, compared to traditional interpretability methods?

Synthesis note · 2026-06-03 · sourced from Evaluations

Interpretable ML grew up around inherently interpretable models (sparse linear models, GAMs, decision trees) and post-hoc techniques (feature importance, visualization, distillation). This position paper argues that LLMs change the picture: their capacity to explain in natural language expands the scale and complexity of patterns that can be conveyed to a human, so interpretability can be redefined with a far more ambitious scope, including using LLMs to audit LLMs themselves. The cost is a new set of problems: hallucinated explanations and immense compute demands. The two priorities it names, using LLMs to directly analyze new datasets (knowledge discovery) and to generate interactive explanations, reframe interpretability from "inspect the model" toward "converse about the model and the data."

The key takeaway is the reframing of the medium: explanation in natural language is not just a nicer output format but a capacity expansion, letting humans receive more complex patterns than feature attributions or saliency maps can carry. The catch is that this is exactly where faithfulness risk concentrates: the richer the explanation, the more room it has to sound plausible without reflecting what the model actually does.

This sits at the optimistic pole of the vault's explanation thread, and it must be read against the cautionary results. "Can LLM explanations actually help humans predict model behavior?" shows that NL explanations can feel trustworthy without being predictive, and "Can cognitive science methods unlock how LLMs actually work?" supplies the structure this ambition needs to avoid collapsing into plausible-but-unfaithful narration.

Inquiring lines that use this note as a source 1

How do mechanistic features compare to natural language for interpretability?

Related concepts in this collection 3

Can LLM explanations actually help humans predict model behavior?
Do model explanations enable users to accurately simulate how the model will behave on related inputs? This matters because it determines whether explanations genuinely improve human understanding or just create an illusion of understanding.
Relation: the cautionary counterweight. NL explanations can be plausible yet unfaithful.

Can cognitive science methods unlock how LLMs actually work?
Does Marr's three-level framework, developed to understand biological minds, offer interpretability researchers the structured methodology they need to decode opaque language models?
Relation: the structure this ambitious NL-interpretability scope needs.

Can dictionary learning scale to production language models?
Sparse autoencoders recovered interpretable features from toy models, but scaling to real production systems like Claude remains uncertain. This matters because interpretability at scale is foundational for AI safety work.
Relation: the mechanistic-feature route to interpretability, complementary to the NL-explanation route.

Original note title

LLMs redefine interpretability because natural-language explanation expands the scale of patterns communicable to humans and lets models audit models
