Do language models understand in fundamentally different ways?
Does mechanistic evidence reveal distinct tiers of understanding in LLMs—from concept recognition to factual knowledge to principled reasoning? And do these tiers coexist rather than replace each other?
This paper synthesizes mechanistic interpretability findings into a philosophical framework that moves beyond the binary "does AI understand?" debate. The framework proposes three hierarchical tiers:
Tier 1: Conceptual understanding arises when a model forms "features", directions in latent space that unify diverse manifestations of a single entity or property. This is the representational foundation: the model has learned that different surface forms connect to the same underlying concept. Mechanistic interpretability (MI) evidence for this tier comes from sparse autoencoder (SAE) features, linear probing, and studies of representation geometry.
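To make "a feature is a direction" concrete, here is a minimal sketch of a linear probe on synthetic data. Everything in it is an illustrative assumption rather than something from the paper: the "activations" are Gaussian noise, the concept is planted along a made-up `true_direction`, and the probe is a simple difference-of-means classifier rather than any particular published probing method.

```python
import random

def make_activations(n, dim, direction, present, rng, strength=4.0):
    """Synthetic 'activations': unit Gaussian noise, shifted along
    `direction` by `strength` when the concept is present."""
    rows = []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        if present:
            row = [x + strength * d for x, d in zip(row, direction)]
        rows.append(row)
    return rows

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fit_direction_probe(pos, neg):
    """Difference-of-means probe: w = mean(pos) - mean(neg),
    with the decision boundary at the midpoint of the two means."""
    mu_pos, mu_neg = mean_vector(pos), mean_vector(neg)
    w = [p - q for p, q in zip(mu_pos, mu_neg)]
    midpoint = [(p + q) / 2 for p, q in zip(mu_pos, mu_neg)]
    return w, -dot(w, midpoint)

def probe_predict(w, b, x):
    """True if the probe reads the concept as present in activation x."""
    return dot(w, x) + b > 0

rng = random.Random(0)
dim = 16
# Hypothetical ground truth: the concept lives along the first axis.
true_direction = [1.0 if i == 0 else 0.0 for i in range(dim)]

train_pos = make_activations(200, dim, true_direction, True, rng)
train_neg = make_activations(200, dim, true_direction, False, rng)
w, b = fit_direction_probe(train_pos, train_neg)

test_pos = make_activations(200, dim, true_direction, True, rng)
test_neg = make_activations(200, dim, true_direction, False, rng)
correct = sum(probe_predict(w, b, x) for x in test_pos)
correct += sum(not probe_predict(w, b, x) for x in test_neg)
accuracy = correct / (len(test_pos) + len(test_neg))
print(f"held-out probe accuracy: {accuracy:.3f}")
```

If a single direction separates held-out examples this well, the concept is linearly readable from the representation. In a real study the rows would be residual-stream or MLP activations, and a high probe accuracy would still not show that the model causally uses that direction.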
Tier 2: State-of-the-world understanding arises when the model learns contingent factual connections between features and dynamically tracks changes. On this view, "Michael Jordan is a basketball player" is not just a high-probability string; it reflects an internal model linking the Michael Jordan concept to the basketball player concept. This goes beyond association to structured knowledge representation.
Tier 3: Principled understanding arises when the model discovers compact "circuits" that connect facts via general rules rather than memorizing each fact individually. This is the shift from knowing that to knowing why. The grokking literature provides the clearest evidence: models that transition from memorization to generalization develop circuits that implement genuine algorithmic rules (e.g., modular addition computed through Fourier features and trigonometric identities).
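The modular-addition example can be sketched by hand. The code below is not extracted from any trained model; it is an illustrative reconstruction of the kind of rule the grokking literature describes, where `a` and `b` are represented by sines and cosines at a few frequencies, angle-addition identities combine them, and the answer is the candidate `c` whose waves line up with `a + b`. The function name, the modulus, and the frequencies are all hypothetical choices.

```python
import math

def fourier_mod_add(a, b, p, freqs):
    """Return (a + b) mod p using only trigonometric features of a and b.

    For each frequency k, cos(w(a+b)) and sin(w(a+b)) are built from the
    features of a and b alone via angle-addition identities. Each candidate
    c is then scored by sum_k cos(w_k (a + b - c)), which reaches its
    maximum only when c == (a + b) mod p (for prime p and 0 < k < p).
    """
    best_c, best_score = 0, -math.inf
    for c in range(p):
        score = 0.0
        for k in freqs:
            w = 2 * math.pi * k / p
            cos_a, sin_a = math.cos(w * a), math.sin(w * a)
            cos_b, sin_b = math.cos(w * b), math.sin(w * b)
            # Angle-addition identities: features of (a + b).
            cos_ab = cos_a * cos_b - sin_a * sin_b
            sin_ab = sin_a * cos_b + cos_a * sin_b
            # cos(w(a + b - c)), again by angle subtraction.
            score += cos_ab * math.cos(w * c) + sin_ab * math.sin(w * c)
        if score > best_score:
            best_c, best_score = c, score
    return best_c

P = 113               # hypothetical prime modulus
FREQS = (3, 14, 35)   # hypothetical frequencies, chosen arbitrarily

result = fourier_mod_add(100, 50, P, FREQS)
print(result, (100 + 50) % P)
```

The point of the sketch is the contrast with memorization: a lookup table for this task needs `P * P` entries, while the rule above is a handful of frequencies plus two identities and works for every input pair.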
The critical insight is that higher-tier mechanisms coexist with lower-tier heuristics rather than replacing them. A model can have principled understanding of arithmetic in one circuit while relying on pattern-matching heuristics in another. This heterogeneity means understanding is not a single binary property but a patchwork: principled in some domains, merely conceptual in others, and purely heuristic in yet others.
This has direct implications for trust and deployment. The fact that a model demonstrates principled understanding in one domain gives no guarantee that it operates at the same tier in adjacent domains. The coexistence of understanding tiers also explains why models can be simultaneously impressive and brittle: the principled circuits work reliably, but the heuristic patches fail unpredictably.
Inquiring lines that use this note as a source (69)
This note is a source for these synthesized inquiries.
- How does mechanistic interpretability reveal ideological structures in language model weights?
- Why do LLM explanations feel authoritative even when alignment with the model fails?
- Why do users attribute consciousness to language models in practice?
- How does the outer loop escape its own LLM's knowledge boundaries when discovering mechanisms?
- How much do mechanistic interpretability findings reflect true reasoning architecture?
- How does semantic grounding differ between human minds and language models?
- Where do humans and language models actually diverge in reasoning ability?
- Should LLM reasoning be studied as latent state trajectories rather than surface text?
- What types of introspective awareness can emerge in LLMs?
- Can LLMs explain concepts correctly while failing to use them?
- Why does semantic decoupling specifically break LLM reasoning abilities?
- How does implicit meaning processing limit LLM pragmatic reasoning?
- What distinguishes genuine understanding from correct output without coherent principles?
- Why does entity recognition act as a self-knowledge mechanism in LLMs?
- Is relevant knowledge encoded in LMs but not causally active in generation?
- Does this optimism bias contribute to the knowing-doing gap in LLM decision-making?
- How do we distinguish knowledge encoding from knowledge usage in models?
- Can LLMs infer implicit meaning without surface linguistic markers?
- What distinguishes surface cues from structural meaning in language understanding?
- Can complexity-stratified testing reveal whether LLMs understand grammatical structure?
- Does engaging with political content indicate deeper model understanding than refusing?
- Does social grounding differ fundamentally from causal grounding in LLM behavior?
- Can understanding language happen entirely within a language system alone?
- Where do LLMs fail as knowledge systems compared to humans?
- What internal mechanisms explain LLM reasoning and representation limits?
- Do LLMs understand implicit warrants in reasoning chains?
- What cognitive abilities distinguish metalinguistic analysis from language use?
- Why do LLMs explain evidence accurately while missing its implications?
- Do LLMs rely on surface statistical patterns instead of causal structure?
- How much does question framing affect LLM accuracy on knowledge tasks?
- How does structural depth in sentences predict LLM annotation accuracy?
- Which knowledge types do LLMs handle better than humans in reasoning tasks?
- Why does LLM compression eliminate causal grounding in conceptual representations?
- What distinguishes surface generalizations from true linguistic generalizations?
- Why are truthfulness and honesty mechanistically separate in language models?
- What distinguishes conceptual understanding from statistical pattern matching in models?
- Why do benchmark tests fail to detect LLM comprehension gaps?
- Can mechanistic interpretability explain explanation-execution disconnection?
- Do LLMs learn surface patterns instead of genuine linguistic structure?
- How does structural complexity in sentences degrade LLM reasoning systematically?
- Can models distinguish between activated knowledge and genuine reasoning?
- Can language models distinguish between novel insight and unjustified conceptual blending?
- How does an instruction-following LLM activate latent retrieval knowledge?
- Can linear probing detect all the concepts a language model actually uses?
- How do retrieval heads interact with layer-level separation of knowledge and reasoning?
- What role does a model's representational structure play in learning?
- Do LLMs reason about politics differently than other domains?
- What distinguishes real understanding from superficial pattern matching?
- Is the distinction between pretense and realization meaningful for LLMs?
- Can LLMs reason through semantics without understanding causal mechanisms?
- What implicit knowledge about catalogs do LLMs learn from ranking signals alone?
- How do knowing and doing diverge in LLM decision-making?
- Can attractor dynamics compete with input-based probing for characterizing model knowledge?
- Do language models and multimodal models show similar attractor-based interpretability?
- How does mechanistic interpretability complement learning mechanics in explaining deep learning?
- How does the knowing-doing gap relate to Potemkin understanding?
- What semantic information is necessary to preserve for sound LLM reasoning?
- Can mechanistic interpretability tools decode the biases alignment training conceals?
- Does the base model already contain latent reasoning capability?
- Why do LLMs rely on content knowledge instead of collaborative signals?
- How faithful are natural language explanations from LLMs really?
- What structural framework prevents LLM explanations from becoming just plausible fiction?
- How do mechanistic features compare to natural language for interpretability?
- Do different game types reveal different strategic reasoning capabilities in LLMs?
- How do mechanistic interpretability tools help distinguish truthfulness from honesty?
- How do LLM explanations diverge from actual internal reasoning?
- How should we rethink the symbolism versus connectionism debate in light of LLMs?
- Why do LLMs reason fluently about causality but lack causal rigor?
- What capability boundary exists in LLM prediction of effect sizes?
Related concepts in this collection (4)
- What happens inside models when they suddenly generalize?
Grokking appears as an abrupt shift from memorization to generalization. But is the underlying process truly discontinuous, or does mechanistic analysis reveal continuous phases we can measure and predict?
Grokking is the mechanistic signature of the transition from tier 2 (state-of-world, memorized facts) to tier 3 (principled, circuit-based understanding).
- Can LLMs understand concepts they cannot apply?
Explores whether large language models can correctly explain ideas while simultaneously failing to use them, and whether that combination reveals something fundamentally different from ordinary mistakes.
Potemkin understanding maps to cases where the model has tier-1 conceptual understanding (can explain) but lacks tier-3 principled understanding (cannot apply).
- Can AI pass every test while understanding nothing?
Explores whether neural networks can produce perfect outputs while having fundamentally broken internal representations. Asks what performance benchmarks actually measure and whether they can distinguish real understanding from fraud.
Fractured entangled representation (FER), or imposter intelligence, is a case where performance metrics cannot distinguish between tiers of understanding.
- Can a model be truthful without actually being honest?
Current benchmarks treat truthfulness and honesty as the same thing, but they measure different properties: whether outputs match reality versus whether outputs match internal beliefs. What happens if they diverge?
The three-tier framework clarifies why: honesty requires tier-2 state-of-world understanding (tracking what the model itself believes), while truthfulness only requires that outputs match facts regardless of internal tier.
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Mechanistic Indicators of Understanding in Large Language Models
- Computational structuralism: Toward a formal theory of meaning in the age of digital intelligence
- Word Meanings in Transformer Language Models
- A Primer on the Inner Workings of Transformer-based Language Models
- Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs
- Explain-Query-Test: Self-Evaluating LLMs Via Explanation and Comprehension Discrepancy
- Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey
- Probing Structured Semantics Understanding and Generation of Language Models via Question Answering
Original note title
mechanistic interpretability evidence supports three hierarchical varieties of LLM understanding — conceptual then state-of-world then principled — each tied to a distinct computational organization