Do language models understand in fundamentally different ways?

Does mechanistic evidence reveal distinct tiers of understanding in LLMs—from concept recognition to factual knowledge to principled reasoning? And do these tiers coexist rather than replace each other?

Synthesis note · 2026-04-18 · sourced from MechInterp

This paper synthesizes mechanistic interpretability findings into a philosophical framework that moves beyond the binary "does AI understand?" debate. The framework proposes three hierarchical tiers:

Tier 1: Conceptual understanding arises when a model forms "features" as directions in latent space that unify diverse manifestations of a single entity or property. This is the representational foundation: the model has learned that different surface forms connect to the same underlying concept. Mechanistic interpretability (MI) evidence: sparse autoencoder (SAE) features, linear probing, and representation geometry studies all support this.
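
As a rough illustration of what "a feature as a direction" can mean, the sketch below plants a hypothetical concept direction in synthetic activation vectors and recovers it with a difference-of-means probe. Everything in it is an assumption made for illustration: the activations are random numbers rather than real model states, the names (make_activations, has_concept, true_dir) are invented here, and difference-of-means is a simplified stand-in for the linear probes used in the MI literature.

```python
import random

def mean_vec(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def make_activations(rng, n, dim, direction, present, noise=0.5):
    """Synthetic 'activations': Gaussian noise, plus the concept
    direction when the concept is present (an assumed toy model)."""
    rows = []
    for _ in range(n):
        row = [rng.gauss(0, noise) for _ in range(dim)]
        if present:
            row = [x + d for x, d in zip(row, direction)]
        rows.append(row)
    return rows

rng = random.Random(0)
dim = 16
# Hypothetical planted concept direction (first four coordinates only).
true_dir = [2.0 if i < 4 else 0.0 for i in range(dim)]
pos = make_activations(rng, 200, dim, true_dir, present=True)
neg = make_activations(rng, 200, dim, true_dir, present=False)

# Difference-of-means probe: the estimated feature direction.
mu_pos, mu_neg = mean_vec(pos), mean_vec(neg)
probe = [p - n for p, n in zip(mu_pos, mu_neg)]
# Decision threshold halfway between the two projected class means.
threshold = (dot(probe, mu_pos) + dot(probe, mu_neg)) / 2

def has_concept(activation):
    """Read the concept off a single activation by projecting onto the probe."""
    return dot(probe, activation) > threshold

correct = sum(has_concept(a) for a in pos) + sum(not has_concept(a) for a in neg)
accuracy = correct / (len(pos) + len(neg))
print(f"probe accuracy on the synthetic data: {accuracy:.3f}")
```

The point of the toy is only that a single direction suffices to read the concept off many differently noisy vectors, which is the sense in which a feature "unifies diverse manifestations". It says nothing about how cleanly real models separate concepts.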

Tier 2: State-of-the-world understanding — arises when the model learns contingent factual connections between features and dynamically tracks changes. "Michael Jordan is a basketball player" is not just a high-probability string but a reflection of an internal model linking the Michael Jordan concept to the basketball player concept. This goes beyond association to structured knowledge representation.

Tier 3: Principled understanding — arises when the model discovers compact "circuits" that connect facts via general rules rather than memorizing each fact individually. This is the shift from knowing that to knowing why. The grokking literature provides the clearest evidence: models that transition from memorization to generalization develop circuits implementing actual algorithmic rules (e.g., modular addition via Fourier transforms).
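
To make the modular addition example concrete, here is a minimal sketch of the general trigonometric scheme described in the grokking literature: represent each operand by cosines and sines at a few frequencies, combine them with the angle-addition identities, and score each candidate answer by how well it matches. The function name fourier_mod_add, the modulus p=113, and the frequency set are illustrative assumptions, and this is a hand-written idealization, not a circuit extracted from any model.

```python
import math

def fourier_mod_add(a, b, p=113, freqs=(1, 2, 5)):
    """Compute (a + b) mod p from trig features of a and b.

    p and freqs are illustrative; each frequency must be nonzero mod p
    (with p prime) for the maximum to be unique.
    """
    logits = [0.0] * p
    for k in freqs:
        w = 2 * math.pi * k / p
        # Per-operand features at this frequency.
        ca, sa = math.cos(w * a), math.sin(w * a)
        cb, sb = math.cos(w * b), math.sin(w * b)
        # Angle-addition identities give cos/sin of w*(a+b)
        # without ever forming the integer sum a+b.
        c_sum = ca * cb - sa * sb
        s_sum = sa * cb + ca * sb
        for c in range(p):
            # This term equals cos(w*(a+b-c)), which reaches 1
            # exactly when c is congruent to a+b modulo p.
            logits[c] += c_sum * math.cos(w * c) + s_sum * math.sin(w * c)
    # The candidate with the largest summed score is the answer.
    return max(range(p), key=logits.__getitem__)

print(fourier_mod_add(100, 50))  # compare with (100 + 50) % 113
```

One rule covers every input pair, so nothing is stored per fact. That is the contrast with memorization that the tier 3 description draws.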

The critical insight is that higher-tier mechanisms coexist with lower-tier heuristics rather than replacing them. A model can have principled understanding of arithmetic in one circuit while relying on pattern-matching heuristics in another. This heterogeneity means understanding is not a single binary property but a patchwork: principled in some domains, merely conceptual in others, and purely heuristic in yet others.

This has direct implications for trust and deployment. The fact that a model demonstrates principled understanding in one domain gives no guarantee that it operates at the same tier in adjacent domains. The coexistence of understanding tiers also explains why models can be simultaneously impressive and brittle: the principled circuits work reliably, but the heuristic patches fail unpredictably.

Related concepts in this collection 4

What happens inside models when they suddenly generalize? Grokking appears as an abrupt shift from memorization to generalization. But is the underlying process truly discontinuous, or does mechanistic analysis reveal continuous phases we can measure and predict?
grokking is the mechanistic signature of the transition from tier 2 (state-of-world, memorized facts) to tier 3 (principled, circuit-based understanding)
Can LLMs understand concepts they cannot apply? Explores whether large language models can correctly explain ideas while simultaneously failing to use them—and whether that combination reveals something fundamentally different from ordinary mistakes.
Potemkin understanding maps to cases where the model has tier-1 conceptual understanding (can explain) but lacks tier-3 principled understanding (cannot apply)
Can AI pass every test while understanding nothing? Explores whether neural networks can produce perfect outputs while having fundamentally broken internal representations. Asks what performance benchmarks actually measure and whether they can distinguish real understanding from fraud.
FER/imposter intelligence is a case where performance metrics cannot distinguish between tiers of understanding
Can a model be truthful without actually being honest? Current benchmarks treat truthfulness and honesty as the same thing, but they measure different properties: whether outputs match reality versus whether outputs match internal beliefs. What happens if they diverge?
the three-tier framework clarifies why: honesty requires tier-2 state-of-world understanding (tracking what the model itself believes), while truthfulness only requires that outputs match facts regardless of internal tier

Original note title

mechanistic interpretability evidence supports three hierarchical varieties of LLM understanding — conceptual then state-of-world then principled — each tied to a distinct computational organization
