Do language models and multimodal models show similar attractor-based interpretability?
This explores whether "attractor-based" interpretability, the idea that a model's internals settle into stable, readable structures (geometries, flows, basins) that can be decoded, looks the same across text-only and multimodal models. Worth saying plainly up front: the corpus has rich material on how language-model internals are structured, but it carries no multimodal interpretability work, so the direct comparison you're asking for can't be drawn from this collection. What it can do is show you, in detail, what the language-model half of that comparison looks like, which turns out to be more dynamical and structured than the word "attractor" might lead you to expect.
On the language side, several notes converge on a picture of internals that are organized rather than arbitrary. One line of work finds that transformer residual streams carry knowledge as continuous *flow* rather than fixed storage — closer to oral performance than to a database, which is why model knowledge is contextual and hard to edit (Do transformer models store knowledge or generate it continuously?). Another finds that activations *sparsify* in a localized, systematic way as tasks get harder or stranger — an adaptive filter that stabilizes behavior under distribution shift rather than a breakdown (Do language models sparsify their activations under difficult tasks?). Both are closer to the dynamical-systems intuition behind "attractors" — states the model settles into — than to static feature-spotting.
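To make "activations sparsify" concrete, here is a minimal sketch of two common ways to score how sparse a hidden-state vector is. It is an illustration under stated assumptions, not the measurement used in the cited note: the function names, the `eps` threshold, and the toy vectors are all hypothetical, and Hoyer's measure is swapped in as one standard sparsity score.

```python
import math

def near_zero_fraction(hidden, eps=1e-3):
    """Fraction of entries with magnitude below eps (eps is an assumed threshold)."""
    return sum(1 for x in hidden if abs(x) < eps) / len(hidden)

def hoyer_sparsity(hidden):
    """Hoyer's measure: 0.0 when all magnitudes are equal, 1.0 for a one-hot vector."""
    n = len(hidden)
    l1 = sum(abs(x) for x in hidden)
    l2 = math.sqrt(sum(x * x for x in hidden))
    if n < 2 or l2 == 0.0:
        return 0.0  # undefined cases are reported as "not sparse" by convention here
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)

# Made-up activations, for illustration only (not real model data).
easy_task_state = [0.8, -0.5, 0.6, 0.7]  # mass spread over many units
hard_task_state = [0.0, 0.0, 1.9, 0.0]   # mass concentrated in one unit

for name, state in [("easy", easy_task_state), ("hard", hard_task_state)]:
    print(name, near_zero_fraction(state), round(hoyer_sparsity(state), 3))
```

Tracking a score like this per layer and per prompt is one way to see whether sparsity rises with task difficulty. Whether it does in a given model is an empirical question that this sketch does not settle.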
The more classically interpretable findings add geometry and hierarchy. A polar-coordinate probe shows syntax encoded through both distance and angle between embeddings, meaning networks spontaneously grow structured, almost symbolic geometry (How do language models encode syntactic relations geometrically?). And mechanistic work argues understanding comes in tiers — features as directions, factual connections, and compact circuits — with higher tiers layered on top of, not replacing, lower-tier heuristics, producing a patchwork rather than one clean mechanism (Do language models understand in fundamentally different ways?). That patchwork matters for your question: even within one modality, interpretability isn't a single phenomenon, so "similar across modalities" was never going to be a yes/no.
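A toy sketch can show why reading both distance and angle carries more information than distance alone. This is not the Polar Probe itself: the real probe works on learned projections of model embeddings, and the source does not specify how it defines its angle. The 2D vectors, the `polar_offset` helper, and the choice of measuring the angle of the offset against the x-axis are all assumptions made for illustration.

```python
import math

def polar_offset(head, dependent):
    """Return (distance, angle in radians) of the offset from head to dependent.

    Toy 2D version: the angle is measured against the x-axis with atan2.
    """
    dx = dependent[0] - head[0]
    dy = dependent[1] - head[1]
    return math.hypot(dx, dy), math.atan2(dy, dx)

# Hypothetical 2D "embeddings" for one head word and two dependents.
head = (0.0, 0.0)
dependent_a = (1.0, 0.0)
dependent_b = (0.0, 1.0)

dist_a, angle_a = polar_offset(head, dependent_a)
dist_b, angle_b = polar_offset(head, dependent_b)

# Both dependents sit at the same distance from the head, so a distance-only
# readout cannot tell the two relations apart; the angle can.
print(dist_a == dist_b)    # same distance
print(angle_a != angle_b)  # different angle
```

In this picture, distance could signal that a relation exists while angle could signal which kind it is. That is the intuition behind the claim that the geometry is structured rather than arbitrary.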
Where the corpus touches multimodality at all, it does so from the opposite direction — not interpreting multimodal models, but arguing why text alone is limiting. The Plato's-cave framing holds that text strips physics, geometry, and causality, leaving models manipulating ungrounded symbols (Are text-only language models fundamentally limited by abstraction?), and Bender & Koller argue meaning requires linking expressions to communicative intent that form-only training can't reach (Can language models learn meaning from text patterns alone?). These imply multimodal grounding would change *what* gets represented — but they say nothing about whether the internal attractor-like structure would look the same.
The thing you might not have expected to learn: the interesting open question isn't really "text vs. multimodal" symmetry. It's that the language-model internals already split between two interpretability stories, a geometric/circuit one and a flow/dynamics one (How do language models encode syntactic relations geometrically?; Do transformer models store knowledge or generate it continuously?). If you want to chase the multimodal comparison, that's the gap to fill in the collection; if you want to go deeper now, those two notes plus the tiered-understanding one are the doorways.
Sources (6 notes)
Transformers organize knowledge as flowing activations rather than retrievable archives, mirroring oral cultures where knowledge exists only in performance. This explains why model knowledge is contextual, difficult to edit, and inseparable from generation.
As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under out-of-distribution (OOD) shift rather than a failure mode.
The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.
Mechanistic interpretability reveals conceptual understanding (features as directions), state-of-world understanding (factual connections), and principled understanding (compact circuits). Crucially, higher tiers coexist with lower-tier heuristics rather than replacing them, creating a patchwork of capabilities.
Text strips the physics, geometry, and causality present in reality, forcing language models to manipulate symbols without grounding in their source dynamics. This creates predictable failure modes in physical, geometric, and causal reasoning that multimodal training could address.
Bender & Koller argue that meaning requires the relation between expressions and communicative intents. Since LLMs are trained only on form-to-form prediction with no access to shared attention or intent, they cannot reconstruct the meaning that grounds language.