Do language models and multimodal models show similar attractor-based interpretability?

This reads as a comparison question: do the structured, geometry- or dynamics-based ways of reading language-model internals also show up in multimodal models? The honest answer up front is that this collection has rich material on the language-model side but no direct work on multimodal interpretability, so the comparison itself can't be drawn from here.

"Attractor-based" interpretability is the idea that a model's internals settle into stable, readable structures (geometries, flows, basins) that you can decode. The question is whether that looks the same across text-only and multimodal models. Worth saying plainly: the corpus has a lot on how language-model internals are structured, but it carries no multimodal interpretability work, so it cannot answer the direct comparison you're asking for. What it can do is show you, in detail, what the language-model half of that comparison actually looks like, which turns out to be more dynamical and structured than the word "attractor" might lead you to expect.

On the language side, several notes converge on a picture of internals that are organized rather than arbitrary. One line of work finds that transformer residual streams carry knowledge as continuous *flow* rather than fixed storage — closer to oral performance than to a database, which is why model knowledge is contextual and hard to edit (Do transformer models store knowledge or generate it continuously?). Another finds that activations *sparsify* in a localized, systematic way as tasks get harder or stranger — an adaptive filter that stabilizes behavior under distribution shift rather than a breakdown (Do language models sparsify their activations under difficult tasks?). Both are closer to the dynamical-systems intuition behind "attractors" — states the model settles into — than to static feature-spotting.
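One way to make "activations sparsify" concrete is to score a hidden-state vector with a sparsity measure. The sketch below uses the Hoyer measure, which is an assumption chosen for illustration: the note summarized above does not say which metric it relies on. The function name and the two vectors are hypothetical toy values, not real model activations.

```python
import math


def hoyer_sparsity(activations):
    """Hoyer sparsity in [0, 1]: 0 for a uniform vector, 1 for a one-hot vector."""
    n = len(activations)
    if n < 2:
        raise ValueError("need at least two activations")
    l1 = sum(abs(a) for a in activations)
    l2 = math.sqrt(sum(a * a for a in activations))
    if l2 == 0.0:
        raise ValueError("sparsity is undefined for an all-zero vector")
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)


# Hypothetical hidden states: one spread across all units, one concentrated in a single unit.
dense_state = [0.5, -0.4, 0.6, 0.5]
sparse_state = [0.0, 0.0, 0.9, 0.0]

print(hoyer_sparsity(dense_state))
print(hoyer_sparsity(sparse_state))
```

In this toy setup the concentrated vector scores higher than the spread-out one. Applied to real hidden states, the same kind of score could be compared across easy and hard inputs, though the threshold for "substantially sparser" would depend on the model and layer.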

The more classically interpretable findings add geometry and hierarchy. A polar-coordinate probe shows syntax encoded through both distance and angle between embeddings, meaning networks spontaneously grow structured, almost symbolic geometry (How do language models encode syntactic relations geometrically?). And mechanistic work argues understanding comes in tiers — features as directions, factual connections, and compact circuits — with higher tiers layered on top of, not replacing, lower-tier heuristics, producing a patchwork rather than one clean mechanism (Do language models understand in fundamentally different ways?). That patchwork matters for your question: even within one modality, interpretability isn't a single phenomenon, so "similar across modalities" was never going to be a yes/no.
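To see why angle adds information that distance alone misses, here is a minimal toy sketch. It is not the Polar Probe's actual method: a real probe would learn a projection over high-dimensional embeddings, whereas this example uses hand-picked 2-D points. The function name `polar_relation` and the coordinates are assumptions made for illustration.

```python
import math


def polar_relation(head, dependent):
    """Return (distance, angle in degrees) of the offset from head to dependent."""
    dx = dependent[0] - head[0]
    dy = dependent[1] - head[1]
    distance = math.hypot(dx, dy)
    angle = math.degrees(math.atan2(dy, dx))
    return distance, angle


# Hypothetical 2-D positions for a verb and two of its dependents.
verb = (0.0, 0.0)
subject = (1.0, 0.0)
obj = (0.0, 1.0)

d_subj, a_subj = polar_relation(verb, subject)
d_obj, a_obj = polar_relation(verb, obj)

# Both dependents sit at the same distance from the verb,
# so only the angle tells the two relations apart.
print(d_subj, a_subj)
print(d_obj, a_obj)
```

A distance-only readout would treat the two dependents as identical here. Reading the angle as well is what lets a probe of this kind recover relation type and direction, under the assumption that the learned space arranges relations by orientation.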

Where the corpus touches multimodality at all, it does so from the opposite direction — not interpreting multimodal models, but arguing why text alone is limiting. The Plato's-cave framing holds that text strips physics, geometry, and causality, leaving models manipulating ungrounded symbols (Are text-only language models fundamentally limited by abstraction?), and Bender & Koller argue meaning requires linking expressions to communicative intent that form-only training can't reach (Can language models learn meaning from text patterns alone?). These imply multimodal grounding would change *what* gets represented — but they say nothing about whether the internal attractor-like structure would look the same.

The thing you might not have expected to learn: the interesting open question isn't really "text vs. multimodal" symmetry — it's that the language-model internals already split between two interpretability stories, a geometric/circuit one and a flow/dynamics one (How do language models encode syntactic relations geometrically?, Do transformer models store knowledge or generate it continuously?). If you want to chase the multimodal comparison, that's the gap to fill in the collection; if you want to go deeper now, those two notes plus the tiered-understanding one are the doorways.

Sources 6 notes

Do transformer models store knowledge or generate it continuously?

Transformers organize knowledge as flowing activations rather than retrievable archives, mirroring oral cultures where knowledge exists only in performance. This explains why model knowledge is contextual, difficult to edit, and inseparable from generation.

Do language models sparsify their activations under difficult tasks?

As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.

How do language models encode syntactic relations geometrically?

The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.

Do language models understand in fundamentally different ways?

Mechanistic interpretability reveals conceptual understanding (features as directions), state-of-world understanding (factual connections), and principled understanding (compact circuits). Crucially, higher tiers coexist with lower-tier heuristics rather than replacing them, creating a patchwork of capabilities.

Are text-only language models fundamentally limited by abstraction?

Text strips the physics, geometry, and causality present in reality, forcing language models to manipulate symbols without grounding in their source dynamics. This creates predictable failure modes in physical, geometric, and causal reasoning that multimodal training could address.

Can language models learn meaning from text patterns alone?

Bender & Koller argue that meaning requires the relation between expressions and communicative intents. Since LLMs are trained only on form-to-form prediction with no access to shared attention or intent, they cannot reconstruct the meaning that grounds language.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a mechanistic interpretability researcher. The question: Do language models and multimodal models exhibit similar attractor-based internal structure—stable geometric or dynamical regimes you can decode?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026. The library contains rich language-model interpretability work but NO multimodal mechanistic studies:
• Transformer residual streams transmit knowledge as *flow* (contextual, performative) not fixed storage, closer to dynamical attractors than databases (~2024–25).
• LLM activations sparsify systematically under distribution shift as an adaptive filter stabilizing behavior, not a breakdown (~2026).
• Syntax is encoded in polar-coordinate geometry—distance and angle between embeddings—showing spontaneous structured, quasi-symbolic organization (~2024–25).
• Understanding in LLMs splits hierarchically: features as directions → factual connections → compact circuits, a patchwork not a single mechanism (~2025).
• Text-only models strip physics, geometry, causality (Plato's cave), and meaning requires grounding to communicative intent, implying multimodal representation *differs* in content, not necessarily structure (~2023–24).

Anchor papers (verify; mind their dates):
• arXiv:2412.05571 (2024-12): Polar coordinates encode syntax.
• arXiv:2507.08017 (2025-07): Mechanistic understanding hierarchies.
• arXiv:2603.03415 (2026-03): OOD sparsification mechanisms.
• arXiv:2603.03276 (2026-03): Multimodal pretraining exploration.

Your task:
(1) **RE-TEST EACH CONSTRAINT.** The library claims language models show geometric + flow-based attractors but offers zero multimodal mechanistic data. For each finding above, ask: have newer vision-language or audio models (GPT-4V, Claude's multimodal, diffusion-based reasoning) been dissected at the activation level since mid-2026? Does their geometry / sparsification / hierarchy match the text story? Separate the durable question—"Do both modalities settle into interpretable stable states?"—from the perishable claim—"We only know this for text."
(2) **Surface strongest CONTRADICTING or SUPERSEDING work (last ~6 months).** Look for papers arguing attractor-based frames fail for multimodal or vice versa; or that flow ≠ geometry in newer architectures (MoE, diffusion-backed inference, multi-agent orchestration). Flag if multimodal interpretability studies have finally appeared.
(3) **Propose two research questions that assume the regime moved:** (a) If multimodal models *do* show attractor structure, does grounding to visual/physical causality *change* the geometry or only the content encoded in it? (b) If they *don't*, what architectural or training difference (cross-modal attention, contrastive loss, discrete bottleneck) breaks the attractor regime that works in text?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
