What makes internal embeddings useful as multimodal input for language model training?

This note explores why representations pulled from inside a model (embeddings, hidden states, latent vectors) might work better than raw signals as training input to a language model, especially across modalities like speech.

The question is really asking what an internal embedding has *already done* to a signal that makes it a better input than the raw thing it came from. The corpus doesn't have a paper aimed squarely at multimodal embedding-as-input, but several notes circle the same territory from different angles, and together they make a sharp case. The strongest single piece of evidence comes from speech: self-supervised speech models don't learn language-specific sound categories; they infer the causal articulatory processes that generate the audio in the first place, that is, the physics of how a vocal tract produces acoustics (see "Do speech models learn language-specific sounds or universal physics?"). That's exactly why their embeddings are useful as input: they've distilled the *generative structure* behind a modality, not its surface form, which is why they transfer across languages and why that structure predicts downstream task performance better than phonetic probing. An embedding is valuable as input precisely to the degree it has compressed cause rather than appearance.

The latent-thought work pushes the same idea in the other direction: instead of importing embeddings from another model, it treats internal latent vectors as a first-class, trainable input channel. Latent-Thought Language Models add a scaling dimension, latent size, that is independent of parameter count, coupling fast local variational learning over the latent vectors with slow global learning of the decoder (see "Can latent thought vectors scale language models beyond parameters?"). The lesson that generalizes to multimodal inputs: a compact internal vector can carry information that scales reasoning without bloating the model, which is the whole economic argument for using embeddings as input rather than re-tokenizing raw data.
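The dual-rate idea can be sketched in a few lines. This is a toy illustration only, not the Latent-Thought Language Model training procedure: it swaps the variational learning described in the note for plain gradient descent on point estimates, uses a hypothetical bilinear decoder (each example is reconstructed as its scalar latent times a shared weight vector), and every name, data value, and learning rate below is an assumption chosen for the example.

```python
# Toy dual-rate loop: per-example latents learn fast, a shared decoder learns slowly.
# Model, data, and learning rates are hypothetical; this is not the LTLM algorithm.

def loss(xs, zs, w):
    """Total squared reconstruction error, where the decoder maps latent z to z * w."""
    return sum((x_d - z * w_d) ** 2 for x, z in zip(xs, zs) for x_d, w_d in zip(x, w))

def train(xs, w, fast_lr=0.1, slow_lr=0.01, inner_steps=5, outer_steps=200):
    zs = [0.0] * len(xs)  # one scalar latent per example
    w = list(w)           # shared decoder weights (copied so the caller's list is untouched)
    for _ in range(outer_steps):
        # Fast local learning: several gradient steps on each example's own latent.
        for i, x in enumerate(xs):
            for _ in range(inner_steps):
                grad_z = sum(-2.0 * (x_d - zs[i] * w_d) * w_d for x_d, w_d in zip(x, w))
                zs[i] -= fast_lr * grad_z
        # Slow global learning: one small gradient step on the shared decoder weights.
        grad_w = [
            sum(-2.0 * (x[d] - z * w[d]) * z for x, z in zip(xs, zs))
            for d in range(len(w))
        ]
        w = [w_d - slow_lr * g for w_d, g in zip(w, grad_w)]
    return zs, w

# Synthetic data: each example is a scaled copy of one hidden direction.
true_w = [1.0, -2.0, 0.5]
xs = [[z * w_d for w_d in true_w] for z in (0.5, -1.0, 2.0, 1.5)]

init_w = [0.5, 0.5, 0.5]
initial_loss = loss(xs, [0.0] * len(xs), init_w)
zs, w = train(xs, init_w)
final_loss = loss(xs, zs, w)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.6f}")
```

The point of the split is that the latents absorb example-specific information quickly while the decoder changes slowly enough to stay a stable target for them. Note that the learned latents and weights are only determined up to a shared scale and sign, so the reconstruction error, not the individual values, is the quantity to inspect.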

But the corpus is just as insistent that internal embeddings are not neutral carriers: they smuggle in the priors of whatever produced them. Hidden states sparsify in a systematic, adaptive way under out-of-distribution shift (see "Do language models sparsify their activations under difficult tasks?"), so the same representation behaves differently depending on how familiar the input is. Worse, parametric associations baked into representations can override the actual input you hand the model, and only causal intervention in the representations, not prompting, fixes it (see "Why do language models ignore information in their context?"). And those internal states encode structural bias: low-resource cultures get represented through high-resource proxies inside the model, not just in its outputs (see "Do LLMs represent low-resource cultures through dominant cultural proxies?"). So 'useful' cuts both ways: an embedding is information-dense because it's opinionated, and those opinions ride along into whatever you train on it.

The deeper, unsettled question sitting under all of this is whether an embedding can carry *meaning* as input at all. One line of work argues that language models operationalize Saussure's *langue*: they compress purely relational structure from form, and that's enough for fluent generation with no external referents (see "Can language models learn meaning without engaging the world?"). The opposing view holds that meaning requires the link between expression and communicative intent, which form-only training can never recover (see "Can language models learn meaning from text patterns alone?"). Multimodal embeddings are interesting partly because they're a wager on the first view extended sideways: if a speech or vision encoder has captured the causal generative structure of its signal, then handing that vector to a language model is a way of grounding it in something beyond text without ever leaving the world of relational representation. Whether that counts as real grounding or just a richer kind of form is the thing worth chasing next.

Sources (7 notes)

Do speech models learn language-specific sounds or universal physics?

Self-supervised speech models learn the language-agnostic physics of how the vocal tract produces acoustics, not language-specific phonetic categories. This explains their multilingual transfer and predicts their downstream task performance better than phonetic probing.

Can latent thought vectors scale language models beyond parameters?

Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.

Do language models sparsify their activations under difficult tasks?

As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Do LLMs represent low-resource cultures through dominant cultural proxies?

Mechanistic interpretability analysis reveals that low-resource cultures, such as those of Ethiopia and Algeria, are structurally represented through high-resource cultural proxies in internal model states, not just in outputs. This architectural bias persists even when models can produce correct surface-level answers.

Can language models learn meaning without engaging the world?

Research shows LLMs learn culturally situated discourse patterns by compressing relational structure from text, demonstrating that fluent language generation requires no external referents or embodied grounding.

Can language models learn meaning from text patterns alone?

Bender & Koller argue that meaning requires the relation between expressions and communicative intents. Since LLMs are trained only on form-to-form prediction with no access to shared attention or intent, they cannot reconstruct the meaning that grounds language.
