Do speech models learn the articulatory processes that produce acoustic signals?
This explores whether self-supervised speech models actually learn the physical vocal-tract mechanics that generate sound — the articulation behind the audio — rather than just memorizing language-specific sound categories.
The corpus has a direct answer: yes. Self-supervised speech models appear to infer the language-agnostic physics of speech production, modeling how articulation generates the acoustic signal rather than learning a phonetic inventory for English or Mandarin specifically (see "Do speech models learn language-specific sounds or universal physics?"). The tell is multilingual transfer: a model that had learned only one language's sound categories would not generalize across languages the way these models do. And practically, this articulatory account predicts downstream task performance *better* than probing for phonetic categories does, which suggests that the underlying process, not the surface labels, is what the model is really representing.
What makes this interesting is that it fits a broader pattern in how these models organize knowledge: they tend to capture generative processes rather than store fixed lookup tables. Transformer residual streams, for instance, look less like an archive of retrievable facts and more like knowledge in continuous *flow*, something that exists only in the act of generation, closer to oral performance than to a database (see "Do transformer models store knowledge or generate it continuously?"). A speech model learning articulation-as-process rather than sounds-as-categories is the same instinct in a different domain: the model latches onto the mechanism that produces the data, not the catalog of outputs.
There's also a thread here about computation that happens beneath the visible surface. In language models, the real reasoning is sometimes computed in early layers and then overwritten before output, so the meaningful structure lives below what the final tokens show (see transformers-perform-hidden-reasoning-computations-in-earlier-layers-then-overwri). Probing speech models reveals a parallel story: the articulatory "causes" are recoverable inside the representations even though the training signal was only raw audio. The model reconstructs the hidden generative process from surface observations alone, recovering vocal-tract dynamics it was never explicitly taught.
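To make "recoverable inside the representations" concrete, here is a minimal sketch of a linear probe. It is not the method used in the underlying notes, and it touches no real speech model or articulatory recording: the "features" and "articulator trajectories" are synthetic stand-ins, and every name in it (`fit_ridge_probe`, `held_out_correlation`, the array shapes) is a hypothetical choice made for illustration. The idea it shows is only the general one: fit a simple regression from frame-level features to articulator positions, then check how well it predicts held-out frames, alongside a control where the features carry no articulatory information.

```python
import numpy as np

def add_bias(features):
    # Append a constant column so the probe can learn an offset.
    return np.hstack([features, np.ones((features.shape[0], 1))])

def fit_ridge_probe(features, targets, alpha=1.0):
    # Closed-form ridge regression from features to targets.
    X = add_bias(features)
    penalty = alpha * np.eye(X.shape[1])
    penalty[-1, -1] = 0.0  # leave the bias term unpenalized
    return np.linalg.solve(X.T @ X + penalty, X.T @ targets)

def held_out_correlation(weights, features, targets):
    # Pearson correlation between predicted and true values, per target dimension.
    predictions = add_bias(features) @ weights
    p = predictions - predictions.mean(axis=0)
    t = targets - targets.mean(axis=0)
    return (p * t).sum(axis=0) / (np.linalg.norm(p, axis=0) * np.linalg.norm(t, axis=0))

# Synthetic stand-ins (assumptions, not real data):
# 2000 frames, 6 articulator coordinates, 64-dimensional "model features".
rng = np.random.default_rng(0)
n_frames, n_articulators, feature_dim, split = 2000, 6, 64, 1500
articulators = rng.standard_normal((n_frames, n_articulators))

# Features that linearly encode the articulators, plus a little noise.
mixing = rng.standard_normal((n_articulators, feature_dim))
features = articulators @ mixing + 0.1 * rng.standard_normal((n_frames, feature_dim))

# Control features with no relation to the articulators at all.
control_features = rng.standard_normal((n_frames, feature_dim))

weights = fit_ridge_probe(features[:split], articulators[:split])
scores = held_out_correlation(weights, features[split:], articulators[split:])

control_weights = fit_ridge_probe(control_features[:split], articulators[:split])
control_scores = held_out_correlation(
    control_weights, control_features[split:], articulators[split:]
)

print("probe correlation per articulator:  ", scores.round(3))
print("control correlation per articulator:", control_scores.round(3))
```

In a real study the synthetic arrays would be replaced by layer activations from a speech model and measured articulator trajectories aligned to the same frames. The control matters because a high probe score only means something if uninformative features score low under the same procedure.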
If you want to go deeper, the speech-SSL note is the doorway — it reframes a question that sounds narrowly phonetic ("which sounds did it learn?") into a much bigger one: under what conditions does a model trained only on outputs end up reconstructing the physics that generated them? That's the thing worth carrying away. Learning the *cause* of a signal rather than the signal itself is what lets these systems travel across languages they were never specifically trained on.
Sources 3 notes
Self-supervised speech models learn the language-agnostic physics of how the vocal tract produces acoustics, not language-specific phonetic categories. This explains their multilingual transfer and predicts their downstream task performance better than phonetic probing.
Transformers organize knowledge as flowing activations rather than retrievable archives, mirroring oral cultures where knowledge exists only in performance. This explains why model knowledge is contextual, difficult to edit, and inseparable from generation.