Can articulatory inversion serve as a window into what speech models have learned?

This note reads articulatory inversion, the reconstruction of the vocal-tract movements behind a sound, as an interpretability probe: a way to ask what a speech model actually represents under the hood. The corpus suggests it works because the model already learned the articulation, not just the acoustics.

This explores whether articulatory inversion, the recovery of how the mouth and vocal tract moved to produce a sound, can act as a window into what speech models have internally learned, rather than just a speech-engineering trick. The most direct evidence in the collection says yes, and for a surprising reason: self-supervised speech models appear to infer the causal articulatory processes that generate acoustics, not language-specific phonetic categories (see: Do speech models learn language-specific sounds or universal physics?). If a model has secretly reconstructed the physics of the vocal tract, then probing it for articulation isn't testing whether it can do a task; it's reading out a representation it built on its own. Notably, that work found articulatory probing predicts downstream performance *better* than phonetic probing, which is the real claim hiding in your question: the better window is the one that matches what the model actually encoded, not the labels humans find intuitive.
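To make "probing for articulation" concrete, here is a minimal sketch of a linear probe, not the cited work's actual method. Everything in it is an assumption for illustration: the "hidden states" are synthetic vectors that linearly mix one hypothetical articulator trajectory (say, tongue-tip height) with noise, and the helper names `fit_ridge_probe`, `predict`, and `pearson` are invented here. The probe is a closed-form ridge regression, scored by held-out correlation.

```python
import random


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * p for x, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_ridge_probe(H, y, lam=1e-3):
    """Closed-form ridge regression: w = (H^T H + lam * I)^-1 H^T y."""
    d = len(H[0])
    A = [[sum(h[i] * h[j] for h in H) + (lam if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(h[i] * t for h, t in zip(H, y)) for i in range(d)]
    return solve(A, b)


def predict(H, w):
    """Apply the linear probe to each hidden-state vector."""
    return [sum(hi * wi for hi, wi in zip(h, w)) for h in H]


def pearson(u, v):
    """Pearson correlation between two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5


# Synthetic stand-in for model activations (assumption: a real study would
# use hidden states from a speech model and measured articulator positions).
rng = random.Random(0)
DIM = 6
mix = [rng.gauss(0, 1) for _ in range(DIM)]
H, y = [], []
for _ in range(400):
    a = rng.gauss(0, 1)  # hypothetical articulator position for one frame
    H.append([m * a + rng.gauss(0, 0.1) for m in mix])
    y.append(a)

# Fit on the first 300 frames, score on the remaining 100.
w = fit_ridge_probe(H[:300], y[:300])
score = pearson(predict(H[300:], w), y[300:])
print(f"held-out probe correlation: {score:.3f}")
```

A high held-out correlation says only that the trajectory is linearly decodable from these vectors. Comparing that score against a phonetic probe on the same layers is the kind of contrast the claim above rests on.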

That reframes inversion as a member of a larger family of interpretability moves the corpus keeps returning to: the search for the unit of analysis that reveals a model's internal commitments. In language models, sparse autoencoders surfaced an entity-recognition mechanism that the model uses to track its own knowledge and steer hallucination versus refusal (see: Do models know what they don't know?). The parallel is tight: in both cases the informative probe is causal (it steers behavior or generates the signal) rather than correlational. Articulatory inversion is the speech-domain version of finding that causal latent variable.

But a window only shows you what's behind it, and the collection has a caution about assuming there's a stable object back there at all. Transformers may transmit knowledge as continuous flow rather than fixed storage, closer to oral performance than to a retrievable archive (see: Do transformer models store knowledge or generate it continuously?), which means an inverted articulatory trajectory might be a snapshot of a process, not a lookup of a stored phoneme. Relatedly, hidden states reorganize under pressure: representations sparsify adaptively when a task drifts out of distribution (see: Do language models sparsify their activations under difficult tasks?). So what inversion reveals could shift depending on whether the input is familiar speech or an accent the model has never heard: the window's view changes with the weather.

There's also a cross-domain warning worth carrying over. Probing assumes the model committed to one thing you can read out, but language models can hold a superposition and only sample a particular character at generation time (see: Do large language models actually commit to a single character?). Ported to speech, this suggests a single acoustic frame might be consistent with several articulatory configurations the model is implicitly holding open, so inversion may recover a distribution over vocal-tract states rather than the one true gesture. That's not a failure of the window; it's the window honestly showing you that the model's knowledge is probabilistic.
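A toy sketch of why a distribution can be the more honest readout than a point estimate. The setup is an assumption for illustration and is not drawn from any of the cited notes: two mirror-image articulatory states, near -1 and +1, produce the same acoustic value, and the helper `candidates` is a hypothetical name. Averaging over the states consistent with one observed frame lands near zero, a configuration almost none of the candidates occupy.

```python
import random

rng = random.Random(0)

# Assumed one-to-many mapping: the acoustic value depends only on a**2,
# so states near -1 and near +1 sound the same.
samples = []
for _ in range(2000):
    a = rng.choice([-1.0, 1.0]) + rng.gauss(0, 0.05)  # articulatory state
    x = a * a + rng.gauss(0, 0.02)                    # observed acoustics
    samples.append((x, a))


def candidates(x_obs, tol=0.05):
    """Articulatory states whose acoustics fall within tol of the frame."""
    return [a for x, a in samples if abs(x - x_obs) <= tol]


cands = candidates(1.0)

# Point estimate: the mean over all consistent states.
point_estimate = sum(cands) / len(cands)

# What the consistent states actually look like: magnitudes close to 1.
typical_magnitude = sum(abs(a) for a in cands) / len(cands)

print(f"consistent states: {len(cands)}")
print(f"point estimate (mean): {point_estimate:.3f}")
print(f"typical |state|: {typical_magnitude:.3f}")
```

Reporting the candidate set, or a fitted mixture over it, keeps both modes visible, whereas the mean collapses them into a state the toy system almost never produces.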

The thing you may not have known you wanted to know: articulatory inversion is interesting *not* because it lets us decode speech better, but because it's an accidental confession. A model trained only on raw audio, never told about lips or tongues, reconstructs the bodily mechanics anyway — which says the most learnable structure in speech is the physical act that made it. The window doesn't just show what the model learned; it shows what was there to be learned in the first place.

Sources (5 notes)

Do speech models learn language-specific sounds or universal physics?

Self-supervised speech models learn the language-agnostic physics of how the vocal tract produces acoustics, not language-specific phonetic categories. This explains their multilingual transfer and predicts their downstream task performance better than phonetic probing.

Do models know what they don't know?

Sparse autoencoders revealed that language models develop causal mechanisms for detecting whether they know facts about entities. These mechanisms actively steer both hallucination and refusal behavior, and persist from base models into finetuned chat versions.

Do transformer models store knowledge or generate it continuously?

Transformers organize knowledge as flowing activations rather than retrievable archives, mirroring oral cultures where knowledge exists only in performance. This explains why model knowledge is contextual, difficult to edit, and inseparable from generation.

Do language models sparsify their activations under difficult tasks?

As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.

Do large language models actually commit to a single character?

Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, proving no fixed commitment exists.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a speech-model interpretability researcher. The question: Can articulatory inversion serve as a *causal* window into what self-supervised speech models have learned—not just as an engineering tool, but as a probe that reveals the model's internal commitments?

What a curated library found — and when (dated claims, not current truth):
Findings span October 2023 to March 2026. Key constraints:
• Self-supervised speech models infer causal articulatory processes (vocal-tract kinematics) from raw audio, and articulatory probing predicts downstream task performance *better* than phonetic probing (~2310.10788, Oct 2023).
• Transformer residual streams transmit knowledge as continuous flow rather than stable storage—so an inverted articulatory trajectory may be a snapshot of process, not a retrieved phoneme (~2024-04).
• Hidden-state representations sparsify adaptively under out-of-distribution shift; what inversion reveals may drift if input is unfamiliar speech or unseen accent (~2603.03415, Mar 2026).
• Language models can hold superpositions of representations and sample only at generation time; a single acoustic frame may be consistent with multiple articulatory configurations the model implicitly holds open (~2024-02).
• Entity-recognition mechanisms in LLMs act as causal self-knowledge probes, steering hallucination vs. refusal—a parallel interpretability pattern across modalities (~2411.14257, Nov 2024).

Anchor papers (verify; mind their dates):
• arXiv:2310.10788 (Oct 2023) — Self-supervised models infer universal articulatory kinematics.
• arXiv:2411.14257 (Nov 2024) — Entity-recognition as causal probe in language models.
• arXiv:2603.03415 (Mar 2026) — OOD sparsification in LLM representations.
• Superposition and sampling in character-level LLM knowledge (Feb 2024; arXiv ID not given).

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For articulatory inversion: (a) Have newer speech models (e.g., Whisper-v3 or more recent self-supervised models) been probed for articulatory information, and does the superiority over phonetic probing hold? (b) Do post-2024 findings on knowledge flow in transformers (e.g., arXiv:2504.09522 on knowledge permeation, arXiv:2507.20252 on post-completion learning) change the interpretation of whether inversion recovers stored representations or dynamic processes? (c) Has the OOD sparsification finding (~2603.03415) been tested on speech models specifically, and does articulatory inversion become less reliable out-of-domain? Separate the durable question (whether articulatory structure is learnable from raw audio) from perishable limitations (whether current probing methods can cleanly read it out).

(2) **Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months.** Look for: (a) papers falsifying the claim that articulatory features are the primary structure speech models learn; (b) new probing methods that outperform articulatory inversion; (c) evidence that speech models learn language-specific phonetic categories *despite* raw-audio training, undermining the universality claim from ~2310.10788.

(3) **Propose 2 research questions that ASSUME the regime may have moved:** (a) If speech models now learn structured, multi-scale temporal patterns (not just static articulation), how should inversion be adapted to recover *trajectory* distributions rather than point estimates? (b) If post-completion learning and continual adaptation are now standard, does articulatory inversion reveal *what the model learned at initialization* vs. *what it learned from in-context examples*—and how do these differ?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
