How do different speech encoder layers capture different types of gesture information?
This explores how the layers of a speech encoder split apart different kinds of information, and whether the upper and lower layers carry, respectively, the semantic signal and the expressive, motion-driving signal that gesture generation depends on.
The clearest answer in the collection comes from DeepGesture's diffusion model, which treats the speech encoder as a stack that separates two kinds of feature: high-level semantic features (what is being said and meant) live in the upper layers, while low-level motion features (rhythm, emphasis, emotional texture) live lower down (see the note "Can speech features be separated into semantic and stylistic components?"). That split is doing real work: it is what lets the system generate gestures that are both contextually appropriate and emotionally expressive, and it is controllable enough that emotion can be dialed independently while the model still generalizes to synthetic, out-of-distribution voices.
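As a rough illustration of the upper/lower split described above, the sketch below pools a stack of per-layer hidden states into a low-level "motion" stream and a high-level "semantic" stream. The function name `split_layer_features`, the `boundary` parameter, and the use of mean pooling are all assumptions made for illustration; the source notes do not say how DeepGesture selects or combines layers.

```python
from typing import List, Tuple

# hidden_states[layer][frame][dim]: one list of frame vectors per encoder layer.
Layer = List[List[float]]


def mean_over_layers(layers: List[Layer]) -> Layer:
    """Average a group of layers, frame by frame and dimension by dimension."""
    n_layers = len(layers)
    n_frames = len(layers[0])
    n_dims = len(layers[0][0])
    return [
        [sum(layer[t][d] for layer in layers) / n_layers for d in range(n_dims)]
        for t in range(n_frames)
    ]


def split_layer_features(
    hidden_states: List[Layer], boundary: int
) -> Tuple[Layer, Layer]:
    """Return (motion_features, semantic_features).

    Layers below `boundary` are pooled into the low-level motion stream,
    layers at or above it into the high-level semantic stream.
    `boundary` is a hypothetical hyperparameter, not a value from the source.
    """
    if not 0 < boundary < len(hidden_states):
        raise ValueError("boundary must leave at least one layer on each side")
    motion = mean_over_layers(hidden_states[:boundary])
    semantic = mean_over_layers(hidden_states[boundary:])
    return motion, semantic


# Toy input: 4 layers, 2 frames, 2 dims; layer k is filled with the value k.
toy = [[[float(k)] * 2 for _ in range(2)] for k in range(4)]
motion, semantic = split_layer_features(toy, boundary=2)
```

In a real system the two streams would more plausibly come from learned layer weights than from a hard boundary; the fixed cut here only makes the "semantic high, motion low" idea concrete.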
The more surprising piece sits one level deeper. If you ask *why* speech encoder layers carry this kind of structure at all, the corpus points to what self-supervised speech models actually learn: not language-specific phonetic categories, but the causal articulatory physics of how a vocal tract produces sound (see the note "Do speech models learn language-specific sounds or universal physics?"). That matters for gesture, because gesture is itself articulatory: it co-arises from the same embodied, motor act of speaking. An encoder that has internalized how speech is *produced* (the timing, effort, and dynamics of the body making sound) is encoding exactly the low-level motion signal that gesture generation feeds on. The semantic/motion disentanglement described in the first note and the articulatory grounding described in the second are two views of the same thing: upper layers abstract toward meaning, lower layers stay close to the bodily mechanics.
The honest caveat: this collection is thin on the specific question of a *layer-by-layer* gesture probe. There isn't a note here that walks each encoder layer and maps it to a gesture type. What it gives you instead is the load-bearing principle (semantic features high, motion features low) plus the deeper reason that principle holds (the encoder learned production physics, not just phonetic labels). If you want to go further, those two notes are the doorway, and the second one is the more counterintuitive read, because it reframes 'gesture information in a speech model' as a byproduct of the model having quietly learned anatomy rather than language.
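For readers who want to see what a layer-by-layer probe of this kind might look like, here is a minimal sketch. It scores each layer by the strongest absolute Pearson correlation between any single hidden dimension and a per-frame target signal. Everything here is an assumption for illustration: the `probe_layers` name, the hypothetical "motion energy" target, and the single-dimension correlation itself, which is a deliberate simplification (probing studies usually train a small regressor or classifier per layer). None of it comes from the two source notes.

```python
import math
from typing import List


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def probe_layers(
    hidden_states: List[List[List[float]]], target: List[float]
) -> List[float]:
    """Score each layer by its best single-dimension |correlation| with target.

    hidden_states[layer][frame][dim]; target has one value per frame.
    """
    scores = []
    for layer in hidden_states:
        n_dims = len(layer[0])
        best = max(
            abs(pearson([frame[d] for frame in layer], target))
            for d in range(n_dims)
        )
        scores.append(best)
    return scores


# Hypothetical per-frame motion energy for four frames.
motion_energy = [0.0, 1.0, 2.0, 3.0]
toy_states = [
    # Layer 0: dimension 0 tracks the target exactly, dimension 1 is constant.
    [[0.0, 5.0], [1.0, 5.0], [2.0, 5.0], [3.0, 5.0]],
    # Layer 1: dimension 0 is uncorrelated with the target, dimension 1 is constant.
    [[1.0, 0.0], [-1.0, 0.0], [-1.0, 0.0], [1.0, 0.0]],
]
scores = probe_layers(toy_states, motion_energy)
```

Plotting such scores against layer index, once for a motion-like target and once for a semantic one, is the kind of evidence the collection currently lacks.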
Sources (2 notes)
DeepGesture's diffusion model splits speech into high-level semantic features and low-level motion features across encoder layers, enabling emotion-guided control. This disentanglement produces gestures that are both contextually appropriate and emotionally expressive, and generalizes to out-of-distribution synthetic voices.
Self-supervised speech models learn the language-agnostic physics of how the vocal tract produces acoustics, not language-specific phonetic categories. This explains their multilingual transfer and predicts their downstream task performance better than phonetic probing.