Can skipping transcription reduce speech dialogue latency below 300 milliseconds?
This explores whether speech-to-speech models that skip the text-transcription step can answer fast enough to feel like real conversation — and the corpus says yes, with a concrete number.
The most direct answer in the corpus is a clear yes: LLaMA-Omni generates speech responses directly from speech input, never converting to text first, and lands at 226 milliseconds, under the roughly 300 ms threshold that makes conversation feel natural (see "Can skipping transcription make voice assistants faster?"). The reason it works is more interesting than the number. Speech embeddings carry acoustic information (timing, prosody, emphasis) that text throws away, and the model can start generating before the full input has arrived. Transcription isn't just a slow step; it's a lossy bottleneck that forces the system to wait for a complete utterance before it can think.
Latency isn't the only thing transcription costs you. Traditional pipelines run automatic speech recognition (ASR) first, and in noisy real-world conditions ASR error rates run 15–30 percent (see "Why do dialogue systems need probabilistic reasoning?"). That error rate is why an entire research tradition argues speech dialogue needs probabilistic belief-tracking rather than deterministic flowcharts: the system must hold a distribution over what the user *might* have said rather than commit to one transcript (see "Why does speech need different dialogue management than text?"). Skipping transcription sidesteps that failure mode at its source, because there's no brittle text intermediary to be wrong about. So the win is double: faster *and* no transcription errors to recover from.
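To make "holding a distribution over what the user might have said" concrete, here is a minimal sketch of a single-slot Bayesian belief update. It is not taken from any of the cited systems: the intent names, the 20% error rate (picked from inside the 15–30 percent range), and the assumption that recognition errors spread evenly over the other intents are all illustrative.

```python
def observation_likelihood(observed, intents, error_rate):
    """P(ASR reports `observed` | true intent), for each candidate intent.

    Assumption: recognition errors are spread evenly over the other intents.
    """
    wrong = error_rate / (len(intents) - 1)
    return {i: (1.0 - error_rate) if i == observed else wrong for i in intents}


def update_belief(belief, likelihood):
    """One Bayes step: the posterior is proportional to likelihood times prior."""
    unnormalized = {i: belief[i] * likelihood[i] for i in belief}
    total = sum(unnormalized.values())
    return {i: p / total for i, p in unnormalized.items()}


# Hypothetical intents and an illustrative 20% ASR error rate.
INTENTS = ["book_flight", "book_hotel", "cancel"]
belief = {i: 1.0 / len(INTENTS) for i in INTENTS}  # uniform prior

# The recognizer reports "book_flight" on two consecutive turns.
first = update_belief(belief, observation_likelihood("book_flight", INTENTS, 0.2))
second = update_belief(first, observation_likelihood("book_flight", INTENTS, 0.2))
```

Under these assumptions the belief in `book_flight` is 0.8 after one observation and about 0.97 after two consistent ones. The system never commits to a single transcript; it only becomes more confident as evidence accumulates, which is the property a deterministic flowchart lacks.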
The corpus also suggests that beating 300 ms is necessary but not sufficient for good conversation. Fast responses can still be the wrong responses. Dual-process dialogue planning makes this explicit: it pairs a fast neural policy (System 1) for familiar turns with slower deliberate planning (System 2) for novel ones, switching based on the model's own uncertainty (see "Can dialogue planning balance fast responses with strategic depth?"). A low-latency speech model is essentially a very good System 1, and the open question is what happens when the conversation needs System 2.
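The switching rule can be sketched as an uncertainty gate. This is a toy illustration, not the cited framework's implementation: normalized entropy as the uncertainty measure, the 0.5 threshold, and both toy policies are assumptions, and the slow planner here is a stand-in function rather than real MCTS.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy scaled to [0, 1]; 1.0 means maximally uncertain."""
    if len(probs) <= 1:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))


def choose_action(fast_policy, slow_planner, state, threshold=0.5):
    """Use the fast policy when it is confident, otherwise defer to the planner."""
    dist = fast_policy(state)  # System 1: action -> probability
    if normalized_entropy(list(dist.values())) <= threshold:
        return max(dist, key=dist.get), "system1"
    return slow_planner(state), "system2"  # System 2: slower, deliberate


# Hypothetical stand-ins for the two systems.
def toy_fast(state):
    if state == "hello":
        return {"greet": 0.9, "clarify": 0.05, "escalate": 0.05}
    return {"greet": 1 / 3, "clarify": 1 / 3, "escalate": 1 / 3}


def toy_slow(state):
    return "clarify"  # a real planner would search over future turns here


familiar = choose_action(toy_fast, toy_slow, "hello")
novel = choose_action(toy_fast, toy_slow, "something unusual")
```

With these toy policies the familiar greeting is handled by System 1, while the unfamiliar input, where the fast policy is uniformly unsure, is routed to System 2. A low-latency speech model would sit in the `fast_policy` slot; what belongs in the `slow_planner` slot is the open question.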
That tension matters because the same speed that makes models responsive also makes them commit too early. Across more than 200,000 conversations, LLMs lock into premature assumptions when information arrives gradually and can't course-correct, dropping from ~90% accuracy on single-message tasks to ~65% across natural multi-turn dialogue (see "Why do AI assistants get worse at longer conversations?" and "Why do language models fail in gradually revealed conversations?"). Generating speech faster doesn't fix that, and it might even amplify it. So the honest synthesis is: yes, skipping transcription gets you well under 300 ms and removes a whole class of errors, but sub-300 ms latency buys you the *feel* of conversation, not yet the judgment of one.
Sources (6 notes)
LLaMA-Omni generates speech responses directly from speech input without transcribing to text first, achieving 226ms latency. This works because speech embeddings preserve acoustic information that text loses, enabling generation before full input is received.
Real-world speech recognition sees 15–30 percent error rates in noisy environments, making deterministic flowchart dialogue systems unworkable. POMDP-based systems handle this by maintaining belief distributions over user intent rather than committing to single interpretations.
ASR error rates of 15–30% make traditional flowchart dialogue managers fragile. Research shows POMDP-based belief tracking and calibration-first policies are architectural necessities, not optional refinements.
A framework combining a neural policy model (System 1) for familiar contexts with MCTS planning (System 2) for novel scenarios, switching based on the model's own uncertainty estimates, matches or exceeds pure MCTS performance while reducing computational cost.
LLMs perform at 90% accuracy with single-message instructions but drop to 65% across natural conversation. Models lock into early guesses when information arrives gradually and cannot course-correct, a behavior induced by RLHF training that rewards helpfulness over clarification.
Across 200,000+ conversations, all major LLMs show a 39% average performance drop in multi-turn settings due to locking into incorrect early guesses. Agent mitigations recover only 15–20% of this loss.