Can skipping transcription reduce speech dialogue latency below 300 milliseconds?

This explores whether speech-to-speech models that skip the text-transcription step can answer fast enough to feel like real conversation — and the corpus says yes, with a concrete number.

On latency, the most direct answer in the corpus is a clear yes. LLaMA-Omni generates speech responses directly from speech input, without first transcribing that input to text, and reports a latency of 226 milliseconds, comfortably under the ~300 ms threshold at which conversation starts to feel natural [Can skipping transcription make voice assistants faster?]. The reason it works is more interesting than the number. Speech embeddings carry acoustic information (timing, prosody, emphasis) that text throws away, and the model can start generating before the full input has arrived. Transcription isn't just a slow step; it's a lossy bottleneck that forces the system to wait for a complete utterance before it can think.

Latency isn't the only thing transcription costs you. Traditional pipelines run automatic speech recognition (ASR) first, and in noisy real-world conditions ASR error rates run 15–30 percent [Why do dialogue systems need probabilistic reasoning?]. That error rate is why an entire research tradition argues speech dialogue needs probabilistic belief-tracking rather than deterministic flowcharts: the system must hold a distribution over what the user *might* have said rather than commit to one transcript [Why does speech need different dialogue management than text?]. Skipping transcription sidesteps that failure mode at its source, because there's no brittle text intermediary to be wrong about. So the win is double: faster *and* no transcription errors to recover from.
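The belief-tracking idea can be illustrated with a minimal sketch. It is not the method of any cited system: the intent names, the likelihood values, and the `update_belief` helper are all hypothetical, chosen only to show how a dialogue manager might keep a distribution over what the user meant instead of trusting a single transcript.

```python
from typing import Dict


def update_belief(prior: Dict[str, float], likelihood: Dict[str, float]) -> Dict[str, float]:
    """Bayes update: posterior(h) is proportional to prior(h) * P(observation | h)."""
    unnormalized = {h: prior[h] * likelihood.get(h, 0.0) for h in prior}
    total = sum(unnormalized.values())
    if total == 0.0:
        # The observation is impossible under every hypothesis; keep the prior.
        return dict(prior)
    return {h: p / total for h, p in unnormalized.items()}


# Hypothetical user intents with a uniform starting belief.
belief = {"book_flight": 0.5, "book_hotel": 0.5}

# Assumed likelihoods of one noisy recognition result under each intent.
noisy_observation = {"book_flight": 0.7, "book_hotel": 0.3}

after_first = update_belief(belief, noisy_observation)
after_second = update_belief(after_first, noisy_observation)

print(after_first)
print(after_second)
```

With these assumed numbers, two noisy observations that lean the same way raise the belief in `book_flight` from 0.5 to 0.7 and then to roughly 0.84, and the system never has to commit to one transcript along the way.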

The corpus also suggests that beating 300 ms is necessary but not sufficient for good conversation. Fast responses can still be the wrong responses. Dual-process dialogue planning makes this explicit: it pairs a fast neural policy (System 1) for familiar turns with slower deliberate planning (System 2) for novel ones, switching based on the model's own uncertainty [Can dialogue planning balance fast responses with strategic depth?]. A low-latency speech model is essentially a very good System 1, and the open question is what happens when the conversation needs System 2.
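A minimal sketch of that switching logic, under stated assumptions: the cited framework switches on the model's own uncertainty estimates, but the entropy measure, the 0.5 threshold, the action names, and the `slow_planner` stub below are illustrative choices, not details taken from that work.

```python
import math
from typing import Callable, Dict, Tuple


def entropy(probs: Dict[str, float]) -> float:
    """Shannon entropy in nats; higher means the fast policy is less certain."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0.0)


def choose_action(
    policy_probs: Dict[str, float],
    planner: Callable[[], str],
    threshold: float = 0.5,  # assumed cutoff, would need tuning in practice
) -> Tuple[str, str]:
    """Use the fast policy when it is confident, otherwise defer to the planner."""
    if entropy(policy_probs) <= threshold:
        return max(policy_probs, key=policy_probs.get), "system1"
    return planner(), "system2"


def slow_planner() -> str:
    # Stand-in for deliberate planning (the source note describes MCTS).
    return "ask_clarifying_question"


# Hypothetical action distributions from the fast policy.
familiar_turn = {"confirm_booking": 0.9, "ask_date": 0.05, "ask_city": 0.05}
novel_turn = {"confirm_booking": 1 / 3, "ask_date": 1 / 3, "ask_city": 1 / 3}

fast_result = choose_action(familiar_turn, slow_planner)
slow_result = choose_action(novel_turn, slow_planner)

print(fast_result)
print(slow_result)
```

The peaked distribution has low entropy and stays on the fast path, while the uniform one exceeds the threshold and is handed to the planner. In a sub-300 ms speech system the second branch is the one that would threaten the latency budget.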

That tension matters because the same speed that makes models responsive also makes them commit too early. Across more than 200,000 conversations, LLMs lock into premature assumptions when information arrives gradually and can't course-correct, dropping from ~90% accuracy on single-message tasks to ~65% across natural multi-turn dialogue [Why do AI assistants get worse at longer conversations?] [Why do language models fail in gradually revealed conversations?]. Generating speech faster doesn't fix that, and it might even amplify it. So the honest synthesis is: yes, skipping transcription gets you well under 300 ms and removes a whole class of errors, but sub-300 ms latency buys you the *feel* of conversation, not yet the judgment of one.

Sources (6 notes)

Can skipping transcription make voice assistants faster?

LLaMA-Omni generates speech responses directly from speech input without transcribing to text first, achieving 226ms latency. This works because speech embeddings preserve acoustic information that text loses, enabling generation before full input is received.

Why do dialogue systems need probabilistic reasoning?

Real-world speech recognition achieves 15-30 percent error rates in noisy environments, making deterministic flowchart dialogue systems unworkable. POMDP-based systems handle this by maintaining belief distributions over user intent rather than committing to single interpretations.

Why does speech need different dialogue management than text?

ASR error rates of 15–30% make traditional flowchart dialogue managers fragile. Research shows POMDP-based belief tracking and calibration-first policies are architectural necessities, not optional refinements.

Can dialogue planning balance fast responses with strategic depth?

A framework combining a neural policy model (System 1) for familiar contexts with MCTS planning (System 2) for novel scenarios, switching based on the model's own uncertainty estimates, matches or exceeds pure MCTS performance while reducing computational cost.

Why do AI assistants get worse at longer conversations?

LLMs perform at 90% accuracy with single-message instructions but drop to 65% across natural conversation. Models lock into early guesses when information arrives gradually and cannot course-correct, a behavior induced by RLHF training that rewards helpfulness over clarification.

Why do language models fail in gradually revealed conversations?

Across 200,000+ conversations, all major LLMs show a 39% average performance drop in multi-turn settings because they lock into incorrect early guesses. Agent-style mitigations recover only 15–20% of this loss.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM dialogue systems researcher. The question remains open: **Can skipping transcription reduce speech dialogue latency below 300 milliseconds while preserving dialogue quality?**

What a curated library found — and when (dated claims, not current truth):
Findings span 2019–2026; treat as perishable constraints to re-test:
- LLaMA-Omni achieves 226ms end-to-end latency by bypassing transcription entirely, generating speech directly from acoustic embeddings (2024).
- ASR error rates in noisy conditions reach 15–30%, forcing dialogue systems to use probabilistic belief-tracking instead of deterministic paths (pre-2024).
- Dual-process dialogue planning (System 1 fast neural policy + System 2 slow deliberation) can switch based on model uncertainty; sub-300ms latency favors System 1 (2024).
- LLMs drop from ~90% accuracy on single-message tasks to ~65% on multi-turn dialogue due to premature assumption-locking; speed may amplify this error (2025).
- Conversational alignment and intent-mismatch recovery remain unresolved in fast speech pipelines (2025–2026).

Anchor papers (verify; mind their dates):
- arXiv:2409.06666 (LLaMA-Omni, 2024-09): direct speech-to-speech baseline.
- arXiv:2406.05374 (Dual-process Dialogue Planning, 2024-06): System 1/2 trade-off.
- arXiv:2505.06120 (LLMs Get Lost, 2025-05): multi-turn degradation.
- arXiv:2508.18167 (DiscussLLM, 2025-08): when-to-speak heuristics.

Your task:
(1) **RE-TEST latency + quality constraints.** Has 226ms been beaten? Have newer speech-LLM architectures (e.g., streaming tokens, adaptive batch sizing, KV-cache optimization) further reduced latency *and* multi-turn accuracy loss? Does any post-2024 work show recovery mechanisms for premature commitment in fast dialogue? Distinguish: sub-300ms is likely still achievable; *maintaining dialogue coherence at that speed* is the frontier.
(2) **Surface contradicting or superseding work from the last 6 months.** Any papers showing *faster* systems that *also* solve multi-turn alignment? Any work proving probabilistic ASR is obsolete in low-latency pipelines, or proving it still matters?
(3) **Propose 2 research questions assuming the regime may have moved:** (a) Does adaptive latency — varying response time by turn complexity — outperform fixed 226ms? (b) Can speech embeddings encode dialogue state (participant roles, intent shifts, repair needs) well enough to replace multi-turn memory?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
