xAI ★ Emerging Medium Voice Realtime WebRTC

X6 · Design Grok Voice Mode

Verified source

Grok voice mode launched in the X iOS app (2025); the architecture pattern maps to OpenAI Realtime / Gemini Live. Credibility: C.

Problem

Support real-time voice conversations in which users tap the mic, speak, and hear Grok respond with low end-to-end latency (target: under 800 ms). Handle barge-in (the user interrupting Grok mid-speech), poor networks, and 100k concurrent sessions globally.
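To make the 800 ms target concrete, it helps to split it into per-stage budgets and check that they sum to less than the target. The sketch below does only that arithmetic. The stage names and every millisecond figure are illustrative assumptions, not measurements from Grok or any other system.

```python
# Illustrative end-to-end latency budget for one voice turn.
# Every figure below is an assumption chosen for the example (milliseconds).
BUDGET_MS = 800

STAGE_MS = {
    "uplink_network": 80,
    "vad_endpointing": 200,      # waiting to decide the user stopped talking
    "asr_finalization": 100,
    "llm_first_token": 200,
    "tts_first_audio": 120,
    "downlink_and_jitter_buffer": 80,
}

def total_latency(stages):
    """Sum the per-stage latencies."""
    return sum(stages.values())

def headroom(stages, budget=BUDGET_MS):
    """Remaining budget; a negative value means the target is missed."""
    return budget - total_latency(stages)

if __name__ == "__main__":
    print(total_latency(STAGE_MS), headroom(STAGE_MS))
```

With these assumed numbers the stages total 780 ms, leaving 20 ms of headroom, which shows how little slack remains once endpointing and first-token latency are accounted for.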

Architecture

flowchart LR
  M[Mobile app] -->|WebRTC Opus| GW[Voice gateway]
  GW --> VAD[Voice activity detect]
  VAD --> ASR[Streaming ASR]
  ASR --> LLM[Grok streaming]
  LLM --> TTS[Streaming TTS]
  TTS -->|Opus back| M
  M -->|user interrupts| VAD
  VAD -->|barge-in| LLM
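
The barge-in edge in the diagram can be sketched with task cancellation. This is a minimal simulation, not a real audio pipeline: `speak` and `run_turn` are hypothetical names, the sleep calls stand in for synthesis time and for the VAD trigger, and "playing" a token just appends it to a list.

```python
import asyncio

async def speak(tokens, played):
    """Stand-in for LLM -> TTS streaming: 'plays' one token per tick."""
    for tok in tokens:
        await asyncio.sleep(0.01)  # simulated synthesis/playback time
        played.append(tok)

async def run_turn(tokens, barge_in_after=None):
    """Run one response turn; cancel it if a barge-in is simulated."""
    played = []
    task = asyncio.create_task(speak(tokens, played))
    if barge_in_after is not None:
        await asyncio.sleep(barge_in_after)  # simulated VAD trigger on the uplink
        task.cancel()  # stops playback; tokens not yet played are dropped
    try:
        await task
    except asyncio.CancelledError:
        pass  # expected when the turn was interrupted
    return played

if __name__ == "__main__":
    print(asyncio.run(run_turn(["Hello", " there"])))
    print(len(asyncio.run(run_turn([str(i) for i in range(50)], barge_in_after=0.05))))
```

In a real system the cancellation would also need to flush audio already buffered on the client, which this sketch does not model.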

Key decisions

  • WebRTC with Opus for audio: the stack provides a built-in jitter buffer and packet-loss concealment.
  • Streaming ASR emits partial hypotheses every 200 ms; Grok starts working on partial transcripts instead of waiting for the final one.
  • Streaming TTS starts playback as soon as the first sentence boundary is detected in the LLM output.
  • Barge-in: VAD on the uplink stops the currently playing TTS and discards in-flight LLM tokens.
  • Voice servers are stateful: anchor each session to one regional edge, with sticky routing via cookie or connection ID.
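
The sentence-boundary decision above can be sketched as a small generator that buffers LLM tokens and releases each sentence to TTS as soon as it is complete. The function name and the boundary rule (`.`, `!`, or `?` followed by whitespace) are assumptions for illustration; a production splitter would need to handle abbreviations, numbers, and other languages.

```python
import re

# Sentence end: '.', '!' or '?' followed by whitespace. This is a simplifying
# assumption and will mis-split on abbreviations such as "e.g. ".
_BOUNDARY = re.compile(r"[.!?]\s")

def sentences_from_tokens(tokens):
    """Yield complete sentences as soon as they appear in a token stream."""
    buf = ""
    for tok in tokens:
        buf += tok
        while True:
            m = _BOUNDARY.search(buf)
            if m is None:
                break
            yield buf[:m.end()].strip()  # hand this sentence to TTS now
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()  # flush whatever remains when the stream ends

if __name__ == "__main__":
    stream = ["Hi", " there", ".", " How", " are", " you", "?", " Fine"]
    for sentence in sentences_from_tokens(stream):
        print(sentence)
```

Because the rule waits for whitespace after the punctuation, a sentence is released one token late; that trades a few milliseconds for fewer false splits on tokens like "3.5".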

Follow-ups

  • How do you handle 30% packet loss without making Grok sound robotic?
  • How do you keep voice latency under 800 ms for a user in Jakarta?

Related study-guide topics