X6 · Design Grok Voice Mode
Verified source
Grok voice mode launched in the X iOS app (2025); the architecture pattern maps to OpenAI Realtime / Gemini Live. Credibility: C.
Problem
Support real-time voice conversations in which users tap the mic, speak, and hear Grok respond with low end-to-end latency (target: <800ms). Handle barge-in (the user interrupting Grok mid-speech), poor networks, and 100k concurrent sessions globally.
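To make the <800ms target concrete, it can help to split it into a per-stage budget. The sketch below is only an illustration: the stage names follow the pipeline in this card, but every millisecond value is an assumed placeholder, not a measurement of Grok or any real system.

```python
# Illustrative end-to-end latency budget for one voice turn.
# All per-stage numbers are ASSUMPTIONS chosen for the example.

TARGET_MS = 800  # end-to-end target from the problem statement

budget_ms = {
    "uplink_network": 60,             # assumed one-way mobile -> edge
    "vad_endpointing": 200,           # assumed silence window before end-of-turn
    "asr_finalization": 100,          # assumed time to finalize the transcript
    "llm_first_sentence": 250,        # assumed time to the first sentence boundary
    "tts_first_audio": 100,           # assumed time to the first synthesized chunk
    "downlink_and_jitter_buffer": 80, # assumed one-way edge -> mobile plus buffering
}


def total_latency(budget: dict[str, int]) -> int:
    """Sum the per-stage latencies of a sequential pipeline."""
    return sum(budget.values())


def headroom(budget: dict[str, int], target: int = TARGET_MS) -> int:
    """Milliseconds left under the target (negative means over budget)."""
    return target - total_latency(budget)


if __name__ == "__main__":
    for stage, ms in budget_ms.items():
        print(f"{stage:32s} {ms:4d} ms")
    print(f"{'total':32s} {total_latency(budget_ms):4d} ms")
    print(f"{'headroom':32s} {headroom(budget_ms):4d} ms")
```

With these assumed values the stages sum to 790ms, which leaves almost no slack. That is why the design overlaps stages (partial ASR results, sentence-level TTS) instead of running them strictly in sequence.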
Architecture
flowchart LR
  M[Mobile app] -->|WebRTC Opus| GW[Voice gateway]
  GW --> VAD[Voice activity detect]
  VAD --> ASR[Streaming ASR]
  ASR --> LLM[Grok streaming]
  LLM --> TTS[Streaming TTS]
  TTS -->|Opus back| M
  M -->|user interrupts| VAD
  VAD -->|barge-in| LLM
Key decisions
- WebRTC with Opus for audio: built-in jitter buffer and packet-loss concealment.
- Streaming ASR emits partial hypotheses every 200ms; Grok starts processing on partial transcripts instead of waiting for the final one.
- Streaming TTS starts playback as soon as the first sentence boundary is detected in the LLM output.
- Barge-in: VAD on the uplink stops the currently playing TTS and discards in-flight LLM tokens.
- Voice servers are stateful: anchor each session to one regional edge, with sticky routing via cookie / connection ID.
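The sentence-boundary and barge-in decisions above can be sketched together. This is a minimal illustration, not Grok's implementation: `TurnState`, its methods, and the generation counter used to drop stale tokens are all hypothetical names, the sentence detector is a naive regex that would mis-split abbreviations such as "e.g.", and appending to `spoken` stands in for handing a sentence to a real TTS engine.

```python
import re
from dataclasses import dataclass, field

# Naive sentence boundary: ., ! or ? followed by whitespace (an assumption;
# a production system would need something more robust).
SENTENCE_END = re.compile(r"[.!?]\s")


@dataclass
class TurnState:
    generation: int = 0   # bumped on every barge-in
    buffer: str = ""      # LLM text not yet sent to TTS
    spoken: list[str] = field(default_factory=list)  # stand-in for the TTS queue

    def on_llm_token(self, token: str, generation: int) -> None:
        """Accumulate tokens and emit each complete sentence to TTS."""
        if generation != self.generation:
            return  # in-flight token from a cancelled response: discard
        self.buffer += token
        while (match := SENTENCE_END.search(self.buffer)) is not None:
            sentence = self.buffer[: match.end()].strip()
            self.buffer = self.buffer[match.end():]
            self.spoken.append(sentence)

    def on_llm_done(self, generation: int) -> None:
        """Flush any trailing text when the LLM stream ends."""
        if generation == self.generation and self.buffer.strip():
            self.spoken.append(self.buffer.strip())
            self.buffer = ""

    def on_barge_in(self) -> int:
        """User started speaking: invalidate the current response."""
        self.generation += 1
        self.buffer = ""
        # A real system would also stop TTS playback and cancel the LLM request here.
        return self.generation


if __name__ == "__main__":
    state = TurnState()
    for tok in ["Hello", " there.", " How", " are"]:
        state.on_llm_token(tok, generation=0)
    print(state.spoken)             # first sentence is ready for TTS early
    gen = state.on_barge_in()       # user interrupts
    state.on_llm_token(" you?", 0)  # stale token is ignored
    state.on_llm_token("Sure. ", gen)
    print(state.spoken)
```

Tagging each token with the generation it belongs to lets the server ignore late tokens without waiting for the upstream cancel to complete. That is one way to keep barge-in responsive.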
Follow-ups
- How do you handle 30% packet loss without making Grok sound robotic?
- How do you keep voice latency under 800ms for a user in Jakarta?