OpenAI ★★ Frequent Hard · WebRTC · Streaming · Voice

O29 · Design a Realtime Voice Backend (OpenAI Realtime API)

Verified source

OpenAI Realtime API (docs), launched 2024-10. Blind reports from 2024-Q4. Credibility: A.

Architecture

```mermaid
flowchart LR
  C[Client] <-- WebRTC/WS --> GW[Edge Gateway]
  GW --> VAD[VAD / barge-in]
  VAD --> ASR[Streaming ASR]
  ASR --> LLM[Streaming LLM]
  LLM --> TTS[Streaming TTS]
  TTS --> GW
  GW <--> SESS[(Session State)]
```
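
The per-stage streaming in the diagram can be sketched as chained async generators, so no stage waits for its predecessor to finish. All stage functions below are hypothetical stand-ins, not the Realtime API itself:

```python
import asyncio
from typing import AsyncIterator

async def asr(frames: AsyncIterator[bytes]) -> AsyncIterator[str]:
    buf = b""
    async for frame in frames:
        buf += frame
        yield f"partial:{len(buf)}B"  # emit a partial transcript per frame

async def llm(partials: AsyncIterator[str]) -> AsyncIterator[str]:
    async for p in partials:
        yield f"token<{p}>"  # stream tokens as the transcript arrives

async def tts(tokens: AsyncIterator[str]) -> AsyncIterator[bytes]:
    async for t in tokens:
        yield t.encode()  # synthesize an audio chunk per token

async def mic() -> AsyncIterator[bytes]:
    for _ in range(3):
        yield b"\x00" * 160  # stand-in for real audio capture

async def main() -> list[bytes]:
    # Pipeline: mic -> ASR -> LLM -> TTS, fully incremental.
    return [chunk async for chunk in tts(llm(asr(mic())))]

audio_out = asyncio.run(main())
print(len(audio_out))  # 3: one output chunk per input frame, pipelined
```

The point of the composition is that first audio can play while the user is still being transcribed, which is what makes the 300 ms first-audio budget reachable.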

Key decisions

  • **WebRTC for audio**: UDP-based, tolerates jitter; fall back to WebSocket on networks that block UDP.
  • **Barge-in**: server-side VAD on the user channel; cancel LLM generation and TTS playback as soon as the user speaks.
  • **Streaming at every stage**: ASR emits partials every 100 ms; LLM tokens are piped directly into TTS.
  • **Session affinity**: sticky routing to the same GPU to reuse the KV cache; replay the last N turns when a session is rerouted.
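
The barge-in decision above can be sketched with task cancellation: when VAD fires on the user channel, the in-flight reply task is cancelled. Names here are illustrative:

```python
import asyncio

async def generate_reply() -> str:
    # Stand-in for streaming LLM + TTS; cancelled mid-flight on barge-in.
    try:
        await asyncio.sleep(10)
        return "full reply"
    except asyncio.CancelledError:
        # Real code would flush TTS playback buffers here before re-raising,
        # so stale audio stops immediately.
        raise

async def session() -> str:
    reply_task = asyncio.create_task(generate_reply())
    await asyncio.sleep(0.01)  # VAD fires: user started speaking
    reply_task.cancel()        # cancel generation + playback together
    try:
        await reply_task
    except asyncio.CancelledError:
        return "interrupted"
    return reply_task.result()

result = asyncio.run(session())
print(result)  # interrupted
```

Cancelling one task that owns both generation and playback keeps the two from drifting apart; cancelling them separately risks TTS finishing a sentence the LLM already abandoned.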

Follow-ups

  • First-audio budget under 300 ms? ASR 150 + LLM TTFT 80 + TTS 40 + network 30 = 300 ms.
  • Network blip? Jitter buffer plus sequence-number reassembly; tolerate up to a 1 s gap.
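
The jitter-buffer answer can be sketched as a min-heap that releases only the contiguous run of packets, holding playback across a gap until the missing packet arrives. This is a minimal illustration; real deployments key on RTP sequence numbers and add a late-packet timeout:

```python
import heapq

class JitterBuffer:
    """Reorders out-of-order packets and releases them in sequence."""

    def __init__(self) -> None:
        self.heap: list[tuple[int, bytes]] = []
        self.next_seq = 0

    def push(self, seq: int, payload: bytes) -> None:
        heapq.heappush(self.heap, (seq, payload))

    def pop_ready(self) -> list[bytes]:
        # Release only the contiguous run starting at next_seq; a gap
        # (lost or late packet) holds playback until it is filled.
        out = []
        while self.heap and self.heap[0][0] == self.next_seq:
            _, payload = heapq.heappop(self.heap)
            out.append(payload)
            self.next_seq += 1
        return out

jb = JitterBuffer()
jb.push(1, b"b")               # arrives early, out of order
assert jb.pop_ready() == []    # seq 0 still missing: hold
jb.push(0, b"a")
print(jb.pop_ready())          # [b'a', b'b']
```

The 1 s tolerance from the bullet above would be enforced by a timeout that skips `next_seq` forward (accepting an audible gap) instead of holding forever.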

Related study-guide topics