OpenAI ★★ Frequent · Hard · SSE · TTFT · Backpressure

O15 · Streaming Token Response System

Verified source

Prompt: "Design a Streaming Token Response System." — SystemDesignHandbook. Credibility D.

Two protocol options

  • SSE: one-way server→client; simple; reconnect via Last-Event-ID.
  • WebSocket: bi-directional; required for tool use / mid-stream cancel.
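The SSE reconnect mechanism mentioned above hinges on the `id:` field of each event: the client echoes the last id it saw in a `Last-Event-ID` header, and the server resumes from there. A minimal sketch of the server-side framing (the helper name is illustrative, not a real API):

```python
def format_sse_event(token: str, index: int) -> str:
    # Frame one token as a Server-Sent Events message. The id: field
    # is what the client sends back as Last-Event-ID on reconnect,
    # letting the server resume from the correct token index.
    return f"id: {index}\ndata: {token}\n\n"
```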

Architecture

flowchart LR
  C[Client] --> GW[HTTP Gateway]
  GW --> STREAM[Stream Handler]
  STREAM --> INF[Inference Worker (GPU)]
  INF --> MOD[Inline Moderation]
  MOD --> STREAM
  STREAM --> C
  STREAM --> STATE[(Stream State Store)]
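The Stream Handler box above sits between a fast producer (the GPU worker) and a possibly slow consumer (the client), which is where backpressure is enforced. A minimal sketch, assuming a bounded in-memory buffer per stream (class and exception names are hypothetical):

```python
import queue


class StreamClosed(Exception):
    """Raised when a stream is torn down due to buffer overflow."""


class StreamHandler:
    # Relays tokens from the inference worker to the client through a
    # bounded buffer. If the client drains too slowly the buffer fills
    # and the stream is closed with an error (backpressure), rather
    # than stalling the GPU worker or buffering without limit.
    def __init__(self, max_buffered: int = 64):
        self.buf = queue.Queue(maxsize=max_buffered)
        self.closed = False

    def on_token(self, token: str) -> None:
        try:
            self.buf.put_nowait(token)  # worker side: never blocks
        except queue.Full:
            self.closed = True          # overflow: drop the stream
            raise StreamClosed("client too slow; buffer overflow")

    def next_token(self) -> str:
        return self.buf.get()           # client side: drains buffer
```

The design choice here is to fail the stream on overflow instead of blocking the producer, so one slow client cannot hold a GPU worker hostage.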

Must-cover engineering points

  • TTFT (time-to-first-token) SLO: dominated by prefill latency plus queue wait.
  • Backpressure: if the client reads slowly, buffer only a bounded number of tokens; on overflow, close the stream with an error.
  • Reconnect: stream_id + last_token_index → resume from that index; requires the server to retain the last N seconds of token history.
  • Inline moderation: run a chunk-level classifier every 10-20 tokens; already-emitted tokens can be withdrawn via a [retract] event.
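The reconnect point above can be sketched with a per-stream history buffer. This version bounds history by token count rather than the doc's "N seconds" (a production system would evict by timestamp instead); the class and method names are illustrative:

```python
from collections import deque


class TokenHistory:
    # Keeps the most recent tokens of one stream so a client that
    # reconnects with (stream_id, last_token_index) can be replayed
    # exactly the tokens it missed. If the gap has already been
    # evicted, the client must restart the request from scratch.
    def __init__(self, capacity: int = 256):
        self.tokens: deque = deque(maxlen=capacity)  # (index, token)

    def append(self, index: int, token: str) -> None:
        self.tokens.append((index, token))

    def resume_from(self, last_token_index: int) -> list:
        # Oldest retained index must be contiguous with the client's
        # position; otherwise part of the gap is gone forever.
        if self.tokens and self.tokens[0][0] > last_token_index + 1:
            raise KeyError("history evicted; cannot resume")
        return [t for i, t in self.tokens if i > last_token_index]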
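The inline-moderation point can be sketched as a generator that interleaves token events with periodic classification; `classify` is a hypothetical hook standing in for the real moderation model, and the event dicts approximate the [retract] protocol described above:

```python
def moderate_stream(tokens, classify, chunk_size=16):
    # Yields one event per token, and every chunk_size tokens runs the
    # classifier over the latest chunk. If the chunk is flagged, emits
    # a [retract] event telling the client how many already-delivered
    # tokens to remove from its display.
    emitted = []
    for tok in tokens:
        emitted.append(tok)
        yield {"event": "token", "data": tok}
        if len(emitted) % chunk_size == 0:
            chunk = "".join(emitted[-chunk_size:])
            if classify(chunk):  # hypothetical moderation hook
                yield {"event": "retract", "count": chunk_size}
                del emitted[-chunk_size:]
```

Retraction only helps for UI rendering; once bytes have reached the client they cannot truly be unsent, which is why the classifier runs inline rather than post hoc.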

Related study-guide topics