OpenAI ★★ Frequent Hard SSE TTFT Backpressure

O15 · Streaming Token Response System

Verified source

Prompt: "Design a Streaming Token Response System." — SystemDesignHandbook. Credibility D.

Two protocol options

SSE (Server-Sent Events): one-way, server→client, over plain HTTP; simple to operate; the client reconnects with the Last-Event-ID header.
WebSocket: bidirectional; needed when the client must send tool-use results or a cancel signal mid-stream over the same connection.
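To make the SSE option concrete, here is a minimal sketch of how a token could be framed as an SSE event and how a reconnect could be mapped back to a position in the stream. The `id:`, `event:`, and `data:` fields and the `Last-Event-ID` header come from the SSE format itself; the function names, the choice to use the token index as the event id, and the plain-dict header lookup are assumptions made for illustration.

```python
from typing import Mapping, Optional


def format_sse_event(token_index: int, token: str, event: Optional[str] = None) -> str:
    """Serialize one token as an SSE event (assumption: event id == token index)."""
    lines = [f"id: {token_index}"]
    if event is not None:
        lines.append(f"event: {event}")
    # A payload containing newlines must be split across several `data:` lines.
    for part in token.split("\n"):
        lines.append(f"data: {part}")
    # A blank line terminates the event.
    return "\n".join(lines) + "\n\n"


def resume_index(headers: Mapping[str, str]) -> int:
    """Return the next token index to send, based on Last-Event-ID (0 if absent).

    Simplification: real HTTP header names are case-insensitive; this sketch
    does an exact-case lookup on a plain mapping.
    """
    raw = headers.get("Last-Event-ID")
    try:
        return max(0, int(raw) + 1)
    except (TypeError, ValueError):
        # Missing or malformed id: start from the beginning.
        return 0
```

Using the token index as the event id means a reconnecting client needs no extra bookkeeping: the browser's EventSource resends the last id it saw, and the server continues from the next index.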

Architecture

flowchart LR
  C[Client] --> GW[HTTP Gateway]
  GW --> STREAM[Stream Handler]
  STREAM --> INF["Inference Worker (GPU)"]
  INF --> MOD[Inline Moderation]
  MOD --> STREAM
  STREAM --> C
  STREAM --> STATE[(Stream State Store)]
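The INF → MOD → STREAM path in the diagram can be sketched as a generator that forwards tokens immediately and runs a classifier once per chunk. This is only an illustration: `moderated_stream`, the `is_unsafe` callback, the event tuple shape, and the meaning of the retract event ("discard everything from this index onward") are all hypothetical, and a real classifier would likely score a sliding window rather than the full text.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Event = Tuple[str, int, str]  # (event type, token index, payload)


def moderated_stream(
    tokens: Iterable[str],
    is_unsafe: Callable[[str], bool],
    chunk_size: int = 16,
) -> Iterator[Event]:
    """Emit tokens right away; classify every `chunk_size` tokens.

    If a chunk is flagged, emit a hypothetical ("retract", start_index, "")
    event asking the client to drop tokens from start_index on, then stop.
    """
    emitted: List[str] = []
    for index, token in enumerate(tokens):
        emitted.append(token)
        yield ("token", index, token)  # forward first to keep latency low
        at_chunk_boundary = (index + 1) % chunk_size == 0
        # Sketch only: re-classifies all text so far, which is O(n^2) overall.
        if at_chunk_boundary and is_unsafe("".join(emitted)):
            yield ("retract", index + 1 - chunk_size, "")
            return
    # Classify the trailing partial chunk once the stream ends.
    tail = len(emitted) % chunk_size
    if tail and is_unsafe("".join(emitted)):
        yield ("retract", len(emitted) - tail, "")
```

Emitting before classifying keeps inter-token latency unaffected by the classifier, at the cost that up to `chunk_size` unsafe tokens may be visible briefly before the retract event arrives.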

Must-cover engineering points

TTFT (time-to-first-token) SLO: dominated by prefill time plus queue wait.
Backpressure: if the client reads slowly, buffer into a bounded queue; on overflow, close the stream with an error.
Reconnect: the client sends stream_id + last_token_index and the server resumes from the token after that index; this requires the server to keep N seconds of token history.
Inline moderation: run a chunked classifier every 10-20 tokens; already-emitted tokens can be retracted via a [retract] event.
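The backpressure and reconnect points can be sketched together as one per-stream buffer: appends fail once too many tokens are undelivered, and delivered tokens stay available for a fixed time window so a client can resume. Everything here is an assumption for illustration, including the class name `TokenStream`, the `max_pending` and `history_seconds` parameters, and the injectable `clock`; a production system would likely hold this state in the stream state store rather than in process memory.

```python
import time
from collections import deque
from typing import Callable, Deque, List, Tuple


class StreamOverflow(Exception):
    """Raised when the client falls too far behind; the stream is closed."""


class TokenStream:
    def __init__(self, stream_id: str, max_pending: int = 256,
                 history_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.stream_id = stream_id
        self.max_pending = max_pending
        self.history_seconds = history_seconds
        self.closed = False
        self._clock = clock
        self._tokens: Deque[Tuple[int, float, str]] = deque()  # (index, created_at, token)
        self._next_index = 0
        self._delivered = 0  # count of tokens the client is known to have received

    def append(self, token: str) -> int:
        """Buffer a token; close the stream if the bounded buffer would overflow."""
        if self.closed or self._next_index - self._delivered >= self.max_pending:
            self.closed = True
            raise StreamOverflow(f"stream {self.stream_id}: client too slow")
        index = self._next_index
        self._tokens.append((index, self._clock(), token))
        self._next_index += 1
        self._prune()
        return index

    def mark_delivered(self, last_token_index: int) -> None:
        """Record that the client has received tokens up to last_token_index."""
        self._delivered = max(self._delivered, last_token_index + 1)
        self._prune()

    def resume(self, last_token_index: int) -> List[Tuple[int, str]]:
        """Return (index, token) pairs after last_token_index, if still retained."""
        self._prune()
        start = last_token_index + 1
        oldest = self._tokens[0][0] if self._tokens else self._next_index
        if start < oldest:
            raise LookupError("requested tokens are older than the history window")
        return [(i, t) for i, _, t in self._tokens if i >= start]

    def _prune(self) -> None:
        # Drop only tokens that are both delivered and older than the window.
        cutoff = self._clock() - self.history_seconds
        while (self._tokens and self._tokens[0][0] < self._delivered
               and self._tokens[0][1] < cutoff):
            self._tokens.popleft()
```

Keeping undelivered tokens regardless of age, and pruning only delivered ones, means the time window bounds resume history while `max_pending` separately bounds memory held for a slow client.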
