O15 · Streaming Token Response System
Verified source
Prompt: "Design a Streaming Token Response System." — SystemDesignHandbook. Credibility D.
Two protocol options
- SSE: one-way server→client; simple; reconnect via Last-Event-ID.
- WebSocket: bi-directional; supports tool use and mid-stream cancel on the same connection.
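To make the SSE option concrete, here is a minimal sketch of how a token could be framed on the wire and how a reconnecting client's Last-Event-ID header could be turned into a resume position. The field names (id, event, data) come from the SSE wire format. The function names, the "token" event name, and the choice to use the token index as the event id are assumptions for illustration, not part of the original prompt.

```python
from typing import Optional


def format_sse_event(data: str, event: Optional[str] = None,
                     event_id: Optional[str] = None) -> str:
    """Serialize one event in the text/event-stream wire format."""
    lines = []
    if event_id is not None:
        # The browser echoes the last seen id back as Last-Event-ID on reconnect.
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    # A payload containing newlines is sent as one "data:" line per line.
    for part in data.split("\n"):
        lines.append(f"data: {part}")
    # A blank line terminates the event.
    return "\n".join(lines) + "\n\n"


def next_index_from_last_event_id(header_value: Optional[str]) -> int:
    """Return the first token index to send after a reconnect.

    Assumes (hypothetically) that the server used the token index as the
    event id. A missing or malformed header means a fresh stream (index 0).
    """
    if header_value is None:
        return 0
    try:
        last_seen = int(header_value.strip())
    except ValueError:
        return 0
    return last_seen + 1 if last_seen >= 0 else 0
```

For example, format_sse_event("Hello", event="token", event_id="7") produces the three lines "id: 7", "event: token", "data: Hello" followed by a blank line, and a reconnect carrying Last-Event-ID: 7 would resume at index 8.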
Architecture
flowchart LR
    C[Client] --> GW[HTTP Gateway]
    GW --> STREAM[Stream Handler]
    STREAM --> INF["Inference Worker (GPU)"]
    INF --> MOD[Inline Moderation]
    MOD --> STREAM
    STREAM --> C
    STREAM --> STATE[(Stream State Store)]
Must-cover engineering points
- TTFT (time-to-first-token) SLO: dominated by prefill time plus queue wait.
- Backpressure: if the client is slow, buffer up to a fixed bound; on overflow, close the stream with an error.
- Reconnect: the client sends stream_id + last_token_index and the server resumes after that index; this requires the server to keep N seconds of token history.
- Inline moderation: run a chunk-level classifier every 10-20 tokens; already-emitted tokens can be retracted via a [retract] event.
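The backpressure and reconnect points can be sketched together as a small in-memory stream state. This is a minimal sketch under stated assumptions, not a definitive design: the class and exception names, the client-acknowledgement mechanism used to measure how far behind the client is, and the default limits are all hypothetical. Only stream_id, last_token_index, the bounded buffer, and the N-second history window come from the notes.

```python
import time
from collections import deque
from typing import Callable, Deque, List, Tuple


class StreamOverflow(Exception):
    """The client fell too far behind; close the stream with an error."""


class ResumeExpired(Exception):
    """The requested tokens are no longer in the retained history."""


class StreamState:
    def __init__(self, stream_id: str, max_unacked: int = 256,
                 history_ttl_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.stream_id = stream_id
        self.max_unacked = max_unacked      # bound on the send buffer
        self.history_ttl_s = history_ttl_s  # the "N seconds" of history
        self._clock = clock
        self._history: Deque[Tuple[int, str, float]] = deque()
        self._next_index = 0
        self._last_acked = -1  # assumption: the client acknowledges indices

    def append(self, token: str) -> int:
        """Record a newly generated token and return its index."""
        unacked = self._next_index - (self._last_acked + 1)
        if unacked >= self.max_unacked:
            raise StreamOverflow(f"{self.stream_id}: {unacked} tokens unacked")
        self._evict_expired()
        index = self._next_index
        self._history.append((index, token, self._clock()))
        self._next_index += 1
        return index

    def ack(self, index: int) -> None:
        """Note that the client has received everything up to index."""
        self._last_acked = max(self._last_acked,
                               min(index, self._next_index - 1))

    def resume_from(self, last_token_index: int) -> List[Tuple[int, str]]:
        """Return (index, token) pairs after last_token_index."""
        self._evict_expired()
        start = max(0, last_token_index + 1)
        if start >= self._next_index:
            return []  # the client is already up to date
        if not self._history or self._history[0][0] > start:
            raise ResumeExpired(f"{self.stream_id}: index {start} expired")
        self._last_acked = max(self._last_acked, start - 1)
        return [(i, t) for i, t, _ in self._history if i >= start]

    def _evict_expired(self) -> None:
        """Drop tokens older than the retention window."""
        cutoff = self._clock() - self.history_ttl_s
        while self._history and self._history[0][2] < cutoff:
            self._history.popleft()
```

In a real deployment this state would more likely live in the Stream State Store from the architecture diagram than in process memory, so that a reconnect landing on a different handler can still resume.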
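The inline moderation point can be illustrated with a small generator that emits tokens immediately and checks them in chunks afterwards. This is a sketch under assumptions: the event dictionaries, the field names from_index and to_index, the is_unsafe callback, and the decision to stop the stream after a retraction are all hypothetical. The notes specify only a chunked classifier every 10-20 tokens and a [retract] event.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Union

Event = Dict[str, Union[str, int]]


def moderated_events(tokens: Iterable[str],
                     is_unsafe: Callable[[str], bool],
                     chunk_size: int = 16) -> Iterator[Event]:
    """Yield token events at once; classify each chunk after it is sent.

    If a chunk is flagged, yield a retract event covering that chunk's
    index range and stop the stream (an assumed policy).
    """
    chunk_start = 0
    chunk: List[str] = []
    for index, token in enumerate(tokens):
        # Emit first so moderation does not add per-token latency.
        yield {"event": "token", "index": index, "text": token}
        chunk.append(token)
        if len(chunk) == chunk_size:
            if is_unsafe("".join(chunk)):
                yield {"event": "retract",
                       "from_index": chunk_start, "to_index": index}
                return
            chunk_start, chunk = index + 1, []
    # Classify the trailing partial chunk once generation ends.
    if chunk and is_unsafe("".join(chunk)):
        yield {"event": "retract", "from_index": chunk_start,
               "to_index": chunk_start + len(chunk) - 1}
```

The trade-off shown here is that the client may briefly display text that is later withdrawn. A smaller chunk_size shortens that window at the cost of more classifier calls.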