O1 · Design a Webhook Delivery Platform
Verified source
Original prompt: "The system design problem focused on building a Webhook Delivery Platform." — OpenAI phone-screen round, LeetCode Discuss, 2026-02-12. Also reported on TeamBlind (2024-04-10) and in multiple Jobright / Interviewing.io threads. Credibility C (multiple independent candidate self-reports).
Requirements clarification
Get the semantics straight before you draw a box. Ask the interviewer:
- Are we just the delivery layer, or do we also produce/subscribe to events?
- Multi-tenant? (99% yes — shapes quotas, isolation, billing.)
- Is per-endpoint ordering required? (Forces single-consumer queues per endpoint → hot-spot risk.)
- What counts as "delivered" — a 2xx response or a signed ack?
- Retry window and backoff policy (the problem explicitly tests this).
- Is one event fanned out to multiple endpoints?
Back-of-envelope scale
Interviewers on this question used the phrase "billions of requests", i.e. 10⁹ delivery attempts/day ≈ ~12K attempts/sec average, peak ~50K/sec. This is a write-heavy workload: ingest and the attempt log dominate cost; reads serve only the developer dashboard and debugging.
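A quick arithmetic check of these numbers (the 4× peak-to-average ratio is an assumption; the prompt only gives the daily total):

```python
# Sanity-check the scale estimate from "billions of requests".
attempts_per_day = 1_000_000_000     # 1e9 delivery attempts per day
avg_rps = attempts_per_day / 86_400  # seconds per day
peak_rps = avg_rps * 4               # assumed ~4x peak-to-average ratio

print(f"average ~{avg_rps/1000:.1f}K/s, peak ~{peak_rps/1000:.0f}K/s")
# average ~11.6K/s, peak ~46K/s — consistent with "~12K avg, ~50K peak"
```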
Storage math
- Each attempt row ≈ 500 B → 500 GB/day → 15 TB/month.
- Keep the detail log for 30 days; keep downsampled aggregates forever.
- Event payloads: store once in an object store; rows keep only references.
High-level architecture
The core pattern: queue + state machine + idempotency. Decouple "receive event" from "deliver event"; every delivery runs in an async, retryable worker.
flowchart LR
A[Producer / Client] --> B[Ingest API]
B --> C[(Event Store)]
B --> D[Dispatch Queue]
D --> E[Delivery Workers]
E --> F[Target Endpoint]
E --> G[(Attempt Log)]
E --> H[Retry Scheduler]
H --> D
Component responsibilities
- Ingest API: auth, rate limiting, write to the Event Store, enqueue a dispatch task. Returns 202 immediately.
- Dispatch Queue: partitioned by endpoint_id (or tenant_id) to preserve ordering when required. Kafka or Kinesis works; a DB-backed queue is also fine at this scale if you partition well.
- Delivery Workers: HTTP call, signing, timeouts, retries, attempt-log writes, state updates. Stateless — scale horizontally at will.
- Retry Scheduler: exponential backoff + jitter, max-attempt / deadline checks, DLQ handoff.
API design (the minimum-viable set)
POST /v1/webhook_endpoints
{ url, secret, events[], enabled, rate_limit }
→ { endpoint_id }
POST /v1/events
{ type, payload, idempotency_key, target_endpoints[]? }
→ 202 { event_id }
GET /v1/events/{event_id}
GET /v1/events?since=&tenant_id=&cursor=
GET /v1/deliveries?endpoint_id=&event_id=&cursor=
→ attempt history (timestamps, http_status, latency, error_code)
Idempotency is non-negotiable
Apply idempotency at two points: (1) event submission (idempotency_key → dedup at ingest), and (2) attempt writes (a server-side attempt UUID). Without both, you'll see duplicate deliveries or duplicate billing.
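A minimal sketch of point (1), ingest-side dedup, using a unique index on (tenant_id, idempotency_key); SQLite stands in for the real event store:

```python
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE event (
        event_id        TEXT PRIMARY KEY,
        tenant_id       TEXT NOT NULL,
        idempotency_key TEXT NOT NULL,
        UNIQUE (tenant_id, idempotency_key)   -- the dedup guarantee
    )
""")

def ingest(tenant_id: str, idempotency_key: str) -> str:
    """Insert-or-return: a duplicate submission gets the original event_id."""
    event_id = str(uuid.uuid4())
    try:
        db.execute(
            "INSERT INTO event (event_id, tenant_id, idempotency_key) VALUES (?, ?, ?)",
            (event_id, tenant_id, idempotency_key),
        )
        return event_id
    except sqlite3.IntegrityError:
        # Unique index fired: look up and return the first submission's id.
        row = db.execute(
            "SELECT event_id FROM event WHERE tenant_id = ? AND idempotency_key = ?",
            (tenant_id, idempotency_key),
        ).fetchone()
        return row[0]

first = ingest("t1", "order-123")
second = ingest("t1", "order-123")   # duplicate submission -> same event_id
assert first == second
```

Note the key is scoped per tenant, so two tenants reusing "order-123" do not collide.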
Data model
WebhookEndpoint(
endpoint_id PK, tenant_id, url, secret, status,
event_filter[], rate_limit, created_at, updated_at
)
Event(
event_id PK, tenant_id, type, payload_ref (S3 key),
idempotency_key, status, created_at
)
DeliveryAttempt(
attempt_id PK, event_id FK, endpoint_id FK,
scheduled_at, started_at, finished_at,
http_status, error_code, retry_count,
latency_ms, response_hash
)
Index priorities:
(tenant_id, created_at) — dashboard & debug;
(endpoint_id, scheduled_at) — retry scheduler;
(event_id, endpoint_id) — delivery-history lookup.
Consistency / availability trade-offs
- At-least-once is the industry default; let receivers dedup on X-Webhook-Event-Id. Strict exactly-once is impossible across a network boundary.
- Per-endpoint ordering vs. throughput: ordering requires single-flight delivery per endpoint, which turns hot endpoints into bottlenecks. Offer ordered delivery as an opt-in.
- Append-only attempt log + materialized status view: at scale, use event sourcing for attempts and update the "current state" table via CDC — this avoids write amplification.
Performance bottlenecks & optimizations
Hot endpoint
One tenant's endpoint takes 80% of traffic. Mitigations: (1) an endpoint-level token-bucket rate limiter, (2) a partition-key design that spreads hot tenants across multiple partitions, (3) batched writes of attempt rows.
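Mitigation (1) might look like the following token bucket; the rates are illustrative, and a real deployment would likely keep buckets in Redis rather than in process memory:

```python
import time

class TokenBucket:
    """Per-endpoint limiter: `burst` tokens, refilled at `rate_per_s`."""

    def __init__(self, rate_per_s: float, burst: float):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Lazily refill based on elapsed time, capped at the burst size.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should reschedule the attempt, not drop it

bucket = TokenBucket(rate_per_s=100, burst=10)
allowed = sum(bucket.allow() for _ in range(50))
# Roughly the burst (~10) passes instantly; the rest must wait for refill.
```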
Retry storm
A target goes down and thousands of retries stack up. Mitigations: (1) a circuit breaker per domain — after N consecutive 5xx responses, open for T seconds, (2) error classification (4xx other than 429 → no retry), (3) honor Retry-After, (4) cap the global retry budget per minute.
Connection & DNS
At 50K RPS against thousands of domains: reuse HTTP keep-alive pools per host, cache DNS, set tiered timeouts (connect < TTFB < total), and use HTTP/2 where targets support it.
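Assuming Python's `requests` library on the worker side, connection reuse and tiered timeouts might be wired up like this (pool sizes, timeout values, and the signature header name are illustrative; `requests` exposes only connect/read timeouts, so the total deadline would be enforced by the worker's scheduler):

```python
import requests

# One Session per worker process: keep-alive pooling per target host.
session = requests.Session()
adapter = requests.adapters.HTTPAdapter(pool_connections=100, pool_maxsize=100)
session.mount("https://", adapter)

def deliver(url: str, body: bytes, signature: str) -> requests.Response:
    return session.post(
        url,
        data=body,
        headers={"X-Webhook-Signature": signature},
        timeout=(2, 10),  # 2s connect, 10s read
    )
```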
Cost model
Always offer a formula, not a made-up number:
compute_cost ≈ attempts_per_sec × avg_request_ms / 1000 × worker_cost_per_hour
storage_cost ≈ attempt_row_size × attempts × retention_days
egress_cost ≈ (request_bytes + response_bytes) × attempts × $/GB
# Hidden cost:
# retries + 429s can easily double egress over a baseline month.
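A worked example plugging the document's own numbers into these formulas; the payload sizes and $/GB price are assumptions, not figures from the prompt:

```python
attempts_per_day = 1_000_000_000
row_bytes = 500
retention_days = 30

# 1e9 attempts/day x 500 B x 30 days = 15 TB, matching the storage math above.
detail_log_tb = attempts_per_day * row_bytes * retention_days / 1e12

request_bytes, response_bytes = 2048, 512   # assumed average sizes
price_per_gb = 0.09                         # assumed egress $/GB
egress_usd = ((request_bytes + response_bytes) * attempts_per_day * 30
              / 1e9 * price_per_gb)

print(f"detail log: {detail_log_tb:.0f} TB, egress: ${egress_usd:,.0f}/month")
# detail log: 15 TB, egress: $6,912/month
```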
Observability & runbook
- SLIs: attempt success rate, p95 end-to-end latency, DLQ rate, per-tenant error rate.
- Every attempt row carries a trace_id for cross-service joining.
- Developer dashboard: search by event_id / endpoint_id, a manual-replay button, and a webhook-signature verifier.
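The signature verifier might be a straightforward HMAC check; HMAC-SHA256 and the header name are assumptions, since the prompt only says "signing":

```python
import hashlib
import hmac

def sign(secret: bytes, body: bytes) -> str:
    """Value the worker would put in X-Webhook-Signature."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()

def verify(secret: bytes, body: bytes, signature: str) -> bool:
    # Constant-time compare: avoids leaking the signature via timing.
    return hmac.compare_digest(sign(secret, body), signature)

sig = sign(b"endpoint-secret", b'{"type":"order.paid"}')
assert verify(b"endpoint-secret", b'{"type":"order.paid"}', sig)
assert not verify(b"wrong-secret", b'{"type":"order.paid"}', sig)
```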
Follow-up questions you should be ready for
- "How do you prevent duplicate deliveries?" — Server side: idempotency_key at ingest, attempt UUIDs in the workers. Client side: require receivers to dedup on X-Webhook-Event-Id.
- "Multi-tenant isolation?" — per-tenant rate limits, a queue partition per tenant, separate signing keys, an audit log, and optional dedicated worker pools for Enterprise plans.
- "How do I debug a missing delivery?" — the attempt log records every retry with status/error; trace_id joins ingest + queue + worker; all of it is exposed via the dashboard and the public API.
- "What if a target is down for 6 hours?" — delayed queue + circuit breaker; alert the tenant after M consecutive failures; events past their deadline go to the DLQ with a manual-replay UI.
- "How do you support ordered delivery?" — partition the queue by endpoint_id, run a single consumer per partition, and use a lease token so retries don't race with fresh attempts. Document that this caps throughput.