OpenAI ★★★ Frequent Hard Classical SD Platform

O1 · Design a Webhook Delivery Platform

Verified source

Original prompt: "The system design problem focused on building a Webhook Delivery Platform." — OpenAI phone-screen round, LeetCode Discuss, 2026-02-12. Also reported on TeamBlind (2024-04-10) and in multiple Jobright / Interviewing.io threads. Credibility C (multiple independent candidate self-reports).

Requirements clarification

Get the semantics straight before you draw a box. Ask the interviewer:

  • Are we just the delivery layer, or do we also produce/subscribe events?
  • Multi-tenant? (99% yes — shapes quotas, isolation, billing.)
  • Is per-endpoint ordering required? (Forces a single consumer per endpoint → hot-spot risk.)
  • What counts as "delivered" — a 2xx response, or a signed ack?
  • What are the retry window and backoff policy? (The problem explicitly tests this.)
  • Is one event fanned out to multiple endpoints?

Back-of-envelope scale

Interviewers on this question used the phrase "billions of requests", i.e. ~10⁹ delivery attempts/day ≈ ~12K attempts/sec average, peak ~50K/sec. This is a write-heavy workload: ingest + the attempt log dominate; reads serve the developer dashboard and debugging.

Storage math

  • Each attempt row ≈ 500 B → 500 GB/day → 15 TB/month.
  • Keep the detailed log for 30 days; keep downsampled aggregates forever.
  • Event payloads: store once in the object store; attempt rows keep only references.

High-level architecture

The core pattern: queue + state machine + idempotency. Decouple "receive event" from "deliver event"; every delivery runs in an async retryable worker.

flowchart LR
  A[Producer / Client] --> B[Ingest API]
  B --> C[(Event Store)]
  B --> D[Dispatch Queue]
  D --> E[Delivery Workers]
  E --> F[Target Endpoint]
  E --> G[(Attempt Log)]
  E --> H[Retry Scheduler]
  H --> D
          

Component responsibilities

  • Ingest API: auth, rate limit, write to the Event Store, enqueue a dispatch task. Returns 202 immediately.
  • Dispatch Queue: partitioned by endpoint_id (or tenant_id) to preserve ordering where required. Kafka or Kinesis works; a well-partitioned DB-backed queue is also fine at this scale.
  • Delivery Workers: HTTP call, signing, timeout, retry, attempt-log write, state update. Stateless — scale horizontally at will.
  • Retry Scheduler: exponential backoff + jitter, max-attempt / deadline checks, DLQ handoff.

API design (the minimum viable set)

POST /v1/webhook_endpoints
  { url, secret, events[], enabled, rate_limit }
  → { endpoint_id }

POST /v1/events
  { type, payload, idempotency_key, target_endpoints[]? }
  → 202 { event_id }

GET  /v1/events/{event_id}
GET  /v1/events?since=&tenant_id=&cursor=
GET  /v1/deliveries?endpoint_id=&event_id=&cursor=
  → attempt history (timestamps, http_status, latency, error_code)

Idempotency is non-negotiable

Apply idempotency at two points: (1) event submission (idempotency_key → dedup at ingest), and (2) the attempt write (a server-side attempt UUID). Without both, you will see duplicate deliveries or duplicate billing.
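The ingest-side half can be sketched as below. `IngestDeduper` is a hypothetical name, and the in-memory dict stands in for what would really be a unique index on (tenant_id, idempotency_key) or a Redis SETNX.

```python
import uuid


class IngestDeduper:
    """Sketch of ingest-side dedup: the first submit for a given
    (tenant_id, idempotency_key) wins; replays return the same event_id
    instead of creating a second event."""

    def __init__(self) -> None:
        self._seen: dict[tuple[str, str], str] = {}

    def submit(self, tenant_id: str, idempotency_key: str) -> tuple[str, bool]:
        key = (tenant_id, idempotency_key)
        if key in self._seen:
            return self._seen[key], False      # duplicate: same event_id, no enqueue
        event_id = str(uuid.uuid4())
        self._seen[key] = event_id
        return event_id, True                  # fresh event: write store + enqueue
```

Returning the original event_id on replay lets clients retry POST /v1/events blindly after a timeout.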

Data model

WebhookEndpoint(
  endpoint_id PK, tenant_id, url, secret, status,
  event_filter[], rate_limit, created_at, updated_at
)
Event(
  event_id PK, tenant_id, type, payload_ref (S3 key),
  idempotency_key, status, created_at
)
DeliveryAttempt(
  attempt_id PK, event_id FK, endpoint_id FK,
  scheduled_at, started_at, finished_at,
  http_status, error_code, retry_count,
  latency_ms, response_hash
)

Index priorities: (tenant_id, created_at) — dashboard & debugging; (endpoint_id, scheduled_at) — retry scheduler; (event_id, endpoint_id) — delivery-history lookup.

Consistency / availability trade-offs

  • At-least-once is the industry default; let receivers dedup on X-Webhook-Event-Id. Strict exactly-once is impossible across a network boundary.
  • Per-endpoint ordering vs. throughput: ordering requires single-flight per endpoint, which turns hot endpoints into bottlenecks. Offer ordered delivery as an opt-in.
  • Append-only attempt log + materialized status view: at scale, event-source the attempts and update the "current state" table via CDC — this avoids write amplification.

Performance bottlenecks & optimizations

Hot endpoint

One tenant's endpoint takes 80% of traffic. Solutions: (1) an endpoint-level token-bucket rate limiter, (2) a partition-key design that spreads hot tenants across multiple partitions, (3) batched writes of attempt rows.

Retry storm

A target goes down → thousands of retries stack up. Solutions: (1) a circuit breaker per domain — after N consecutive 5xx responses, open for T seconds; (2) error classification (4xx except 429 → no retry); (3) honor Retry-After; (4) cap the global retry budget per minute.
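Solution (1) can be sketched as below; the class name, threshold, and cooldown are illustrative assumptions. The `now` parameter again keeps the sketch clock-free.

```python
class DomainCircuitBreaker:
    """Per-domain breaker: open after `threshold` consecutive 5xx,
    stay open for `cooldown` seconds, then allow a half-open probe."""

    def __init__(self, threshold: int = 5, cooldown: float = 60.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures = 0
        self.opened_at: float | None = None

    def allow(self, now: float) -> bool:
        if self.opened_at is None:
            return True                               # closed: deliver normally
        return now - self.opened_at >= self.cooldown  # open: only probe after cooldown

    def record(self, http_status: int, now: float) -> None:
        if http_status >= 500:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = now                  # trip (or re-trip) the breaker
        else:
            self.failures = 0
            self.opened_at = None                     # success: close the breaker
```

While the breaker is open, attempts for that domain skip straight to the delayed queue instead of burning worker capacity.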

Connection & DNS

At 50K RPS against thousands of domains: reuse HTTP keep-alive pools per host, cache DNS, set tiered timeouts (connect < TTFB < total), and use HTTP/2 where targets support it.

Cost model

Always offer a formula, not a made-up number:

compute_cost   ≈ attempts_per_sec × avg_request_ms / 1000 × worker_cost_per_hour
storage_cost   ≈ attempt_row_size × attempts × retention_days
egress_cost    ≈ (request_bytes + response_bytes) × attempts × $/GB
# Hidden cost:
# retries + 429s can easily double egress over a baseline month.
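Plugging the scale from above into those formulas looks like this; the request/response sizes and the egress unit price are illustrative assumptions, not quoted cloud prices.

```python
attempts_per_day = 1e9     # "billions of requests" from the estimate above
row_bytes = 500            # per attempt row
retention_days = 30

# Attempt-log footprint: matches the 15 TB/month figure above.
storage_gb = attempts_per_day * row_bytes * retention_days / 1e9   # -> 15_000 GB

# Egress: assumed 2 KiB request + 512 B response, assumed $0.09/GB.
req_bytes, resp_bytes = 2048, 512
egress_price_per_gb = 0.09
egress_usd = ((req_bytes + resp_bytes) * attempts_per_day * 30 / 1e9
              * egress_price_per_gb)                               # per month
```

Present the formula and let the interviewer plug in their own prices; the structure matters more than the constants.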

Observability & runbook

  • SLIs: attempt success rate, p95 end-to-end latency, DLQ rate, per-tenant error rate.
  • Every attempt row carries a trace_id for cross-service joins.
  • Developer dashboard: search by event_id / endpoint_id, a manual-replay button, a webhook-signature verifier.
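The signature scheme behind that verifier can be sketched as below. This assumes a Stripe-style scheme (HMAC-SHA256 over "timestamp.body") — a design choice, not something the prompt mandates; including the timestamp lets receivers reject replays.

```python
import hashlib
import hmac


def sign(secret: bytes, timestamp: str, body: bytes) -> str:
    """HMAC-SHA256 over "<timestamp>.<body>"; sent alongside the payload
    in headers such as X-Webhook-Timestamp / X-Webhook-Signature."""
    message = timestamp.encode() + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, timestamp: str, body: bytes, signature: str) -> bool:
    """Constant-time comparison so verification doesn't leak timing info."""
    return hmac.compare_digest(sign(secret, timestamp, body), signature)
```

Receivers should also reject timestamps older than a small skew window (e.g. 5 minutes) to close the replay hole entirely.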

Follow-up questions you should be ready for

  1. "How do you prevent duplicate deliveries?" — Server side: idempotency_key at ingest, attempt UUID in the workers. Client side: require receivers to dedup on X-Webhook-Event-Id.
  2. "Multi-tenant isolation?" — per-tenant rate limits, a queue partition per tenant, separate signing keys, an audit log, and optional dedicated worker pools for Enterprise plans.
  3. "How do I debug a missing delivery?" — the attempt log records every retry with status/error; trace_id joins ingest + queue + worker; all exposed via the dashboard and public API.
  4. "What if a target is down for 6 hours?" — delayed queue + circuit breaker; alert the tenant after M consecutive failures; events past the deadline go to the DLQ with a manual-replay UI.
  5. "How do you support ordered delivery?" — partition the queue by endpoint_id + one consumer per partition + a lease token so retries don't race fresh attempts. Document that this caps throughput.
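The partitioning step in answer 5 can be sketched as a stable hash; the partition count is an assumption, and a real deployment would normally delegate this to the queue's own partitioner.

```python
import hashlib


def partition_for(endpoint_id: str, num_partitions: int = 64) -> int:
    """Stable partition choice: every event for a given endpoint lands in
    the same partition, so one consumer per partition preserves order."""
    digest = hashlib.sha256(endpoint_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Using a cryptographic hash rather than Python's built-in `hash()` keeps the mapping stable across processes and restarts.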

Related study-guide topics