OpenAI ★★ Frequent Hard State Machine · Circuit Breaker

O3 · Webhook Platform with External URL Lookup (24h Retry)

Verified source

Original prompt: "implement a webhook platform…create a webhook request. (cxid, json blob). url is queried from serviceB. retry for 24 hours." — LeetCode, 2024-10-30, screening system design. Credibility C.

What makes this different from O1/O2

The URL is not yours: it lives in ServiceB. That creates two risks: (1) which version of the URL gets bound to an event, and (2) ServiceB unavailability cascading into your ingest path.

Key design decisions

  • Resolve the URL at delivery time, not at ingest time. Deliveries then track ServiceB changes; the cost is one extra dependency on the delivery path.
  • Use a short-TTL config cache (1-5 min) to reduce ServiceB load and absorb jitter. Store resolved_url and serviceb_version on each attempt for traceability.
  • Put a circuit breaker on ServiceB: when it is degraded, delay retries, mark events blocked_on_dependency, and alert.
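
The cache and the breaker can sit in one small resolver that the worker calls just before each delivery. The following is a minimal sketch under stated assumptions, not a reference implementation: `UrlResolver`, `ResolvedUrl`, `DependencyUnavailable`, the `fetch` callable standing in for the ServiceB client, and the threshold values are all hypothetical names and numbers chosen for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


class DependencyUnavailable(Exception):
    """ServiceB cannot be used right now and no fresh cache entry exists."""


@dataclass(frozen=True)
class ResolvedUrl:
    url: str
    serviceb_version: str


class UrlResolver:
    """Short-TTL cache in front of ServiceB, guarded by a simple circuit breaker.

    `fetch` is a hypothetical ServiceB client call: cxid -> ResolvedUrl.
    """

    def __init__(
        self,
        fetch: Callable[[str], ResolvedUrl],
        ttl_s: float = 120.0,          # assumed value inside the 1-5 min range
        failure_threshold: int = 5,    # assumed: consecutive failures before opening
        open_s: float = 30.0,          # assumed: how long the breaker stays open
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl_s = ttl_s
        self._failure_threshold = failure_threshold
        self._open_s = open_s
        self._clock = clock
        self._cache: Dict[str, Tuple[float, ResolvedUrl]] = {}
        self._failures = 0
        self._open_until = 0.0

    def resolve(self, cxid: str) -> ResolvedUrl:
        now = self._clock()
        hit = self._cache.get(cxid)
        if hit is not None and hit[0] > now:
            return hit[1]  # fresh cache entry: ServiceB is not touched
        if now < self._open_until:
            # Breaker is open: fail fast so workers do not pile onto ServiceB.
            raise DependencyUnavailable("circuit open")
        try:
            resolved = self._fetch(cxid)
        except Exception as exc:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                self._open_until = now + self._open_s
                self._failures = 0
            raise DependencyUnavailable(str(exc)) from exc
        self._failures = 0
        self._cache[cxid] = (now + self._ttl_s, resolved)
        return resolved
```

On `DependencyUnavailable`, the worker would mark the event blocked_on_dependency and re-queue it with a delay instead of counting a failed delivery. This sketch has no separate half-open state: once the open period ends, the next call simply goes through to ServiceB again.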

Architecture

flowchart LR
  A[Create Webhook Request] --> B[Ingest API]
  B --> Q[Queue]
  Q --> W[Worker]
  W --> S[ServiceB: Get URL]
  W --> T[Deliver HTTP]
  W --> R[Retry until 24h]
  W --> DLQ[(DLQ)]

Data model (cxid-keyed)

WebhookRequest(
  cxid, request_id PK, payload_ref,
  created_at, deadline_at=created_at+24h,
  status, resolved_url, serviceb_version
)
Attempt(attempt_id, request_id, resolved_url, serviceb_version, http_status, ...)

deadline_at is a hard constraint: every retry-scheduling decision must check it, so that a backlog cannot cause infinite retries.
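
One way to enforce this is to route every re-queue through a single function that refuses to schedule past the deadline. The sketch below assumes timezone-aware datetimes; the function name `next_attempt_at` and the choice to drop an attempt that would land exactly on the deadline are illustrative assumptions, not part of the original prompt.

```python
from datetime import datetime, timedelta
from typing import Optional

RETRY_WINDOW = timedelta(hours=24)  # deadline_at = created_at + 24h


def next_attempt_at(
    deadline_at: datetime, now: datetime, delay: timedelta
) -> Optional[datetime]:
    """Return when to retry, or None if the request must go to the DLQ.

    Checking `now` first covers the backlog case: a request that sat in the
    queue past its deadline is never rescheduled, whatever the delay is.
    """
    if now >= deadline_at:
        return None
    candidate = now + delay
    if candidate >= deadline_at:
        return None  # assumed policy: no attempt at or after the deadline
    return candidate
```

A `None` result is the signal to stop retrying and hand the request to the DLQ path.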

24-hour retry policy

  • Exponential backoff + jitter, e.g. 1s, 2s, 4s, ... with the delay capped at 10–30 min, so that the 24h window still covers enough attempts.
  • Error classification: retry on DNS failures, timeouts, and 5xx; do not retry 4xx (except 429); for 429, obey Retry-After.
  • After the deadline: DLQ + tenant notification + manual replay button.
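
The policy above can be sketched as two small functions plus a quick count of how many retries fit in the window. Everything here is an assumption made for illustration: the 1s base, the 10-minute cap (the low end of the 10–30 min range), the "full jitter" variant (a uniform draw between zero and the exponential ceiling), the outcome labels, and the treatment of 3xx as non-retryable, which the original does not specify.

```python
import random
from typing import Callable, Optional, Tuple

BASE_S = 1.0   # assumed first delay
CAP_S = 600.0  # assumed cap: 10 minutes


def _ceiling(attempt: int) -> float:
    # Clamp the exponent so very large attempt numbers cannot overflow a float.
    return min(CAP_S, BASE_S * 2 ** min(attempt, 20))


def backoff_delay(attempt: int, rng: Callable[[], float] = random.random) -> float:
    """Full-jitter delay: uniform in [0, ceiling) for the given attempt."""
    return rng() * _ceiling(attempt)


def classify(
    http_status: Optional[int], retry_after_s: Optional[float] = None
) -> Tuple[str, Optional[float]]:
    """Map a delivery outcome to (action, delay override in seconds).

    http_status is None for transport failures such as DNS errors or timeouts.
    """
    if http_status is None:
        return ("retry", None)
    if 200 <= http_status < 300:
        return ("delivered", None)
    if http_status == 429:
        return ("retry", retry_after_s)  # honor Retry-After when present
    if 500 <= http_status < 600:
        return ("retry", None)
    return ("fail", None)  # other 4xx; 3xx is treated the same (assumption)


def retries_within(window_s: float) -> int:
    """Count retries that fit in the window at the un-jittered ceiling delays."""
    elapsed, attempt = 0.0, 0
    while elapsed + _ceiling(attempt) <= window_s:
        elapsed += _ceiling(attempt)
        attempt += 1
    return attempt
```

With these numbers, `retries_within(24 * 3600)` gives 152 retries at the ceiling delays: ten doubling steps (1s through 512s), then 142 retries at the 10-minute cap. Jitter shortens the average delay, so the real count would be higher, which is one reason the deadline check matters more than an attempt counter.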

Follow-ups

  1. What if ServiceB is down? The worker uses the cache plus the circuit breaker and re-queues with backoff; events are marked blocked_on_dependency to protect capacity.
  2. Idempotency across cxid? The unique key is (cxid, request_id); attempts are append-only.
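
The idempotency answer can be made concrete with a composite key and an insert-only attempts table. This is a sketch using SQLite from the Python standard library purely for illustration; the table layout is a trimmed-down assumption based on the data model above, and the helper names `create_request` and `record_attempt` are hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE webhook_request (
    cxid        TEXT NOT NULL,
    request_id  TEXT NOT NULL,
    payload_ref TEXT NOT NULL,
    PRIMARY KEY (cxid, request_id)   -- the idempotency key
);
CREATE TABLE attempt (
    attempt_id       INTEGER PRIMARY KEY AUTOINCREMENT,
    cxid             TEXT NOT NULL,
    request_id       TEXT NOT NULL,
    resolved_url     TEXT,
    serviceb_version TEXT,
    http_status      INTEGER
);
"""


def create_request(db: sqlite3.Connection, cxid: str, request_id: str,
                   payload_ref: str) -> bool:
    """Insert the request once; a duplicate (cxid, request_id) is a no-op.

    Returns True if this call created the row, False if it already existed.
    """
    cur = db.execute(
        "INSERT OR IGNORE INTO webhook_request (cxid, request_id, payload_ref) "
        "VALUES (?, ?, ?)",
        (cxid, request_id, payload_ref),
    )
    return cur.rowcount == 1


def record_attempt(db: sqlite3.Connection, cxid: str, request_id: str,
                   resolved_url: str, serviceb_version: str,
                   http_status: int) -> None:
    """Append-only: attempts are inserted, never updated or deleted."""
    db.execute(
        "INSERT INTO attempt (cxid, request_id, resolved_url, "
        "serviceb_version, http_status) VALUES (?, ?, ?, ?, ?)",
        (cxid, request_id, resolved_url, serviceb_version, http_status),
    )
```

A caller that retries the create call after a timeout gets `False` the second time and can safely return the existing request instead of enqueuing a duplicate.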

Related study-guide topics