Why webhooks are hard

A webhook platform is the outbound equivalent of a public API: your service calls customers' HTTP endpoints when events happen (Stripe on charge.succeeded, GitHub on push, OpenAI on fine-tuning job completion). It looks simple — "just POST the JSON" — but customer endpoints are the most hostile dependency in the world:

  • Endpoints go down without warning; their p99 latency hits 10 seconds; they return 500 for hours.
  • Some are slow; a single tenant at 30s latency can starve a worker pool serving thousands.
  • Some return 200 but never processed the event (at-least-once is your only realistic guarantee).
  • Some get duplicate events; they need an Idempotency-Key to dedupe.
  • Some expect strict ordering per customer object (a customer.updated from 10s ago cannot land after the latest one).

Scale anchor: Stripe delivers billions of webhooks per month; p99 endpoint response ~2s; retry budget up to 72 hours; Svix quotes 99.99% eventual delivery. Those numbers don't happen by accident.

Source cross-reference

DDIA Ch.11 (stream processing, exactly-once / effectively-once, idempotency) is the theory. Acing SDI Ch.5 covers distributed transactions and idempotency keys. Stripe's and Svix's public engineering blogs are the best real-world references; OpenAI's webhook-for-fine-tuning docs describe the same pattern.

Delivery state machine

Every webhook attempt is a tiny state machine. Drawing it on the whiteboard signals senior-level thinking.

stateDiagram-v2
  [*] --> Pending
  Pending --> Delivering: worker picks up
  Delivering --> Delivered: 2xx response
  Delivering --> Retrying: 5xx / timeout / 429
  Delivering --> Failed: 4xx (except 408/429)
  Retrying --> Delivering: backoff elapsed
  Retrying --> DLQ: max_attempts or 72h
  Delivered --> [*]
  Failed --> [*]
  DLQ --> [*]

Status rules:

  • 2xx → Delivered. Stop retrying.
  • 408 (timeout), 429 (rate-limit), 5xx → Retrying with backoff.
  • Other 4xx (malformed endpoint, bad signature) → Failed. Don't burn retries on permanent errors.
  • After N attempts or wall-clock cutoff → DLQ.
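
The status rules above can be sketched as a small classifier (a sketch, not a prescribed API; the DLQ transition on max attempts/age happens outside this function):

```python
def classify(status):
    """Map an HTTP response (status=None means timeout) to the next delivery state."""
    if status is not None and 200 <= status <= 299:
        return "Delivered"      # success: stop retrying
    if status is None or status in (408, 429) or 500 <= status <= 599:
        return "Retrying"       # transient: retry with backoff
    return "Failed"             # other 4xx are permanent errors; don't burn retries
```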

Architecture

flowchart LR
  EV[Event Source] --> IQ[Ingest API]
  IQ --> DB[(Delivery Store<br/>Postgres)]
  IQ --> SCH[Scheduler]
  SCH --> WQ[Delivery Queue<br/>partitioned by endpoint_id]
  WQ --> W1[Worker pool]
  W1 -->|HTTPS POST + HMAC| EP[Customer Endpoint]
  W1 --> DB
  W1 -->|failed| RQ[Retry queue<br/>delayed]
  RQ --> WQ
  W1 -->|exhausted| DLQ[(DLQ)]
  DLQ --> UI[Dashboard replay]

Key design choice: the delivery queue is partitioned by destination endpoint, not by event. This is how you prevent one slow tenant from starving the fleet.

Retry, backoff, jitter, and DLQ

Backoff

Exponential backoff with jitter is the industry default. Stripe's published schedule is roughly: 10s, 1min, 5min, 15min, 1h, 2h, 4h, 8h, ..., capped at 72 hours. Concrete formula:

delay = min(base * 2**(attempt-1), cap)
jittered = random.uniform(0, delay)   # "full jitter" per AWS architecture blog

Why jitter? Without it, 10k failed deliveries all retry at exactly the same second and DDoS your own workers (and the customer's endpoint). Full jitter (uniform 0..delay) beats equal jitter (delay/2 + uniform(0, delay/2)) at reducing contention — AWS blog benchmarks show ~30% better completion time under high contention.
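
The formula as a runnable sketch (the base of 10s and cap of 8h are illustrative values mirroring the Stripe-like schedule above):

```python
import random

def next_delay(attempt, base=10.0, cap=8 * 3600):
    """Full-jitter exponential backoff: uniform(0, min(base * 2^(n-1), cap))."""
    delay = min(base * 2 ** (attempt - 1), cap)   # exponential climb, capped
    return random.uniform(0, delay)               # full jitter spreads the herd
```

The first retry lands somewhere in 0–10s; deep into the schedule every retry is spread uniformly over an 8-hour window, so a mass failure never reconverges on one second.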

Retry budget

  • Max attempts: typically 16–24 (by then the exponential climb has long hit the cap).
  • Max age: 24–72 hours (Stripe uses 72h).
  • After exhaustion → DLQ with full payload + error trace.

DLQ

The DLQ must be inspectable: a UI or API to list, filter, and replay failed deliveries after the customer fixes their endpoint. This is the #1 feature customers ask for. Don't use a raw Kafka topic without tooling.

Anti-pattern

Fixed 30-second retry with unlimited attempts. 10k stuck deliveries to one dead endpoint generate 10k RPS forever — you've built a DDoS gun pointing at your customer. Always use exponential + jitter + cutoff.

Idempotency and ordering

Idempotency

Because you retry, the customer will receive duplicates. They dedupe by a stable id. Contract:

  • Every event has a globally unique event.id — UUID v4 or Snowflake.
  • Send it in the body and the Webhook-ID header. Customer's INSERT ... ON CONFLICT DO NOTHING on that id makes processing exactly-once on their side.
  • Include event.created_at so customers can detect out-of-order delivery.
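
The consumer-side contract can be demonstrated with SQLite (`INSERT OR IGNORE` is SQLite's spelling of Postgres's `INSERT ... ON CONFLICT DO NOTHING`; table and column names are illustrative — in production the insert and the processing should share one transaction):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")

def handle_webhook(event_id, process):
    # Claim the id first: a duplicate delivery inserts 0 rows and is skipped.
    cur = conn.execute(
        "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
        (event_id,),
    )
    conn.commit()
    if cur.rowcount == 1:
        process()   # runs at most once per event_id
```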

Ordering

At-least-once delivery with retries means out-of-order is inevitable. Two strategies:

  1. Best-effort order (Stripe default) — deliver in approximate order; customer uses created_at to discard stale updates (last-write-wins by timestamp on their side).
  2. Strict per-key order (opt-in) — partition the delivery queue by (endpoint_id, object_key), e.g., customer_id. One in-flight message per partition. If delivery 1 fails, delivery 2 waits. This is how Kafka gives per-partition order.

Strict ordering costs throughput: one stuck object blocks its entire key's queue. Offer it as a tier setting, not a default.
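
Strategy 1's consumer side is only a few lines (a sketch; keys and timestamps are illustrative):

```python
latest = {}  # object_key -> created_at of the last event we applied

def apply_if_fresh(key, created_at, apply):
    """Last-write-wins: drop any event older than what's already applied."""
    if created_at <= latest.get(key, float("-inf")):
        return False          # stale duplicate or out-of-order: discard
    latest[key] = created_at
    apply()
    return True
```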

Poison-pill isolation and circuit breakers

The poison pill

A "poison pill" is a customer endpoint that fails repeatedly and ties up workers. Without isolation, 50 workers all serving the dead endpoint = 50 workers unavailable for everyone else. Solutions:

  • Per-endpoint concurrency cap: at most K in-flight deliveries per endpoint (typically K=8–16). Enforced with a semaphore or a per-endpoint queue.
  • Tenant-fair scheduling: worker picks the next eligible endpoint round-robin, not FIFO on the global queue. Prevents head-of-line blocking.
  • Quarantine queue: once an endpoint is circuit-broken, its messages go to a slow-lane queue with a tiny worker pool. Healthy tenants are unaffected.

Circuit breaker

Standard three states (closed / open / half-open). Trigger: error rate > 50% over the last minute, with at least 20 attempts. When open:

  • Stop sending to that endpoint for N minutes (typically 5–30).
  • Queue keeps accumulating; if age exceeds retry budget, DLQ.
  • Half-open: send a single probe; on success → closed, on failure → back to open with longer cooldown.
  • Alert the customer via dashboard banner + email. Stripe does this.
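
A minimal breaker matching the trigger above (a sketch; a production version would use a sliding one-minute window and grow the cooldown on repeated failures — the injectable clock just keeps this testable):

```python
import time

class CircuitBreaker:
    """closed -> open on high error rate; open -> half-open after cooldown."""

    def __init__(self, err_rate=0.5, min_attempts=20, cooldown=300,
                 clock=time.monotonic):
        self.err_rate, self.min_attempts, self.cooldown = err_rate, min_attempts, cooldown
        self.clock = clock
        self.state = "closed"
        self.failures = self.attempts = 0
        self.opened_at = None

    def allow(self):
        if self.state == "open" and self.clock() - self.opened_at >= self.cooldown:
            self.state = "half-open"   # cooldown elapsed: let one probe through
        return self.state != "open"

    def record(self, ok):
        if self.state == "half-open":
            # Single probe decides: success closes, failure re-opens.
            self.state = "closed" if ok else "open"
            if not ok:
                self.opened_at = self.clock()
            self.failures = self.attempts = 0
            return
        self.attempts += 1
        self.failures += 0 if ok else 1
        if (self.attempts >= self.min_attempts
                and self.failures / self.attempts > self.err_rate):
            self.state = "open"
            self.opened_at = self.clock()
            self.failures = self.attempts = 0
```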

Anthropic-specific

Anthropic's internal tooling for batch-processing webhooks (e.g., Message Batches API completion notifications) uses a similar per-endpoint circuit breaker. Batch completion events can arrive hours after submission, so the delivery platform must hold state long-term — a relational store (Postgres) is preferred over a short-TTL queue.

OpenAI-specific

OpenAI's webhook events for fine-tuning and file processing use signed HMAC headers and an X-OpenAI-Delivery id. Documented retry: up to 72 hours. Customers are advised to verify signatures before doing any database work — the canonical order is verify → dedupe by event id → process.

Security, observability, interview checklist

Security

  • HMAC signatures: X-Signature: t=<ts>,v1=hex(HMAC_SHA256(secret, ts + "." + body)). Include timestamp to prevent replay; reject if >5 min old.
  • TLS only — never POST secrets over plain HTTP.
  • IP allowlisting: publish your egress IPs so customer firewalls can whitelist.
  • Secret rotation: dual-secret window — customers verify against either for 24h, then old expires.
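
The signature scheme above in stdlib Python (header format and the 5-minute tolerance follow the first bullet; the secret value is illustrative):

```python
import hashlib
import hmac

def sign(secret, body, ts):
    """Produce the X-Signature value: t=<ts>,v1=HMAC_SHA256(secret, ts + "." + body)."""
    mac = hmac.new(secret, f"{ts}.".encode() + body, hashlib.sha256)
    return f"t={ts},v1={mac.hexdigest()}"

def verify(secret, body, header, now, tolerance=300):
    fields = dict(part.split("=", 1) for part in header.split(","))
    ts = int(fields["t"])
    if abs(now - ts) > tolerance:        # replay protection: reject stale timestamps
        return False
    expected = hmac.new(secret, f"{ts}.".encode() + body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, fields["v1"])   # constant-time compare
```

`hmac.compare_digest` matters: a naive `==` leaks where the first mismatching byte is via timing.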

Observability

Per-endpoint dashboard: success rate, p50/p95/p99 latency, attempt histogram, DLQ count, time-since-last-success. Expose to customers too; they will save you support tickets. Global SLO: > 99.9% of events delivered within 5 minutes of creation.

Anti-patterns

  • Synchronous delivery inside the event producer. Your checkout pipeline should never block on POSTing to a customer. Always enqueue.
  • Single global FIFO queue. One slow customer = everybody slow. Partition by endpoint_id.
  • Retrying 4xx. Endpoint says 400 = you sent bad JSON; retrying doesn't fix it.
  • No timeout cap. Workers blocked 60s by a hanging endpoint — you bleed capacity. Use 15–30s timeout.
  • No DLQ replay UI. Customers call your oncall at 3am begging to redeliver yesterday's event.

Whiteboard checklist: producer → delivery store + scheduler; partitioned queue by endpoint_id; worker with HMAC + 15s timeout; retry state machine with exponential backoff + full jitter + 72h cap; idempotency via event.id and Webhook-ID header; per-endpoint circuit breaker; DLQ with replay UI; per-endpoint observability.
