Requirements and back-of-envelope

A rate limiter throttles traffic per identity (user, API key, IP, tenant, model) to protect downstream services and enforce quota. Start the interview by asking three clarifying questions:

  • What is the identity key? Per-user, per-API-key, per-IP, per-org, or composite (e.g., (org_id, model) for Anthropic's claude-3-5-sonnet RPM).
  • What is the unit? Requests, bytes, tokens, or dollars. LLM APIs enforce both RPM (requests/minute) and TPM (tokens/minute) — they are independent limits.
  • Hard or soft? Reject with HTTP 429 + Retry-After, shed to a cheaper tier, or queue with backpressure.

Scale sanity check: at 1M RPS, a naive per-request database write is impossible. A single Redis node handles ~100k ops/s; you must shard by key. Memory: one counter per (key, window) = ~100 bytes; 10M keys × 100B ≈ 1 GB — comfortable on one replica. Latency budget: <1ms p99 added to every request, which rules out synchronous DB writes.
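The arithmetic above is worth scripting once so the numbers are reproducible. The constants below are this paragraph's estimates, not measurements:

```python
# Back-of-envelope check using the estimates from the text above.
PEAK_RPS = 1_000_000
REDIS_OPS_PER_NODE = 100_000       # rough single-node ceiling
BYTES_PER_COUNTER = 100            # ~100 B per (key, window) entry
ACTIVE_KEYS = 10_000_000

shards_needed = PEAK_RPS // REDIS_OPS_PER_NODE           # shards to absorb peak ops
memory_gb = ACTIVE_KEYS * BYTES_PER_COUNTER / 1e9        # total counter memory

print(shards_needed, memory_gb)   # 10 shards, ~1 GB
```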

Source cross-reference

Alex Xu V1 Ch.4 walks the four algorithms and the Redis implementation sketch. Acing SDI Ch.8 adds the control-plane (quota service) and discusses functional partitioning. ByteByteGo's "Top 5 Rate-Limiting Algorithms" article is the quick visual.

Four canonical algorithms

Memorize these four — any interview follow-up is a variant.

| Algorithm | Idea | Burst | Memory/key | Typical use |
| --- | --- | --- | --- | --- |
| Token bucket | Bucket refills at rate r, capacity b; each request consumes 1 (or N) tokens. | Allows bursts up to b. | 16 B (tokens, last_refill_ts) | Public APIs (Stripe, AWS, Anthropic). |
| Leaky bucket | Fixed-rate queue drains at r; overflows drop. | Strictly smooths. | ~24 B + queue | Outbound (SMS, email gateways). |
| Fixed window counter | Counter per wall-clock minute. | 2× burst at boundary. | 8 B | Cheap, coarse. |
| Sliding window log | Sorted set of exact timestamps, trim old. | Exact. | N × 16 B (N = limit) | Small limits where accuracy matters. |
| Sliding window counter | Weighted blend of current + prior fixed window. | ~5% error. | 16 B | The pragmatic default. |

Token bucket in one formula: on each request tokens = min(b, tokens + (now - last_refill) * r); if tokens >= cost: tokens -= cost; allow else deny. This is the algorithm Anthropic and OpenAI use for published RPM/TPM.
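The formula translates directly into a single-process sketch. This is illustrative only (the `now` parameter is injectable for testing; a distributed version needs the Redis + Lua approach below):

```python
import time

class TokenBucket:
    """Single-process token bucket implementing the refill formula above."""

    def __init__(self, rate, capacity, now=None):
        self.rate = float(rate)          # tokens refilled per second (r)
        self.capacity = float(capacity)  # burst ceiling (b)
        self.tokens = float(capacity)    # start full
        self.last_refill = time.monotonic() if now is None else now

    def allow(self, cost=1.0, now=None):
        now = time.monotonic() if now is None else now
        # tokens = min(b, tokens + (now - last_refill) * r)
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Note the clock is monotonic, never client-supplied — see the clock-skew failure mode later.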

Sliding window counter is the sweet spot for 100k-RPS gateways: count = curr_window + prev_window * (1 - elapsed_fraction) — the current window counted in full, plus the previous window weighted by how much of it still overlaps the sliding window. Two 8-byte counters, no sorted-set cost.
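A minimal sketch of that estimate (current window taken in full; only the previous window is weighted by its remaining overlap):

```python
def sliding_window_count(prev_count, curr_count, now, window_s=60.0):
    """Estimate requests in the trailing window_s seconds from two
    fixed-window counters."""
    elapsed_fraction = (now % window_s) / window_s  # how far into curr window
    return curr_count + prev_count * (1.0 - elapsed_fraction)

def sliding_window_allow(prev_count, curr_count, now, limit, window_s=60.0):
    return sliding_window_count(prev_count, curr_count, now, window_s) < limit
```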

Distributed implementation (Redis + Lua)

Any in-memory counter in a single app server is wrong the moment you autoscale. The standard pattern: Redis with an atomic Lua script.

-- KEYS[1]=bucket_key  ARGV={capacity, refill_rate, now_ms, cost}
local b = redis.call('HMGET', KEYS[1], 't', 'ts')
local tokens = tonumber(b[1]) or tonumber(ARGV[1])
local ts     = tonumber(b[2]) or tonumber(ARGV[3])
local delta  = (tonumber(ARGV[3]) - ts) * tonumber(ARGV[2]) / 1000
tokens = math.min(tonumber(ARGV[1]), tokens + delta)
if tokens < tonumber(ARGV[4]) then return {0, tokens} end
tokens = tokens - tonumber(ARGV[4])
redis.call('HMSET', KEYS[1], 't', tokens, 'ts', ARGV[3])
redis.call('PEXPIRE', KEYS[1], 60000)
return {1, tokens}

Lua runs atomically inside the Redis single thread — no race between HMGET and HMSET. Cost: one RTT (~0.3 ms intra-AZ). Shard by hashing (api_key, model); Redis Cluster handles resharding.
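Computing a stable shard for that composite key might look like the sketch below. The point is to use a deterministic hash (not Python's salted built-in `hash()`) so every gateway instance routes the same key to the same shard; Redis Cluster does the equivalent internally with CRC16 key slots:

```python
import hashlib

def shard_for(api_key: str, model: str, num_shards: int) -> int:
    """Deterministic shard choice for the composite (api_key, model) identity."""
    digest = hashlib.sha1(f"{api_key}:{model}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```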

flowchart LR
  C[Client] --> GW[API Gateway / Envoy]
  GW --> RL{Rate limiter sidecar}
  RL -->|EVALSHA token_bucket.lua| R[(Redis Cluster, sharded by key)]
  RL -->|allow| APP[App / LLM inference]
  RL -->|deny 429 + Retry-After| C
  APP --> M[Metrics: tokens_used]
  M -->|async decrement| R

Consistency trade-off: global exactness requires a central Redis — a failure domain. Cell-based rate limiters approximate by giving each cell a share (e.g., 1/8 of the quota) and reconciling asynchronously. AWS API Gateway uses this.
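The static split is trivial; the only subtlety is distributing the remainder so the cell shares sum exactly to the global limit. A sketch (the async reconciliation that shifts unused share between cells is out of scope here):

```python
def per_cell_quota(global_limit: int, num_cells: int, cell_index: int) -> int:
    """Each cell enforces its share locally; remainder goes to the first cells
    so shares always sum to global_limit."""
    base, rem = divmod(global_limit, num_cells)
    return base + (1 if cell_index < rem else 0)
```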

Token-aware limits for LLM APIs

LLM inference is priced and capacity-planned by tokens, not requests. Both Anthropic and OpenAI publish two independent limits per model:

  • RPM — requests per minute (cheap to check at gateway).
  • TPM — tokens per minute = input + output. You cannot check this precisely before generation because output length is unknown.

The industry pattern is pessimistic debit, refund on completion:

  1. Count input tokens (tokenizer runs in <1ms for 4k tokens).
  2. Reserve input_tokens + max_output_tokens against the TPM bucket.
  3. If reservation fails → 429 with Retry-After = seconds until bucket refills to the request cost.
  4. After generation, refund (reserved - actual_output_tokens) back to the bucket.
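The four steps above can be sketched as a single-process bucket (production would run the debit and refund against Redis; names here are illustrative):

```python
class TpmBucket:
    """Pessimistic debit, refund on completion, against a per-minute budget."""

    def __init__(self, tpm_limit: int):
        self.remaining = tpm_limit

    def reserve(self, input_tokens: int, max_output_tokens: int) -> bool:
        # Step 2: reserve the worst case before generation starts.
        cost = input_tokens + max_output_tokens
        if cost > self.remaining:
            return False        # step 3: caller returns 429 + Retry-After
        self.remaining -= cost
        return True

    def settle(self, max_output_tokens: int, actual_output_tokens: int) -> None:
        # Step 4: refund the unused part of the pessimistic reservation.
        self.remaining += max_output_tokens - actual_output_tokens
```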

Anthropic-specific

Anthropic exposes anthropic-ratelimit-tokens-remaining, anthropic-ratelimit-requests-remaining, and reset timestamps as response headers. Clients should parse these to do predictive client-side throttling rather than retry-until-429. Messages API also splits input/output TPM on some tiers.
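Predictive throttling from those headers might look like the sketch below. One simplifying assumption is labeled in the code: the reset value is treated as seconds-from-now, whereas the real header carries a timestamp you would first convert:

```python
def client_delay(headers: dict, reserve_tokens: int) -> float:
    """Decide how long to wait before sending, from rate-limit headers.
    Assumption: reset value already converted to seconds-from-now."""
    remaining = int(headers.get("anthropic-ratelimit-tokens-remaining", "0"))
    reset_s = float(headers.get("anthropic-ratelimit-tokens-reset", "0"))
    if remaining >= reserve_tokens:
        return 0.0    # enough budget: send immediately
    return reset_s    # sleep instead of burning a 429 round trip
```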

OpenAI-specific

OpenAI returns x-ratelimit-* headers and also a Retry-After. The Tier system (Tier 1–5) raises limits based on cumulative spend; new tenants sit at 500 RPM / 30k TPM on GPT-4-class models. Azure OpenAI uses PTU (Provisioned Throughput Units) as a different capacity model — dedicated capacity vs shared pool.

Placement, failure modes, observability

Where does the limiter live?

  • Client SDK — cooperative, prevents wasted RTT; cannot be trusted (bypassed by bad actors).
  • Edge / API gateway (Envoy, Kong, AWS API GW) — authoritative, coarse-grained. Ideal for DDoS shedding.
  • Service mesh sidecar — per-service, uniform policy, centrally managed.
  • Application code — fine-grained (e.g., per-operation inside a service).

Production systems use all four — defense in depth.

Failure modes

  • Redis down. Options: fail-open (availability over quota) or fail-closed (block until Redis returns). Anthropic/OpenAI fail-open with a local cap so a Redis outage doesn't 429 everyone.
  • Clock skew. Use monotonic or server-provided time; never trust client now.
  • Hot key. One whale tenant overwhelms a single Redis shard. Mitigate with key splitting (hash(key) % N sub-buckets, summed) or local token pre-fetching.
  • Retry storm. Clients that ignore Retry-After amplify load. Always return jittered hints.
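One common way to produce those jittered hints is exponential backoff with full jitter, so synchronized clients spread out instead of retrying in lockstep:

```python
import random

def retry_after_with_jitter(base_s: float, attempt: int, cap_s: float = 60.0) -> float:
    """Full-jitter backoff: uniform over [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))
```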

Observability

Emit: rl_allow_total{tenant,model}, rl_deny_total, rl_bucket_utilization. Alert when deny_total/allow_total > 1% for a top-N tenant — usually means their quota needs a bump or they have a bug.

Interview anti-patterns and checklist

Anti-patterns to avoid

  • In-process counter behind a load balancer. Each replica counts independently, so the effective limit is N× the intended one; dividing to limit/N per replica breaks under uneven routing or autoscaling.
  • Unbounded sorted-set log per IP. At 10k RPS one IP stores 600k timestamps in a single key — OOM.
  • Read-modify-write without Lua. Two concurrent requests both read 0 tokens and both allow themselves. Use EVALSHA or WATCH/MULTI/EXEC.
  • Ignoring output tokens. For LLMs, charging only input tokens lets a 16k-output request slip under any input-only limit.
  • Hard-coded limits. Put them in a config service or DB so you can raise Tier-5 customers without a deploy.

Whiteboard checklist (say these out loud):

  1. Identity + unit + hard/soft.
  2. Algorithm pick (sliding window counter is a safe default).
  3. Redis + Lua for atomicity; shard by identity.
  4. Return 429 with Retry-After and rate-limit headers.
  5. Fail-open with local cap on Redis outage.
  6. Token-aware (reserve + refund) for LLM APIs.
  7. Multi-layer: client SDK + edge + sidecar + app.
  8. Metrics: allow/deny/utilization per tenant.
