Requirements and back-of-envelope

A rate limiter throttles traffic per identity (user, API key, IP, tenant, model) to protect downstream services and enforce quota. Start the interview by asking three clarifying questions:

  • What is the identity key? Per-user, per-API-key, per-IP, per-org, or composite (e.g., (org_id, model) for Anthropic's claude-3-5-sonnet RPM).
  • What is the unit? Requests, bytes, tokens, or dollars. LLM APIs enforce both RPM (requests/minute) and TPM (tokens/minute) — they are independent limits.
  • Hard or soft? Reject with HTTP 429 + Retry-After, shed to a cheaper tier, or queue with backpressure.

Scale sanity check: at 1M RPS, a naive per-request database write is impossible. A single Redis node handles ~100k ops/s; you must shard by key. Memory: one counter per (key, window) = ~100 bytes; 10M keys × 100B ≈ 1 GB — comfortable on one replica. Latency budget: <1ms p99 added to every request, which rules out synchronous DB writes.
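The arithmetic above is worth scripting once so the numbers are reproducible. The constants below are this paragraph's estimates, not measurements:

```python
# Back-of-envelope check using the estimates from the text above.
PEAK_RPS = 1_000_000
REDIS_OPS_PER_NODE = 100_000       # rough single-node ceiling
BYTES_PER_COUNTER = 100            # ~100 B per (key, window) entry
ACTIVE_KEYS = 10_000_000

shards_needed = PEAK_RPS // REDIS_OPS_PER_NODE           # shards to absorb peak ops
memory_gb = ACTIVE_KEYS * BYTES_PER_COUNTER / 1e9        # total counter memory

print(shards_needed, memory_gb)   # 10 shards, ~1 GB
```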

Source cross-reference

Alex Xu V1 Ch.4 walks the four algorithms and the Redis implementation sketch. Acing SDI Ch.8 adds the control-plane (quota service) and discusses functional partitioning. ByteByteGo's "Top 5 Rate-Limiting Algorithms" article is the quick visual.

Four canonical algorithms

Memorize these four — any interview follow-up is a variant.

| Algorithm | Idea | Burst | Memory/key | Typical use |
| --- | --- | --- | --- | --- |
| Token bucket | Bucket refills at rate r, capacity b; each request consumes 1 (or N) tokens. | Allows bursts up to b. | 16 B (tokens, last_refill_ts) | Public APIs (Stripe, AWS, Anthropic). |
| Leaky bucket | Fixed-rate queue drains at r; overflows drop. | Strictly smooths. | ~24 B + queue | Outbound (SMS, email gateways). |
| Fixed window counter | Counter per wall-clock minute. | 2× burst at boundary. | 8 B | Cheap, coarse. |
| Sliding window log | Sorted set of exact timestamps, trim old. | Exact. | N × 16 B (N = limit) | Small limits where accuracy matters. |
| Sliding window counter | Weighted blend of current + prior fixed window. | ~5% error. | 16 B | The pragmatic default. |

Token bucket in one formula: on each request tokens = min(b, tokens + (now - last_refill) * r); if tokens >= cost: tokens -= cost; allow else deny. This is the algorithm Anthropic and OpenAI use for published RPM/TPM.
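The formula translates directly into a single-process sketch. This is illustrative only (the `now` parameter is injectable for testing; a distributed version needs the Redis + Lua approach below):

```python
import time

class TokenBucket:
    """Single-process token bucket implementing the refill formula above."""

    def __init__(self, rate, capacity, now=None):
        self.rate = float(rate)          # tokens refilled per second (r)
        self.capacity = float(capacity)  # burst ceiling (b)
        self.tokens = float(capacity)    # start full
        self.last_refill = time.monotonic() if now is None else now

    def allow(self, cost=1.0, now=None):
        now = time.monotonic() if now is None else now
        # tokens = min(b, tokens + (now - last_refill) * r)
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Note the clock is monotonic, never client-supplied — see the clock-skew failure mode later.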

Sliding window counter is the sweet spot for 100k-RPS gateways: count = curr_window + prev_window * (1 - elapsed_fraction) — the current window counted in full, plus the previous window weighted by how much of it still overlaps the sliding window. Two 8-byte counters, no sorted-set cost.
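A minimal sketch of that estimate (current window taken in full; only the previous window is weighted by its remaining overlap):

```python
def sliding_window_count(prev_count, curr_count, now, window_s=60.0):
    """Estimate requests in the trailing window_s seconds from two
    fixed-window counters."""
    elapsed_fraction = (now % window_s) / window_s  # how far into curr window
    return curr_count + prev_count * (1.0 - elapsed_fraction)

def sliding_window_allow(prev_count, curr_count, now, limit, window_s=60.0):
    return sliding_window_count(prev_count, curr_count, now, window_s) < limit
```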

Distributed implementation (Redis + Lua)

Any in-memory counter in a single app server is wrong the moment you autoscale. The standard pattern: Redis with an atomic Lua script.

-- KEYS[1]=bucket_key  ARGV={capacity, refill_rate, now_ms, cost}
local b = redis.call('HMGET', KEYS[1], 't', 'ts')
local tokens = tonumber(b[1]) or tonumber(ARGV[1])
local ts     = tonumber(b[2]) or tonumber(ARGV[3])
local delta  = (tonumber(ARGV[3]) - ts) * tonumber(ARGV[2]) / 1000
tokens = math.min(tonumber(ARGV[1]), tokens + delta)
if tokens < tonumber(ARGV[4]) then return {0, tokens} end
tokens = tokens - tonumber(ARGV[4])
redis.call('HMSET', KEYS[1], 't', tokens, 'ts', ARGV[3])
redis.call('PEXPIRE', KEYS[1], 60000)
return {1, tokens}

Lua runs atomically inside the Redis single thread — no race between HMGET and HMSET. Cost: one RTT (~0.3 ms intra-AZ). Shard by hashing (api_key, model); Redis Cluster handles resharding.
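Computing a stable shard for that composite key might look like the sketch below. The point is to use a deterministic hash (not Python's salted built-in `hash()`) so every gateway instance routes the same key to the same shard; Redis Cluster does the equivalent internally with CRC16 key slots:

```python
import hashlib

def shard_for(api_key: str, model: str, num_shards: int) -> int:
    """Deterministic shard choice for the composite (api_key, model) identity."""
    digest = hashlib.sha1(f"{api_key}:{model}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```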

flowchart LR
  C[Client] --> GW[API Gateway / Envoy]
  GW --> RL{Rate limiter sidecar}
  RL -->|EVALSHA token_bucket.lua| R[(Redis Cluster, sharded by key)]
  RL -->|allow| APP[App / LLM inference]
  RL -->|deny 429 + Retry-After| C
  APP --> M[Metrics: tokens_used]
  M -->|async decrement| R

Consistency trade-off: global exactness requires a central Redis — a failure domain. Cell-based rate limiters approximate by giving each cell a share (e.g., 1/8 of the quota) and reconciling asynchronously. AWS API Gateway uses this.
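The static split is trivial; the only subtlety is distributing the remainder so the cell shares sum exactly to the global limit. A sketch (the async reconciliation that shifts unused share between cells is out of scope here):

```python
def per_cell_quota(global_limit: int, num_cells: int, cell_index: int) -> int:
    """Each cell enforces its share locally; remainder goes to the first cells
    so shares always sum to global_limit."""
    base, rem = divmod(global_limit, num_cells)
    return base + (1 if cell_index < rem else 0)
```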

Token-aware limits for LLM APIs

LLM inference is priced and capacity-planned by tokens, not requests. Both Anthropic and OpenAI publish two independent limits per model:

  • RPM — requests per minute (cheap to check at gateway).
  • TPM — tokens per minute = input + output. You cannot check this precisely before generation because output length is unknown.

The industry pattern is pessimistic debit, refund on completion:

  1. Count input tokens (tokenizer runs in <1ms for 4k tokens).
  2. Reserve input_tokens + max_output_tokens against the TPM bucket.
  3. If reservation fails → 429 with Retry-After = seconds until bucket refills to the request cost.
  4. After generation, refund (reserved - actual_output_tokens) back to the bucket.
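The four steps above can be sketched as a single-process bucket (production would run the debit and refund against Redis; names here are illustrative):

```python
class TpmBucket:
    """Pessimistic debit, refund on completion, against a per-minute budget."""

    def __init__(self, tpm_limit: int):
        self.remaining = tpm_limit

    def reserve(self, input_tokens: int, max_output_tokens: int) -> bool:
        # Step 2: reserve the worst case before generation starts.
        cost = input_tokens + max_output_tokens
        if cost > self.remaining:
            return False        # step 3: caller returns 429 + Retry-After
        self.remaining -= cost
        return True

    def settle(self, max_output_tokens: int, actual_output_tokens: int) -> None:
        # Step 4: refund the unused part of the pessimistic reservation.
        self.remaining += max_output_tokens - actual_output_tokens
```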

Anthropic-specific

Anthropic exposes anthropic-ratelimit-tokens-remaining, anthropic-ratelimit-requests-remaining, and reset timestamps as response headers. Clients should parse these to do predictive client-side throttling rather than retry-until-429. Messages API also splits input/output TPM on some tiers.
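Predictive throttling from those headers might look like the sketch below. One simplifying assumption is labeled in the code: the reset value is treated as seconds-from-now, whereas the real header carries a timestamp you would first convert:

```python
def client_delay(headers: dict, reserve_tokens: int) -> float:
    """Decide how long to wait before sending, from rate-limit headers.
    Assumption: reset value already converted to seconds-from-now."""
    remaining = int(headers.get("anthropic-ratelimit-tokens-remaining", "0"))
    reset_s = float(headers.get("anthropic-ratelimit-tokens-reset", "0"))
    if remaining >= reserve_tokens:
        return 0.0    # enough budget: send immediately
    return reset_s    # sleep instead of burning a 429 round trip
```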

OpenAI-specific

OpenAI returns x-ratelimit-* headers and also a Retry-After. The Tier system (Tier 1–5) raises limits based on cumulative spend; new tenants sit at 500 RPM / 30k TPM on GPT-4-class models. Azure OpenAI uses PTU (Provisioned Throughput Units) as a different capacity model — dedicated capacity vs shared pool.

Placement, failure modes, observability

Where does the limiter live?

  • Client SDK — cooperative, prevents wasted RTT; cannot be trusted (bypassed by bad actors).
  • Edge / API gateway (Envoy, Kong, AWS API GW) — authoritative, coarse-grained. Ideal for DDoS shedding.
  • Service mesh sidecar — per-service, uniform policy, centrally managed.
  • Application code — fine-grained (e.g., per-operation inside a service).

Production systems use all four — defense in depth.

Failure modes

  • Redis down. Options: fail-open (availability over quota) or fail-closed (block until Redis returns). Anthropic/OpenAI fail-open with a local cap so a Redis outage doesn't 429 everyone.
  • Clock skew. Use monotonic or server-provided time; never trust client now.
  • Hot key. One whale tenant overwhelms a single Redis shard. Mitigate with key splitting (hash(key) % N sub-buckets, summed) or local token pre-fetching.
  • Retry storm. Clients that ignore Retry-After amplify load. Always return jittered hints.
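One common way to produce those jittered hints is exponential backoff with full jitter, so synchronized clients spread out instead of retrying in lockstep:

```python
import random

def retry_after_with_jitter(base_s: float, attempt: int, cap_s: float = 60.0) -> float:
    """Full-jitter backoff: uniform over [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))
```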

Observability

Emit: rl_allow_total{tenant,model}, rl_deny_total, rl_bucket_utilization. Alert when deny_total/allow_total > 1% for a top-N tenant — usually means their quota needs a bump or they have a bug.

Interview anti-patterns and checklist

Anti-patterns to avoid

  • In-process counter behind a load balancer. Each replica counts independently, so the effective limit is N× the intended one; dividing to limit/N per replica breaks under uneven routing or autoscaling.
  • Unbounded sorted-set log per IP. At 10k RPS one IP stores 600k timestamps in a single key — OOM.
  • Read-modify-write without Lua. Two concurrent requests both read 0 tokens and both allow themselves. Use EVALSHA or WATCH/MULTI/EXEC.
  • Ignoring output tokens. For LLMs, charging only input tokens lets a 16k-output request slip under any input-only limit.
  • Hard-coded limits. Put them in a config service or DB so you can raise Tier-5 customers without a deploy.

Whiteboard checklist (say these out loud):

  1. Identity + unit + hard/soft.
  2. Algorithm pick (sliding window counter is a safe default).
  3. Redis + Lua for atomicity; shard by identity.
  4. Return 429 with Retry-After and rate-limit headers.
  5. Fail-open with local cap on Redis outage.
  6. Token-aware (reserve + refund) for LLM APIs.
  7. Multi-layer: client SDK + edge + sidecar + app.
  8. Metrics: allow/deny/utilization per tenant.
