Requirements and back-of-envelope
A rate limiter throttles traffic per identity (user, API key, IP, tenant, model) to protect downstream services and enforce quota. Open the interview with three clarifying questions:
- What is the identity key? Per-user, per-API-key, per-IP, per-org, or composite (e.g., `(org_id, model)` for Anthropic's `claude-3-5-sonnet` RPM).
- What is the unit? Requests, bytes, tokens, or dollars. LLM APIs enforce both RPM (requests/minute) and TPM (tokens/minute) — they are independent limits.
- Hard or soft? Reject with HTTP 429 + `Retry-After`, shed to a cheaper tier, or queue with backpressure.
Scale sanity check: at 1M RPS, a naive per-request database write is impossible. A single Redis node handles ~100k ops/s; you must shard by key. Memory: one counter per (key, window) = ~100 bytes; 10M keys × 100B ≈ 1 GB — comfortable on one replica. Latency budget: <1ms p99 added to every request, which rules out synchronous DB writes.
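The arithmetic is worth checking out loud; a throwaway sketch (constants are the estimates from this paragraph):

```python
# Back-of-envelope check of the estimates above.
KEYS = 10_000_000             # active (key, window) counters
BYTES_PER_KEY = 100           # counter fields plus Redis overhead, rough
REDIS_OPS_PER_NODE = 100_000  # single-node throughput ceiling
TARGET_RPS = 1_000_000

memory_gb = KEYS * BYTES_PER_KEY / 1e9         # total counter memory
min_shards = TARGET_RPS // REDIS_OPS_PER_NODE  # one limiter op per request

print(f"{memory_gb:.1f} GB across >= {min_shards} shards")  # → 1.0 GB across >= 10 shards
```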
Source cross-reference
Alex Xu V1 Ch.4 walks the four algorithms and the Redis implementation sketch. Acing SDI Ch.8 adds the control-plane (quota service) and discusses functional partitioning. ByteByteGo's "Top 5 Rate-Limiting Algorithms" article is the quick visual.
Four canonical algorithms
Memorize these four — any interview follow-up is a variant.
| Algorithm | Idea | Burst | Memory/key | Typical use |
|---|---|---|---|---|
| Token bucket | Bucket refills at rate r, capacity b; each request consumes 1 (or N) tokens. | Allows bursts up to b. | 16 B (tokens, last_refill_ts) | Public APIs (Stripe, AWS, Anthropic). |
| Leaky bucket | Fixed-rate queue drains at r; overflows drop. | Strictly smooths. | ~24 B + queue | Outbound (SMS, email gateways). |
| Fixed window counter | Counter per wall-clock minute. | 2× burst at boundary. | 8 B | Cheap, coarse. |
| Sliding window log | Sorted set of exact timestamps, trim old. | Exact. | N × 16 B (N = limit) | Small limits where accuracy matters. |
| Sliding window counter | Weighted blend of current + prior fixed window. | ~5% error. | 16 B | The pragmatic default. |
Token bucket in one formula: on each request `tokens = min(b, tokens + (now - last_refill) * r); if tokens >= cost: tokens -= cost; allow else deny`. This is the algorithm Anthropic and OpenAI use for published RPM/TPM.
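A minimal in-process sketch of that formula (illustrative Python; the distributed version is the Redis + Lua script later in this section):

```python
import time

class TokenBucket:
    """Single-process token bucket implementing the refill formula above."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate           # tokens refilled per second (r)
        self.capacity = capacity   # burst size (b)
        self.tokens = capacity     # start full
        self.last_refill = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(rate=5, capacity=10)   # 5 tokens/s, bursts up to 10
results = [bucket.allow() for _ in range(12)]
print(results.count(True))  # → 10 (burst drained, refill is negligible in-loop)
```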
Sliding window counter is the sweet spot for 100k-RPS gateways: `count = curr_window + prev_window * (1 - elapsed_fraction)` — the current window counts in full, the previous window is weighted by how much of it still overlaps the trailing window. Two 8-byte counters, no sorted-set cost.
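As a sketch (using the common convention that the current window counts in full and the previous window is weighted by its remaining overlap):

```python
def sliding_window_count(curr: int, prev: int, elapsed_fraction: float) -> float:
    """Estimate requests in the trailing window: the current fixed window
    counts in full; the previous one is weighted by its remaining overlap."""
    return curr + prev * (1.0 - elapsed_fraction)

# 30 s into a 60 s window: 40 requests so far, 100 in the previous minute
print(sliding_window_count(40, 100, 0.5))  # → 90.0
```

The ~5% error in the table comes from assuming the previous window's requests were evenly spread; that is the accuracy you trade for two counters instead of a timestamp log.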
Distributed implementation (Redis + Lua)
Any in-memory counter in a single app server is wrong the moment you autoscale. The standard pattern: Redis with an atomic Lua script.
```lua
-- KEYS[1] = bucket_key
-- ARGV    = {capacity, refill_rate_per_s, now_ms, cost}
local b = redis.call('HMGET', KEYS[1], 't', 'ts')
local tokens = tonumber(b[1]) or tonumber(ARGV[1])  -- missing key: full bucket
local ts = tonumber(b[2]) or tonumber(ARGV[3])
local delta = (tonumber(ARGV[3]) - ts) * tonumber(ARGV[2]) / 1000
tokens = math.min(tonumber(ARGV[1]), tokens + delta)
if tokens < tonumber(ARGV[4]) then return {0, tokens} end  -- deny, no state change
tokens = tokens - tonumber(ARGV[4])
redis.call('HMSET', KEYS[1], 't', tokens, 'ts', ARGV[3])
redis.call('PEXPIRE', KEYS[1], 60000)  -- GC idle buckets after a minute
return {1, tokens}
```
Lua runs atomically inside the Redis single thread — no race between HMGET and HMSET. Cost: one RTT (~0.3 ms intra-AZ). Shard by hashing (api_key, model); Redis Cluster handles resharding.
```mermaid
flowchart LR
    C[Client] --> GW[API Gateway / Envoy]
    GW --> RL{Rate limiter sidecar}
    RL -->|EVALSHA token_bucket.lua| R[(Redis Cluster, sharded by key)]
    RL -->|allow| APP[App / LLM inference]
    RL -->|deny 429 + Retry-After| C
    APP --> M[Metrics: tokens_used]
    M -->|async decrement| R
```
Consistency trade-off: global exactness requires a central Redis — a failure domain. Cell-based rate limiters approximate by giving each cell a share (e.g., 1/8 of the quota) and reconciling asynchronously. AWS API Gateway uses this.
Token-aware limits for LLM APIs
LLM inference is priced and capacity-planned by tokens, not requests. Both Anthropic and OpenAI publish two independent limits per model:
- RPM — requests per minute (cheap to check at gateway).
- TPM — tokens per minute = input + output. You cannot check this precisely before generation because output length is unknown.
The industry pattern is pessimistic debit, refund on completion:
- Count input tokens (the tokenizer runs in <1 ms for 4k tokens).
- Reserve `input_tokens + max_output_tokens` against the TPM bucket.
- If the reservation fails → 429 with `Retry-After` = seconds until the bucket refills to the request cost.
- After generation, refund `(reserved - actual_output_tokens)` back to the bucket.
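The reserve/refund steps above can be sketched as (a minimal sketch: per-minute refill is omitted, and all names are illustrative, not a real API):

```python
class TpmBucket:
    """Pessimistic-debit TPM accounting: reserve the worst case up front,
    refund the unused part once actual output length is known."""
    def __init__(self, tpm_limit: int):
        self.remaining = tpm_limit

    def reserve(self, input_tokens: int, max_output_tokens: int) -> bool:
        cost = input_tokens + max_output_tokens  # pessimistic: assume max output
        if cost > self.remaining:
            return False                         # caller replies 429 + Retry-After
        self.remaining -= cost
        return True

    def refund(self, max_output_tokens: int, actual_output_tokens: int) -> None:
        # return the unused slice of the reservation after generation
        self.remaining += max_output_tokens - actual_output_tokens

bucket = TpmBucket(tpm_limit=30_000)
assert bucket.reserve(input_tokens=4_000, max_output_tokens=8_000)
bucket.refund(max_output_tokens=8_000, actual_output_tokens=1_500)
print(bucket.remaining)  # → 24500
```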
Anthropic-specific
Anthropic exposes anthropic-ratelimit-tokens-remaining, anthropic-ratelimit-requests-remaining, and reset timestamps as response headers. Clients should parse these to do predictive client-side throttling rather than retry-until-429. Messages API also splits input/output TPM on some tiers.
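Predictive throttling then reduces to a header check before each send; a sketch using only the remaining-tokens header quoted above (treat exact header names and formats as something to confirm against current Anthropic docs):

```python
def should_throttle(headers: dict, needed_tokens: int) -> bool:
    """Decide to pause client-side instead of retrying until a 429,
    based on the server-reported remaining token budget."""
    remaining = int(headers.get("anthropic-ratelimit-tokens-remaining", "0"))
    return remaining < needed_tokens

# budget too low for a ~4k-token request: back off until the reset timestamp
print(should_throttle({"anthropic-ratelimit-tokens-remaining": "512"}, 4_096))  # → True
```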
OpenAI-specific
OpenAI returns x-ratelimit-* headers and also a Retry-After. The Tier system (Tier 1–5) raises limits based on cumulative spend; new tenants sit at 500 RPM / 30k TPM on GPT-4-class models. Azure OpenAI uses PTU (Provisioned Throughput Units) as a different capacity model — dedicated capacity vs shared pool.
Placement, failure modes, observability
Where does the limiter live?
- Client SDK — cooperative, prevents wasted RTT; cannot be trusted (bypassed by bad actors).
- Edge / API gateway (Envoy, Kong, AWS API GW) — authoritative, coarse-grained. Ideal for DDoS shedding.
- Service mesh sidecar — per-service, uniform policy, centrally managed.
- Application code — fine-grained (e.g., per-operation inside a service).
Production systems use all four — defense in depth.
Failure modes
- Redis down. Options: fail-open (availability over quota) or fail-closed (block until Redis returns). Anthropic/OpenAI fail-open with a local cap so a Redis outage doesn't 429 everyone.
- Clock skew. Use monotonic or server-provided time; never trust a client-supplied `now`.
- Hot key. One whale tenant overwhelms a single Redis shard. Mitigate with key splitting (`hash(key) % N` sub-buckets, summed) or local token pre-fetching.
- Retry storm. Clients that ignore `Retry-After` amplify load. Always return jittered hints.
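A jittered hint falls straight out of the bucket's deficit; a sketch (function name and jitter fraction are illustrative choices):

```python
import math
import random

def retry_after_hint(deficit_tokens: float, refill_rate_per_s: float,
                     jitter_frac: float = 0.2) -> int:
    """Whole seconds until the bucket refills enough for this request, with
    random jitter added so synchronized clients don't retry in lockstep."""
    base = deficit_tokens / refill_rate_per_s
    return max(1, math.ceil(base * (1.0 + random.uniform(0.0, jitter_frac))))

# 50 tokens short at 10 tokens/s: hint lands in [5, 6] seconds
hint = retry_after_hint(50, 10)
```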
Observability
Emit: `rl_allow_total{tenant,model}`, `rl_deny_total`, `rl_bucket_utilization`. Alert when `deny_total/allow_total` > 1% for a top-N tenant — usually it means their quota needs a bump or they have a bug.
Interview anti-patterns and checklist
Anti-patterns to avoid
- In-process counter behind a load balancer. Each replica gets `limit/N`, and bursts are 2× at window boundaries.
- Unbounded sorted-set log per IP. At 10k RPS one IP stores 600k timestamps in a single key — OOM.
- Read-modify-write without Lua. Two concurrent requests read the same token count and both allow themselves. Use EVALSHA or `WATCH`/`MULTI`/`EXEC`.
- Ignoring output tokens. For LLMs, charging only input tokens lets a 16k-output request slip under any input-only limit.
- Hard-coded limits. Put them in a config service or DB so you can raise Tier-5 customers' limits without a deploy.
Whiteboard checklist (say these out loud):
- Identity + unit + hard/soft.
- Algorithm pick (sliding window counter is a safe default).
- Redis + Lua for atomicity; shard by identity.
- Return 429 with `Retry-After` and rate-limit headers.
- Fail-open with a local cap on Redis outage.
- Token-aware (reserve + refund) for LLM APIs.
- Multi-layer: client SDK + edge + sidecar + app.
- Metrics: allow/deny/utilization per tenant.