The constants table

Memorize the following. In an interview you will not have time to derive them; Alex Xu V1 Ch.2 and the ByteByteGo latency cheat-sheet both publish near-identical versions because they all descend from Jeff Dean's "Numbers Everyone Should Know" (2009, updated by Colin Scott for 2020 hardware).

Operation | Latency | Mental anchor
--- | --- | ---
L1 cache hit | 0.5 ns | free
Branch mispredict | 5 ns | free
L2 cache hit | 7 ns | free
Mutex lock/unlock | 25 ns | free
Main memory reference | 100 ns | 100× L1
Compress 1 KB w/ Zstd | 2 µs | 20× memory ref
Send 2 KB over 1 Gbps | 20 µs | network floor
SSD random read (4 KB) | 100 µs | 1000× memory ref
Read 1 MB sequentially from SSD | 250 µs | ~4 GB/s SSD BW
Round-trip within same DC | 500 µs | half a millisecond
Read 1 MB from network | 1 ms | ~1 GB/s DC link
Disk seek (HDD) | 10 ms | legacy only
TLS handshake (new) | 50–100 ms | RTT × 2
California → Netherlands RTT | 150 ms | speed-of-light floor
GPU H100 FLOP/s (BF16) | 989 TFLOP/s | ~1 PFLOP/s nominal
H100 HBM3 bandwidth | 3.35 TB/s | memory-bound regime

Source cross-reference

Rows 1–14 follow Alex Xu V1 Ch.2 almost verbatim; the two GPU rows come from NVIDIA H100 datasheets and are essential for any LLM-serving question. For the visual version, consult the ByteByteGo "Latency Numbers Every Programmer Should Know" sheet.

Units and powers of two

Mix-ups here lose interviews. Commit these to muscle memory:

  • 2^10 ≈ 10^3 = 1 K (kilo)
  • 2^20 ≈ 10^6 = 1 M (mega)
  • 2^30 ≈ 10^9 = 1 G (giga)
  • 2^40 ≈ 10^12 = 1 T (tera)
  • 2^50 ≈ 10^15 = 1 P (peta)
  • 86 400 s/day ≈ 10^5; 2.6 × 10^6 s/month; 3.15 × 10^7 s/year.
  • 1 year ≈ π × 10^7 seconds (a useful physicist trick).

Corollary: if something happens "once per second per user" and you have 10 M users, that is 10 M/s average. Peak is usually 2–5× average for consumer apps; for B2B with business-hour traffic, peak-to-avg ratio is often 8–10×.
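The corollary above can be sketched as a two-line helper. A minimal sketch: `avg_qps` and `peak_qps` are illustrative names, not any library's API, and the peak multipliers are the rules of thumb stated in the text.

```python
SECONDS_PER_DAY = 86_400  # ~1e5 for mental math

def avg_qps(dau: float, actions_per_day: float) -> float:
    """Average queries/sec from daily active users and per-user daily actions."""
    return dau * actions_per_day / SECONDS_PER_DAY

def peak_qps(dau: float, actions_per_day: float, multiplier: float = 3.0) -> float:
    """Peak QPS: 2-5x average for consumer apps, 8-10x for business-hour B2B."""
    return avg_qps(dau, actions_per_day) * multiplier

# "Once per second per user" = 86,400 actions/day, so 10 M users -> 10 M/s average.
```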

QPS and storage worked examples

Example 1 — Chat app (Alex Xu Ch.12 style)

500 M DAU, 40 messages/day each. Daily writes = 2 × 10^10. Average write QPS = 2 × 10^10 / 10^5 = 2 × 10^5 = 200 k/s. Peak (3×) ≈ 600 k/s. Reads are fan-out: average 3 recipients per message → 600 k/s writes amplify to 1.8 M/s reads. Storage: 200 bytes/message × 2 × 10^10 = 4 TB/day = 1.46 PB/year. With 3× replication and 30% overhead → ~6 PB/year actually provisioned.
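The chat-app arithmetic, reproduced as a checkable sketch; every constant is one of the assumptions stated above, not measured data.

```python
DAU = 500_000_000
MSGS_PER_USER_PER_DAY = 40
BYTES_PER_MSG = 200        # assumed average message size
FANOUT = 3                 # assumed average recipients per message

daily_writes = DAU * MSGS_PER_USER_PER_DAY          # 2e10 messages/day
avg_write_qps = daily_writes / 1e5                  # 86,400 s/day rounded to 1e5
peak_write_qps = 3 * avg_write_qps                  # 3x peak multiplier
read_qps = FANOUT * peak_write_qps                  # fan-out amplification
tb_per_day = daily_writes * BYTES_PER_MSG / 1e12    # 4 TB/day
pb_per_year = tb_per_day * 365 / 1000               # ~1.46 PB/year raw
pb_provisioned = pb_per_year * 3 * 1.3              # RF=3 plus 30% overhead -> ~6 PB
```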

Example 2 — URL shortener (Alex Xu Ch.8)

100 M new URLs/month → 10^8 / (2.6 × 10^6 s) ≈ 40 writes/sec average. 10:1 read-to-write → 400 reads/sec. This is a single-Postgres problem, not a shard-from-day-one problem. 10 years × 100 M/month = 12 B rows × 500 bytes ≈ 6 TB. Fits on one NVMe.
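Same arithmetic as a sketch. Note that 100 M/month sustained for 10 years is 12 B rows, which at an assumed 500 bytes/row is ~6 TB — still one NVMe drive.

```python
urls_per_month = 100_000_000
seconds_per_month = 2.6e6
bytes_per_row = 500        # assumed: short code + long URL + metadata

write_qps = urls_per_month / seconds_per_month       # ~38/s average
read_qps = 10 * write_qps                            # 10:1 read-to-write ratio
rows_10_years = urls_per_month * 12 * 10             # 12 B rows
storage_tb = rows_10_years * bytes_per_row / 1e12    # ~6 TB total
```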

Example 3 — Photo upload (Flickr / Acing Ch.12)

10 M uploads/day, 3 MB original + 5 thumbnail sizes (200 KB combined). Write bandwidth: 10^7 × 3.2 MB / 86 400 s ≈ 370 MB/s ingress. Storage: 32 TB/day → 11.7 PB/year (before replication). Glacier/cold tier at $0.004/GB/mo → ~$47 k/month for 11.7 PB in cold.
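The photo-upload numbers as a sketch; the 3.2 MB/upload and $0.004/GB/month cold-tier price are the assumptions from the text, not quoted pricing.

```python
uploads_per_day = 10_000_000
mb_per_upload = 3.2              # 3 MB original + 0.2 MB of thumbnails
sec_per_day = 86_400
cold_usd_per_gb_month = 0.004    # assumed cold-tier price

ingress_mb_s = uploads_per_day * mb_per_upload / sec_per_day   # ~370 MB/s
tb_per_day = uploads_per_day * mb_per_upload / 1e6             # 32 TB/day
pb_per_year = tb_per_day * 365 / 1000                          # ~11.7 PB/year
cold_usd_month = pb_per_year * 1e6 * cold_usd_per_gb_month     # ~$47 k/month
```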

Bandwidth and egress costs

In 2025 on AWS us-east-1, egress is ~$0.09/GB for the first 10 TB/month, dropping to ~$0.05/GB past 150 TB/month. Serving 1 PB of video out of S3 raw is ~$50 k/month in egress alone, which is why YouTube/Netflix-scale systems use peered CDN egress at $0.002–0.01/GB (50–100× cheaper). This is a concrete number interviewers probe for: always mention CDN offload for any read-heavy blob workload.

  • 1 Gbps = 125 MB/s = 10.8 TB/day.
  • 10 Gbps = ~108 TB/day (a fat server NIC).
  • 100 Gbps (modern DC spine) = ~1.08 PB/day per link.
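A minimal converter for the bullets above, plus the egress arithmetic. The $0.05/GB rate is the bulk-tier assumption from the text; `gbps_to_tb_per_day` and `egress_usd` are illustrative names.

```python
SEC_PER_DAY = 86_400

def gbps_to_tb_per_day(gbps: float) -> float:
    """Daily link volume: Gbps -> GB/s (divide by 8) -> GB/day -> TB/day."""
    return gbps / 8 * SEC_PER_DAY / 1000

def egress_usd(gb: float, usd_per_gb: float = 0.05) -> float:
    """Monthly egress bill at a flat per-GB rate (assumed bulk tier)."""
    return gb * usd_per_gb

# 1 Gbps -> 10.8 TB/day; 100 Gbps -> 1,080 TB/day (~1.08 PB).
# 1 PB out of S3 at the bulk tier -> ~$50 k/month.
```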

Anti-pattern

Forgetting egress math in video/audio/LLM-streaming designs. A candidate who designs "YouTube" without acknowledging that raw S3 egress would cost more than the entire revenue of a small YouTube clone has failed the cost-awareness check.

LLM-token economics

This is the OpenAI/Anthropic-specific layer that the classic books do not cover. Memorize:

  • 1 English token ≈ 0.75 words ≈ 4 characters.
  • A 2000-word document ≈ 2700 tokens.
  • GPT-4-class dense model at 70 B parameters, BF16: ~140 GB weights → fits on 2× H100 (80 GB each) with KV cache.
  • Prefill is compute-bound (FLOPs ≈ 2 × params × prompt_tokens); decode is memory-bandwidth-bound (bytes moved ≈ params_bytes per output token).
  • On H100 BF16: decode throughput ≈ HBM_BW / params_bytes = 3.35 TB/s ÷ 140 GB ≈ 24 tokens/sec per replica, single-stream. With continuous batching (vLLM-style), aggregate throughput rises 10–50×.
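The prefill/decode split above reduces to two formulas. This sketch uses the rough cost model from the bullets (2 FLOPs per parameter per prompt token; one full weight read per decoded token), not a measured benchmark.

```python
def prefill_flops(params: float, prompt_tokens: int) -> float:
    """Compute-bound phase: ~2 FLOPs per parameter per prompt token."""
    return 2 * params * prompt_tokens

def decode_tokens_per_s(hbm_bytes_per_s: float, weight_bytes: float) -> float:
    """Memory-bound phase: a single stream moves all weights once per token."""
    return hbm_bytes_per_s / weight_bytes

# 70 B params in BF16 = 140 GB of weights; H100 HBM3 = 3.35 TB/s.
single_stream_tps = decode_tokens_per_s(3.35e12, 140e9)   # ~24 tokens/s
```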

Worked example — 100 k RPS LLM service

See the 100k RPS LLM arena question. Assume 2 k input + 500 output tokens per request, 8 B model. Prefill FLOPs per request ≈ 2 × 8e9 × 2000 = 3.2 × 10^13. At an effective 300 TFLOP/s per H100, that is ~100 ms of prefill per request. Decode at 500 tokens × 40 ms/token (batched) = 20 s/request. To serve 100 k RPS you need ~2 M concurrent decodes → with continuous batching at 256 concurrency/GPU → ~8000 H100s. At $2/hr on-demand that is $16 k/hour ≈ $140 M/year just in GPUs. This is why quantization + speculative decoding + MoE routing are not optional at OpenAI/Anthropic scale.
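The fleet-sizing chain is Little's law plus division. A sketch under the stated assumptions (40 ms/token batched decode, 256-way concurrency per GPU, $2/GPU-hour):

```python
rps = 100_000
output_tokens = 500
ms_per_token = 40            # assumed batched decode latency
conc_per_gpu = 256           # assumed continuous-batching concurrency per GPU
usd_per_gpu_hour = 2

decode_s = output_tokens * ms_per_token / 1000      # 20 s per request
in_flight = rps * decode_s                          # Little's law: L = lambda * W
gpus = in_flight / conc_per_gpu                     # ~7,800 H100s
usd_per_year = gpus * usd_per_gpu_hour * 24 * 365   # ~$137 M/year, GPUs only
```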

OpenAI-specific

OpenAI interviewers will drill on batching vs latency trade: explain the Pareto frontier between per-user tokens/sec and cluster tokens/sec/$, and why they ship different SKUs (Turbo vs Standard) along that curve.

Anthropic-specific

Anthropic cares about the context-length tail: a 200 k-token Claude request has prefill FLOPs 100× a 2 k prompt; KV cache alone is ~40 GB at BF16 for a 70 B model. Mention chunked prefill, paged attention, and prefix-caching for shared system prompts.
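KV-cache size falls straight out of the model geometry. This sketch assumes an illustrative Llama-70B-like shape (80 layers, 8 KV heads via GQA, head_dim 128); the exact footprint varies with head layout and KV dtype, so figures like the ~40 GB above correspond to a smaller per-token footprint than this geometry gives.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """Per-sequence KV cache: K and V (factor of 2) per layer per token."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * seq_len

# Assumed geometry: 80 layers, 8 KV heads (GQA), head_dim 128, BF16 cache.
gb_200k = kv_cache_bytes(80, 8, 128, 200_000) / 1e9   # ~66 GB at 200 k tokens
```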

Common mistakes and sanity checks

  • Off-by-1000 on bytes vs bits. 1 Gbps is 125 MB/s, not 1 GB/s.
  • Ignoring replication overhead. Usable storage ≈ raw / 3 for standard RF=3.
  • Confusing DAU with QPS. 100 M DAU at 10 actions/day is 11 k/s average, not 100 k/s.
  • Forgetting peak multiplier. Peak/avg = 2–5× consumer, 8–10× B2B.
  • Treating RTT as one-way. RTT already includes the return trip, and a fresh connection pays TCP + TLS handshakes before any data moves, so query + response over new TLS is 2–3 RTTs minimum.
  • Quoting HDD seek times (10 ms) for a modern NVMe design. 2008 called.

Sanity check: if your storage-per-year number is bigger than Google's total disclosed capacity, you miscounted a factor of 10. If your QPS is larger than global internet packet rates (~10 B/s), same.

"I'd rather be approximately right than precisely wrong." — John Maynard Keynes, quoted by every staff engineer doing whiteboard math.
