The constants table
Memorize the following. In an interview you will not have time to derive them; Alex Xu V1 Ch.2 and the ByteByteGo latency cheat-sheet both publish near-identical versions because they all descend from Jeff Dean's "Numbers Everyone Should Know" (2009, updated by Colin Scott for 2020 hardware).
| Operation | Latency | Mental anchor |
|---|---|---|
| L1 cache hit | 0.5 ns | free |
| Branch mispredict | 5 ns | free |
| L2 cache hit | 7 ns | free |
| Mutex lock/unlock | 25 ns | free |
| Main memory reference | 100 ns | 100× L1 |
| Compress 1 KB w/ Zstd | 2 µs | 20× memory ref |
| Send 2 KB over 1 Gbps | 20 µs | network floor |
| SSD random read (4 KB) | 100 µs | 1000× memory ref |
| Read 1 MB sequentially from SSD | 250 µs | ~4 GB/s SSD BW |
| Round-trip within same DC | 500 µs | half a millisecond |
| Read 1 MB from network | 1 ms | ~1 GB/s DC link |
| Disk seek (HDD) | 10 ms | legacy only |
| TLS handshake (new) | 50–100 ms | RTT × 2 |
| California → Netherlands RTT | 150 ms | speed-of-light floor |
| GPU H100 FLOP/s (BF16) | 989 TFLOP/s | ~1 PFLOP/s nominal |
| H100 HBM3 bandwidth | 3.35 TB/s | memory-bound regime |
Source cross-reference
Rows 1–13 follow Alex Xu V1 Ch.2 almost verbatim; the GPU rows come from NVIDIA H100 datasheets and are essential for any LLM-serving question. For the visual version consult the ByteByteGo "Latency Numbers Every Programmer Should Know" sheet.
Units and powers of two
Mix-ups here lose interviews. Commit these to muscle memory:
- 2^10 ≈ 10^3 = 1 K (kilo)
- 2^20 ≈ 10^6 = 1 M (mega)
- 2^30 ≈ 10^9 = 1 G (giga)
- 2^40 ≈ 10^12 = 1 T (tera)
- 2^50 ≈ 10^15 = 1 P (peta)
- 86 400 s/day ≈ 10^5. 2.6 × 10^6 s/month. 3.15 × 10^7 s/year.
- 1 year ≈ π × 107 seconds (a useful physicist trick).
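These approximations can be sanity-checked in a few lines:

```python
import math

# Seconds per period -- the anchors from the list above.
SEC_PER_DAY = 86_400
SEC_PER_MONTH = 30 * SEC_PER_DAY      # 2.592e6, call it 2.6e6
SEC_PER_YEAR = 365 * SEC_PER_DAY      # 31_536_000, call it 3.15e7

# The physicist's trick: a year is within 0.5% of pi * 1e7 seconds.
print(SEC_PER_YEAR / (math.pi * 1e7))   # ~1.004

# Powers of two vs powers of ten: the gap compounds ~2.4% per factor of 1024.
for n, k in [(10, 3), (20, 6), (30, 9), (40, 12), (50, 15)]:
    print(f"2^{n} / 10^{k} = {2**n / 10**k:.3f}")
```

The drift is 2.4% at 1 K and still under 13% at 1 P, so treating 2^n as 10^k is always safe for whiteboard math.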
Corollary: if something happens "once per second per user" and you have 10 M users, that is 10 M/s average. Peak is usually 2–5× average for consumer apps; for B2B with business-hour traffic, peak-to-avg ratio is often 8–10×.
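The corollary reduces to a one-line helper. A minimal sketch (the 100 M-DAU figures are illustrative and match the DAU-vs-QPS entry in the mistakes list later in this section):

```python
SEC_PER_DAY = 86_400

def avg_qps(users: int, actions_per_user_per_day: float) -> float:
    """Average events/sec for a daily-active-user base."""
    return users * actions_per_user_per_day / SEC_PER_DAY

# "Once per second per user" needs no division: 10 M users -> 10 M/s average.
# The more common interview shape is actions/day:
avg = avg_qps(100_000_000, 10)   # 100 M DAU, 10 actions/day
print(round(avg))                # 11574 -> ~11.6 k/s average
print(round(avg * 5))            # consumer peak at 5x
print(round(avg * 10))           # B2B business-hours peak at 10x
```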
QPS and storage worked examples
Example 1 — Chat app (Alex Xu Ch.12 style)
500 M DAU, 40 messages/day each. Daily writes = 2 × 10^10. Average write QPS = 2 × 10^10 / 10^5 = 2 × 10^5 = 200 k/s. Peak (3×) ≈ 600 k/s. Reads are fan-out: average 3 recipients per message → 600 k/s writes amplify to 1.8 M/s reads. Storage: 200 bytes/message × 2 × 10^10 = 4 TB/day = 1.46 PB/year. With 3× replication and 30% overhead → ~6 PB/year actually provisioned.
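The whole chain fits in a few lines; a sketch reproducing the numbers above (using the text's 10^5 s/day shortcut):

```python
dau, msgs_per_user = 500_000_000, 40
daily_writes = dau * msgs_per_user            # 2e10 messages/day
avg_write_qps = daily_writes / 1e5            # 200 k/s (1e5 s/day shortcut)
peak_write_qps = avg_write_qps * 3            # 600 k/s
read_qps = peak_write_qps * 3                 # 1.8 M/s with 3-recipient fan-out
tb_per_day = daily_writes * 200 / 1e12        # 4 TB/day at 200 bytes/message
pb_per_year = tb_per_day * 365 / 1000         # 1.46 PB/year raw
provisioned = pb_per_year * 3 * 1.3           # RF=3 plus 30% overhead
print(avg_write_qps, peak_write_qps, read_qps)   # 200000.0 600000.0 1800000.0
print(tb_per_day, pb_per_year, round(provisioned, 1))  # 4.0 1.46 5.7
```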
Example 2 — URL shortener (Alex Xu Ch.8)
100 M new URLs/month → 10^8 / (2.6 × 10^6 s) ≈ 40 writes/sec average. 10:1 read-to-write → 400 reads/sec. This is a single-Postgres problem, not a shard-from-day-one problem. 10 years × 100 M/month = 12 B rows × 500 bytes = 6 TB. Fits on one NVMe.
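The same chain as a sketch, using the 2.6 × 10^6 s/month constant from the units list:

```python
SEC_PER_MONTH = 2.6e6

writes = 100_000_000 / SEC_PER_MONTH        # average writes/sec
reads = writes * 10                         # 10:1 read-to-write ratio
rows_10y = 100_000_000 * 12 * 10            # 12 B rows over a decade
storage_tb = rows_10y * 500 / 1e12          # 500 bytes/row
print(round(writes), round(reads))          # 38 385
print(rows_10y, storage_tb)                 # 12000000000 6.0 (TB)
```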
Example 3 — Photo upload (Flickr / Acing Ch.12)
10 M uploads/day, 3 MB original + 5 thumbnail sizes (200 KB combined). Write bandwidth: 10^7 × 3.2 MB / 86 400 s ≈ 370 MB/s ingress. Storage: 32 TB/day → 11.7 PB/year (before replication). Glacier/cold tier at $0.004/GB/mo → ~$47 k/month for 11.7 PB in cold.
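A sketch of the bandwidth and cold-storage math; the $0.004/GB/mo rate is the text's cold-tier figure:

```python
uploads_per_day = 10_000_000
mb_per_upload = 3.2                              # 3 MB original + 200 KB thumbs
ingress_mb_s = uploads_per_day * mb_per_upload / 86_400
tb_per_day = uploads_per_day * mb_per_upload / 1e6
pb_per_year = tb_per_day * 365 / 1000
cold_cost_per_month = pb_per_year * 1e6 * 0.004  # $/GB/mo, PB -> GB
print(round(ingress_mb_s))                       # 370 MB/s
print(tb_per_day, round(pb_per_year, 1))         # 32.0 11.7
print(round(cold_cost_per_month))                # 46720 -> ~$47 k/month
```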
Bandwidth and egress costs
In 2025 on AWS us-east-1, egress is ~$0.09/GB for the first 10 TB/month, dropping to $0.05/GB past 150 TB/month. Serving 1 PB of video out of S3 raw = ~$50 k/month in egress alone, which is why YouTube/Netflix-scale systems use peered CDN egress at $0.002–0.01/GB (50–100× cheaper). This is a concrete number the interviewer will mine for: always mention CDN offload for any read-heavy blob workload.
- 1 Gbps = 125 MB/s = 10.8 TB/day.
- 10 Gbps = ~108 TB/day (a fat server NIC).
- 100 Gbps (modern DC spine) = ~1.08 PB/day per link.
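The bits-to-bytes-to-daily-volume conversion, plus the egress comparison, as a sketch (the $0.05/GB blended S3 rate and $0.005/GB CDN rate are illustrative points inside the ranges quoted above):

```python
def link_tb_per_day(gbps: float) -> float:
    """Line-rate ceiling: bits/sec -> TB/day (decimal units, 8 bits/byte)."""
    return gbps * 1e9 / 8 * 86_400 / 1e12

for g in (1, 10, 100):
    print(g, round(link_tb_per_day(g), 1))   # 10.8 / 108.0 / 1080.0 TB/day

# Egress for 1 PB/month: blended S3 rate vs peered CDN rate.
gb = 1e6
print(gb * 0.05, gb * 0.005)                 # 50000.0 vs 5000.0 dollars
```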
Anti-pattern
Forgetting egress math in video/audio/LLM-streaming designs. A candidate who designs "YouTube" without acknowledging that raw S3 egress would cost more than the entire revenue of a small YouTube clone has failed the cost-awareness check.
LLM-token economics
This is the OpenAI/Anthropic-specific layer that the classic books do not cover. Memorize:
- 1 English token ≈ 0.75 words ≈ 4 characters.
- A 2000-word document ≈ 2700 tokens.
- GPT-4-class dense model at 70 B parameters, BF16: ~140 GB weights → weights alone need 2× H100 (80 GB each), leaving ~20 GB of headroom for KV cache.
- Prefill is compute-bound (FLOPs ≈ 2 × params × prompt_tokens); decode is memory-bandwidth-bound (bytes moved ≈ params_bytes per output token).
- On H100 BF16: decode throughput ≈ HBM_BW / params_bytes = 3.35 TB/s ÷ 140 GB ≈ 24 tokens/sec per replica, single-stream. With continuous batching (vLLM-style), aggregate throughput rises 10–50×.
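The two roofline formulas above as a sketch (3.35 TB/s HBM and a 300 TFLOP/s effective rate, i.e. ~30% of nominal, are the figures this section uses):

```python
def decode_tok_s(params_b: float, hbm_tb_s: float = 3.35,
                 bytes_per_param: int = 2) -> float:
    """Single-stream decode ceiling: each token re-reads every weight byte."""
    weights_gb = params_b * bytes_per_param      # 70 B @ BF16 -> 140 GB
    return hbm_tb_s * 1000 / weights_gb

def prefill_s(params_b: float, prompt_tokens: int,
              tflops_eff: float = 300.0) -> float:
    """Compute-bound prefill: ~2 * params * tokens FLOPs."""
    return 2 * params_b * 1e9 * prompt_tokens / (tflops_eff * 1e12)

print(round(decode_tok_s(70)))            # 24 tok/s, matching the bullet above
print(round(prefill_s(8, 2000) * 1000))   # 107 ms for an 8 B model, 2 k prompt
```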
Worked example — 100 k RPS LLM service
See the 100k RPS LLM arena question. Assume 2 k input + 500 output tokens per request, 8 B model. Prefill FLOPs per request ≈ 2 × 8e9 × 2000 = 3.2 × 10^13. At H100 300 TFLOP/s effective = ~100 ms/request prefill. Decode at 500 tokens × 40 ms/token (batched) = 20 s/request. To serve 100 k RPS, you need ~2 M concurrent decodes → with continuous batching at 256 concurrency/GPU → ~8000 H100s. At $2/hr on-demand that is $16 k/hour = $140 M/year just in GPUs. This is why quantization + speculative decoding + MoE routing are not optional at OpenAI/Anthropic scale.
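A sizing sketch for the fleet math; the assumptions (40 ms/token batched, 256 streams/GPU, $2/hr) come straight from the text:

```python
rps = 100_000
decode_s = 500 * 0.040              # 500 output tokens at 40 ms/token, batched
concurrent = rps * decode_s         # Little's law: in-flight = rate x duration
gpus = concurrent / 256             # continuous batching, 256 streams per GPU
cost_hr = gpus * 2.0                # $2/hr on-demand
print(int(concurrent))              # 2000000 concurrent decodes
print(gpus)                         # 7812.5 -> ~8000 H100s
print(cost_hr)                      # 15625.0 -> ~$16 k/hour
print(round(cost_hr * 8760 / 1e6))  # 137 -> ~$140 M/year
```

The 2 M in-flight requests fall out of Little's law: arrival rate (100 k/s) times residence time (20 s).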
OpenAI-specific
OpenAI interviewers will drill on batching vs latency trade: explain the Pareto frontier between per-user tokens/sec and cluster tokens/sec/$, and why they ship different SKUs (Turbo vs Standard) along that curve.
Anthropic-specific
Anthropic cares about the context-length tail: a 200 k-token Claude request has prefill FLOPs 100× a 2 k prompt, and the KV cache alone runs to tens of GB at BF16 for a 70 B-class model (the exact figure depends on layer count and GQA layout). Mention chunked prefill, paged attention, and prefix caching for shared system prompts.
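A hedged KV-cache sizing sketch. The architecture parameters (80 layers, 8 KV heads of head dim 128, i.e. a Llama-70B-style grouped-query layout) are assumptions for illustration, not figures from the text:

```python
def kv_cache_gb(tokens: int, layers: int = 80, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_el: int = 2) -> float:
    """Per-request KV cache: one K and one V vector per layer per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_el * tokens / 1e9

print(round(kv_cache_gb(200_000)))               # 66 GB with 8-way GQA
print(round(kv_cache_gb(200_000, kv_heads=64)))  # 524 GB with full MHA
```

Whatever the exact layout, a 200 k-token request pins tens to hundreds of GB of HBM per stream, which is exactly why paged attention and prefix caching matter.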
Common mistakes and sanity checks
- Bytes vs bits is an 8× error, not a rounding error. 1 Gbps is 125 MB/s, not 1 GB/s.
- Ignoring replication overhead. Usable storage ≈ raw / 3 for standard RF=3.
- Confusing DAU with QPS. 100 M DAU at 10 actions/day is 11 k/s average, not 100 k/s.
- Forgetting peak multiplier. Peak/avg = 2–5× consumer, 8–10× B2B.
- Treating RTT as one-way. A query + response is one full RTT at minimum, and a fresh TCP + TLS connection adds 2–3 more RTTs before the first response byte.
- Quoting HDD seek times (10 ms) for a modern NVMe design. 2008 called.
Sanity check: if your storage-per-year number is bigger than Google's total disclosed capacity, you miscounted a factor of 10. If your QPS is larger than global internet packet rates (~10 B/s), same.
"I'd rather be approximately right than precisely wrong." — attributed to John Maynard Keynes, and quoted by every staff engineer doing whiteboard math.