Study Guide
Deep, cross-cutting notes on every major topic that OpenAI and Anthropic interviews draw from, synthesised from DDIA, Alex Xu V1, Chip Huyen's DMLS, Agentic Design Patterns, ByteByteGo, Acing the System Design Interview, and Machine Learning System Design Interview.
How to use this guide
Each topic page is self-contained (1,200–2,500 words) and covers principles, trade-offs, concrete numbers, diagrams, and links back to the Arena questions that exercise it. Reading order doesn't matter: start with your weakest area. Every page cites the original book and chapter so you can dive deeper when needed.
① Foundations
Before any domain knowledge, you need a repeatable interview framework and the ability to do rapid numeric estimation on a whiteboard. These two skills separate candidates who "know system design" from those who can actually do it under time pressure.
The Interview Framework
A structure tested against 45-minute interviews: requirements → scale → API → data model → architecture → deep dive → trade-offs, with timing budgets.
Back-of-Envelope Estimation
The numbers every senior engineer has memorised: latency constants, storage densities, network bandwidth, LLM token economics.
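As a rough illustration of the whiteboard arithmetic described above, here is a minimal sketch in Python. The function names, the peak factor of 2, and the traffic figures in the usage note are illustrative assumptions, not numbers taken from the guide or its source books.

```python
SECONDS_PER_DAY = 86_400


def average_qps(daily_active_users: int, requests_per_user: float) -> float:
    """Average queries per second, assuming traffic is spread evenly over the day."""
    return daily_active_users * requests_per_user / SECONDS_PER_DAY


def peak_qps(avg_qps: float, peak_factor: float = 2.0) -> float:
    """Peak load as a multiple of the average; the factor of 2 is an assumed default."""
    return avg_qps * peak_factor


def yearly_storage_tb(writes_per_day: int, bytes_per_write: int) -> float:
    """Raw storage added per year in decimal terabytes, before replication or compression."""
    return writes_per_day * bytes_per_write * 365 / 1e12


if __name__ == "__main__":
    avg = average_qps(daily_active_users=86_400_000, requests_per_user=1)
    print(avg, peak_qps(avg), yearly_storage_tb(1_000_000_000, 1_000))
```

With the hypothetical inputs in the usage block, 86.4 million daily users making one request each averages 1,000 QPS, and one billion 1 KB writes per day adds 365 TB per year before replication.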
② Distributed Systems Core
These six topics underpin every "Design X" question. They're also where interviewers probe most aggressively for depth: being able to cleanly articulate CAP, linearizability vs. serializability, quorum math, and LSM vs. B-tree trade-offs signals senior-level thinking.
Replication & Consistency
Single-leader, multi-leader, leaderless. CAP vs PACELC. Read-your-writes, monotonic reads, linearizability.
Partitioning & Sharding
Hash vs range. Consistent hashing. Hot-partition mitigation. Rebalancing. Secondary indexes in partitioned DBs.
Consensus & Coordination
Paxos / Raft intuition. Leader election. ZooKeeper / etcd patterns. Fencing tokens.
Transactions & Isolation
ACID. Weak isolation anomalies. Snapshot isolation, SSI. Distributed transactions, Saga, 2PC.
Stream & Batch Processing
Kafka. MapReduce. Lambda vs Kappa. Exactly-once. Event time vs processing time.
Storage Engines
B-trees vs LSM trees. Write/read amplification. Column stores. Compaction strategies.
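The "quorum math" mentioned in this section's introduction can be sketched in a few lines. This is a minimal illustration of the w + r > n overlap condition for leaderless replication; the function names are hypothetical. Overlap is a necessary condition for a read to see the latest acknowledged write, not a full consistency guarantee on its own.

```python
def quorum_overlaps(n: int, w: int, r: int) -> bool:
    """True when every read quorum shares at least one replica with every write quorum."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("w and r must each be between 1 and n")
    return w + r > n


def tolerated_failures(n: int, w: int, r: int) -> dict:
    """How many replicas can be unavailable while writes and reads still reach quorum."""
    return {"writes": n - w, "reads": n - r}


if __name__ == "__main__":
    # n=3, w=2, r=2 is a common textbook configuration.
    print(quorum_overlaps(3, 2, 2), tolerated_failures(3, 2, 2))
```

Lowering w or r below the overlap threshold trades freshness for latency and availability, which is the kind of trade-off worth stating aloud in an interview.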
③ Classical System Designs
Even AI-first companies ask these. OpenAI's Slack, Anthropic's chat service, and webhook platforms all show that the fundamentals compound with the AI layer.
Rate Limiter
Token bucket, leaky bucket, sliding window. Distributed Redis + Lua. Token-based limiting for LLM APIs.
Feeds, Chat & Notifications
Fan-out on write vs read. Slack channels. Push/pull hybrids. Celebrity problem.
Webhook Delivery & Job Queues
At-least-once delivery, idempotency, retry with exponential backoff, DLQ. Poison-pill isolation.
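The token bucket listed under Rate Limiter above can be sketched as follows. This is a minimal single-process version, not the distributed Redis + Lua variant the card mentions. The class name, the `clock` parameter (injected so the behaviour can be tested deterministically), and the `cost` argument are illustrative assumptions. Passing the number of LLM tokens as `cost` is one way to approximate token-based limiting for LLM APIs.

```python
import time
from typing import Callable


class TokenBucket:
    """Single-process token bucket: refills continuously, allows bursts up to capacity."""

    def __init__(self, capacity: float, refill_per_sec: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self._clock = clock
        self._tokens = capacity       # start with a full bucket
        self._last = clock()          # timestamp of the last refill

    def allow(self, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then spend `cost` tokens if enough are available."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_sec)
        if cost <= self._tokens:
            self._tokens -= cost
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(capacity=5, refill_per_sec=1.0)
    print([bucket.allow() for _ in range(7)])
```

A distributed version would need the refill-and-spend step to run atomically in a shared store, which is the role the Redis + Lua combination usually plays.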
④ LLM Systems (the core differentiator)
This is what OpenAI and Anthropic care about most. If you can only go deep on one section of this guide, make it this one. Expect questions about serving internals (KV cache, continuous batching, speculative decoding), RAG, agents, evals, and distributed training.
LLM Serving & Inference
Prefill vs decode. KV cache & PagedAttention. Continuous batching. Speculative decoding. TTFT vs ITL.
RAG Architecture
Chunking strategy, hybrid retrieval, cross-encoder rerank, HyDE, citation grounding, multi-modal.
Agentic Design Patterns
Tool use, ReAct, planner/executor split, multi-agent orchestration, MCP, memory, reflection.
LLM Evaluation
Offline benchmarks, LLM-as-judge bias, regression suites, online A/B with guardrails, human raters.
Distributed Training
DP, TP, PP, ZeRO/FSDP, 3D parallelism. Checkpointing, gradient accumulation, mixed precision.
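For the KV cache listed under LLM Serving & Inference above, a back-of-envelope sizing helper is often useful. The sketch below uses the standard accounting of one key and one value tensor per layer; the function name and the model shape in the usage block (32 layers, 32 KV heads, head dimension 128, 2-byte elements) are hypothetical and chosen only for illustration.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache: 2 tensors (K and V) per layer, per KV head, per token."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical model shape, 16-bit cache entries.
    per_token = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=1)
    per_request = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096)
    print(per_token / 2**20, "MiB per token")
    print(per_request / 2**30, "GiB per 4096-token request")
```

For this assumed shape the cache costs 0.5 MiB per token, so a single 4,096-token request holds 2 GiB. That arithmetic is why techniques that reduce fragmentation or the number of KV heads matter so much for batch size.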
⑤ ML System Design
Classical ML is still part of the loop, especially ranking, recommendation, and moderation. Chip Huyen's Designing ML Systems is the backbone of this section.
ML Lifecycle & Platform
Feature store, training pipeline, model registry, serving, shadow mode, canary.
Drift Detection & Monitoring
Data/label/concept drift. PSI, KS, KL. Monitoring stack (Arize, WhyLabs). LLM-specific drift.
Recommenders & Ranking
Candidate generation, retrieval, ranking. Two-tower, wide & deep. Embedding-based recall. LLM-assisted ranking.
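The PSI metric listed under Drift Detection & Monitoring above compares a binned reference distribution with a binned live distribution. The following is a minimal sketch that takes pre-computed bin proportions; the function name and the `eps` clamp used to avoid taking the log of zero are illustrative assumptions, and production monitoring tools may bin and smooth differently.

```python
import math
from typing import Sequence


def population_stability_index(expected: Sequence[float],
                               actual: Sequence[float],
                               eps: float = 1e-6) -> float:
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    if len(expected) != len(actual):
        raise ValueError("both distributions must use the same bins")
    total = 0.0
    for e, a in zip(expected, actual):
        e = max(e, eps)  # clamp empty bins so the logarithm stays defined
        a = max(a, eps)
        total += (a - e) * math.log(a / e)
    return total


if __name__ == "__main__":
    reference = [0.5, 0.5]   # training-time share of traffic in each bin
    live = [0.6, 0.4]        # share observed in production
    print(round(population_stability_index(reference, live), 4))
```

Identical distributions score zero, and the score grows as the live distribution moves away from the reference.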
⑥ Safety & Alignment Engineering
A hallmark of Anthropic interviews and increasingly emphasised at OpenAI, so expect at least one question here. Covers Constitutional AI, jailbreak defence, red-teaming infrastructure, and content moderation pipelines.