Resources
Everything you should read, watch, practice against, and star — ranked and annotated for OpenAI and Anthropic system design interview prep. Start with the top three books, layer in the blogs, then use the platforms for drill practice.
① Canonical Books, Ranked
Eight books covering distributed systems, ML systems, LLM systems, agents, and the interview process itself. Rank reflects leverage per hour of reading for OpenAI/Anthropic prep, not general quality.
| # | Title / Author | Why | Key chapters | Best for |
|---|---|---|---|---|
| 1 | Designing Data-Intensive Applications · Martin Kleppmann · O'Reilly · 2017 | The distributed-systems bible. Every deep-dive at either company expects DDIA-level mental models. | Ch.5 (Replication), Ch.7 (Transactions), Ch.9 (Consensus), Ch.11 (Streams) | distributed systems |
| 2 | Designing Machine Learning Systems · Chip Huyen · O'Reilly · 2022 | Canonical ML-system lifecycle reference. Terminology in most ML interviews comes directly from this book. | Ch.7 (Deployment), Ch.8 (Drift), Ch.9 (Continual), Ch.10 (MLOps) | ML systems |
| 3 | System Design Interview Vol. 1 · Alex Xu · 2nd ed. · 2020 | The interview-ready templates. Best source of practice answers with clear diagrams. | Ch.3 (framework), Ch.4 (rate limiter), Ch.6 (KV store), Ch.11 (news feed) | interview templates |
| 4 | System Design Interview Vol. 2 · Alex Xu & Sahn Lam · 2022 | Sequel to Vol. 1 covering the real-time, streaming, payments, and storage templates missing from the original: proximity service, metrics/alerting, ad aggregation, S3-like object store. | Ch.4 (Message Queue), Ch.5 (Metrics/Alerting), Ch.6 (Ad Click Aggregation), Ch.9 (S3 Object Storage) | streaming · storage |
| 5 | Agentic Design Patterns · Antonio Gulli · Springer · 2024 | The canonical taxonomy of agent patterns: routing, reflection, MCP, A2A, guardrails. Directly relevant to Anthropic agent questions. | Ch.5 (Tool Use), Ch.7 (Multi-Agent), Ch.14 (RAG), Ch.18 (Guardrails) | agents · safety |
| 6 | Acing the System Design Interview · Zhiyong Tan · Manning · 2024 | Best single source on the interview process itself: NFRs, reflection, self-assessment, functional partitioning. | Ch.1-3 (framework + NFRs), Ch.4 (DB scaling), Ch.13 (CDN), Ch.16 (feed) | interview process |
| 7 | Machine Learning Design Interview · Khang Pham · 2022 | Case-by-case ML architectures from YouTube, Feed, Airbnb, LinkedIn. Complements Chip Huyen. | Ch.2 (primer), Ch.3 (YouTube), Ch.4 (feed), Ch.7 (Airbnb), Ch.8 (search) | ML case studies |
| 8 | ByteByteGo Big Archive 2023 · Alex Xu (compilation) · 2023 | Visual cheat sheet for breadth recall: latency numbers, load balancing, DB sharding, real-world tech stacks. | Latency numbers; LB algorithms; DB sharding; Kafka deep dive; Netflix/Uber stacks | visual recall |
② Essential Blogs & Newsletters
Books give you structure; blogs give you freshness. These eight cover 90% of what a prepared candidate cites in 2026.
Chip Huyen
huyenchip.com: long-form essays on ML systems, the LLM stack, and agent evals; author of DMLS.
Eugene Yan
eugeneyan.com: pragmatic ML patterns from Amazon; great on RAG, evaluation, and product ML.
Hamel Husain
hamel.dev: hands-on LLM evals, fine-tuning field notes, and practical agent debugging.
Anthropic Engineering
anthropic.com/news: Constitutional AI, RSP, interpretability, tool use. Required reading before any Anthropic round.
OpenAI Engineering
openai.com/blog: model launches, system cards, the Preparedness framework, API best practices.
High Scalability
highscalability.com: a decade of "how X scaled" case studies across Netflix, Discord, Reddit, WhatsApp.
AWS / GCP Architecture
aws.amazon.com/blogs/architecture & cloud.google.com/blog: reference architectures used in production at Fortune 500 customers.
The Batch
deeplearning.ai/the-batch: Andrew Ng's weekly summary of ML / LLM news, with editorial commentary.
③ Courses
Pick one paid course; do not buy several. They overlap heavily, and the returns on a second course are low.
ByteByteGo (subscription)
bytebytego.com: Alex Xu's video course, in the same visual style as the books. Good for passive review.
Grokking the System Design Interview (one-time purchase)
Educative.io: the longest-standing interactive course; text-based with inline diagrams.
Hello Interview (freemium)
hellointerview.com: FAANG-focused, with mock-interview videos; newer and targeted at L5+/staff.
Exponent (subscription)
tryexponent.com: peer mock interviews plus curated video answers; strongest for actually practicing out loud.
④ Interview Practice Platforms
Cross-reference what interviewers at OpenAI and Anthropic have actually been asking in the last three months. Signal-to-noise varies; use several platforms.
LeetCode Discuss · System Design
leetcode.com/discuss: search by company tag ("OpenAI", "Anthropic") for the freshest question reports.
TeamBlind
teamblind.com: search "Anthropic system design" and "OpenAI onsite". Threads often include loop structure, interviewer styles, and comp.
Jointaro
jointaro.com: senior-engineer community with staff-level interview threads and coaching sessions.
PracHub
prachub.com: crowd-sourced interview-question database; filter by company and round.
Glassdoor
glassdoor.com: search "Anthropic interview questions" / "OpenAI interview questions". Older material, but broad.
⑤ Key Papers to Memorise
You do not need to have read each one end-to-end, but you must know the core claim, the key numbers, and the one diagram each paper is famous for. Interviewers love "so what's the intuition behind X?" questions.
| Paper | Year | Why it matters |
|---|---|---|
| Dynamo | 2007 | Eventually consistent KV store; consistent hashing + quorums + vector clocks. Alex Xu Ch.6 is a direct descendant. |
| Bigtable | 2006 | LSM-tree-based wide-column store; direct ancestor of HBase, Cassandra, ScyllaDB. |
| Kafka | 2011 | Durable, partitioned, replayable log. The default answer to "how do your services talk?" |
| MapReduce | 2004 | Batch-processing abstraction that launched the entire big-data ecosystem. |
| Raft | 2014 | Understandable consensus. Know leader election, log replication, and the safety properties. |
| vLLM / PagedAttention | 2023 | Kwon et al.: KV cache managed as paged memory; 2-4x throughput over prior serving systems. Foundational for LLM serving interviews. |
| FlashAttention | 2022 | Dao et al.: IO-aware attention kernel; linear memory instead of quadratic. Enables long-context training. |
| Speculative Decoding | 2023 | Leviathan et al.: a draft-verify loop gives 2-3x decode speedup without quality loss. |
| Constitutional AI | 2022 | Bai et al.: self-critique against written principles; foundation of Anthropic's alignment stack. |
| GPT-3 | 2020 | Brown et al.: in-context / few-shot learning at 175B scale. The paper that changed the field. |
| GPT-4 Technical Report | 2023 | OpenAI: predictable scaling, system-card structure, and the "model spec" approach. |
| Megatron-LM | 2019 | Shoeybi et al.: the tensor-parallelism recipe still used everywhere in 2026. |
| ZeRO / FSDP | 2019 | Rajbhandari et al.: optimiser / gradient / parameter sharding; lets you fit 100B+ models on commodity clusters. |
| Chinchilla | 2022 | Hoffmann et al.: compute-optimal scaling at roughly 20 tokens per parameter. Overturned GPT-3-era intuitions. |
| Scaling Laws | 2020 | Kaplan et al.: predictable loss vs compute/params/data; the planning tool for all foundation-model teams. |
| LoRA | 2021 | Hu et al.: low-rank adapters; the default fine-tuning method for 2024-2026. |
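Dynamo's consistent-hashing ring is the most re-asked "intuition behind X" of the batch. A minimal sketch, assuming virtual nodes and clockwise replica selection as in the paper (the class name, vnode count, and node labels are illustrative, not from Dynamo itself):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # md5 keeps the ring stable across processes (unlike built-in hash()).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Dynamo-style ring: each node owns many virtual points, so adding
    or removing one node only remaps roughly 1/N of the keys."""

    def __init__(self, nodes, vnodes: int = 100):
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def lookup(self, key: str, replicas: int = 3):
        """Walk clockwise from the key's position, collecting the first
        `replicas` distinct nodes (Dynamo's preference list)."""
        idx = bisect.bisect(self._keys, _hash(key)) % len(self._ring)
        found = []
        while len(found) < replicas:
            node = self._ring[idx % len(self._ring)][1]
            if node not in found:
                found.append(node)
            idx += 1
        return found

ring = ConsistentHashRing(["node-a", "node-b", "node-c", "node-d"])
print(ring.lookup("user:42"))  # three distinct nodes, in ring order
```

Being able to explain why virtual nodes smooth the load distribution, and why only ~1/N of keys move on membership change, is usually the whole point of the question.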
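The LoRA row is easy to back up with numbers. A back-of-envelope sketch of why a rank-r update W + B·A (B: d_out×r, A: r×d_in) shrinks the trainable-parameter count so dramatically (the helper function is ours, not from the paper):

```python
def lora_params(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    """Trainable parameters for a full fine-tune of one weight matrix
    vs. a LoRA update W + B @ A with B: (d_out, rank), A: (rank, d_in)."""
    full = d_out * d_in
    lora = d_out * rank + rank * d_in
    return full, lora

# One 4096x4096 attention projection at rank 8, a typical LoRA setting:
full, lora = lora_params(4096, 4096, rank=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
# -> full: 16,777,216  lora: 65,536  ratio: 0.39%
```

Per adapted matrix you train well under 1% of the weights, which is why LoRA checkpoints are megabytes rather than gigabytes and why you can serve many adapters against one frozen base model.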
⑥ GitHub Repos to Know
Skim the READMEs and scan one core file in each. Being able to say "I have actually looked at vLLM's scheduler" separates serious candidates from the rest.
vLLM
PagedAttention + continuous-batching reference implementation. Read vllm/engine/llm_engine.py.
DeepSpeed
ZeRO / 3D-parallel training library. Study the ZeRO stage-3 docs.
Megatron-LM
NVIDIA's tensor-parallel reference for large LLMs; the canonical TP splits.
TensorRT-LLM
Production inference runtime on NVIDIA GPUs: fused kernels, quantisation, in-flight batching.
PyTorch FSDP
Fully Sharded Data Parallel inside PyTorch core. See torch/distributed/fsdp/.
Ray
Distributed task runtime powering many LLM training / serving stacks; Ray Serve for online serving.
llama.cpp
CPU and quantised-inference reference. The go-to example when the interview turns to edge / on-device.
Anthropic Performance Take-home
github.com/anthropics/performance-takehome: their public performance-engineering take-home; a direct window into the bar.