① Canonical Books, Ranked

Eight books covering distributed systems, ML systems, LLM systems, agents, and the interview process itself. Rank reflects leverage per hour of reading for OpenAI/Anthropic prep, not general quality.

1. Designing Data-Intensive Applications · Martin Kleppmann · O'Reilly · 2017
   Why: The distributed-systems bible. Every deep-dive at either company expects DDIA-level mental models.
   Key chapters: Ch.5 (Replication), Ch.7 (Transactions), Ch.9 (Consensus), Ch.11 (Streams)
   Best for: distributed systems

2. Designing Machine Learning Systems · Chip Huyen · O'Reilly · 2022
   Why: The canonical ML-system lifecycle reference. The terminology in most ML interviews comes directly from this book.
   Key chapters: Ch.7 (Deployment), Ch.8 (Drift), Ch.9 (Continual Learning), Ch.10 (MLOps)
   Best for: ML systems

3. System Design Interview Vol. 1 · Alex Xu · 2nd ed. · 2020
   Why: The interview-ready templates, and the best source of practice answers with clear diagrams.
   Key chapters: Ch.3 (framework), Ch.4 (rate limiter), Ch.6 (KV store), Ch.11 (news feed)
   Best for: interview templates

4. System Design Interview Vol. 2 · Alex Xu & Sahn Lam · 2022
   Why: Sequel to Vol. 1 covering the real-time, streaming, payments, and storage templates missing from the original: proximity service, metrics/alerting, ad aggregation, an S3-like object store.
   Key chapters: Ch.4 (Message Queue), Ch.5 (Metrics/Alerting), Ch.6 (Ad Click Aggregation), Ch.9 (S3-like Object Storage)
   Best for: streaming · storage

5. Agentic Design Patterns · Antonio Gulli · Springer · 2024
   Why: The canonical taxonomy of agent patterns: routing, reflection, MCP, A2A, guardrails. Directly relevant to Anthropic agent questions.
   Key chapters: Ch.5 (Tool Use), Ch.7 (Multi-Agent), Ch.14 (RAG), Ch.18 (Guardrails)
   Best for: agents · safety

6. Acing the System Design Interview · Zhiyong Tan · Manning · 2024
   Why: The best single source on the interview process itself: NFRs, reflection, self-assessment, functional partitioning.
   Key chapters: Ch.1-3 (framework + NFRs), Ch.4 (DB scaling), Ch.13 (CDN), Ch.16 (feed)
   Best for: interview process

7. Machine Learning Design Interview · Khang Pham · 2022
   Why: Case-by-case ML architectures from YouTube, Feed, Airbnb, LinkedIn. Complements Chip Huyen.
   Key chapters: Ch.2 (primer), Ch.3 (YouTube), Ch.4 (feed), Ch.7 (Airbnb), Ch.8 (search)
   Best for: ML case studies

8. ByteByteGo Big Archive 2023 · Alex Xu (compilation) · 2023
   Why: A visual cheat-sheet for breadth recall: latency numbers, load balancing, DB sharding, real-world tech stacks.
   Key sections: latency numbers; LB algorithms; DB sharding; Kafka deep dive; Netflix/Uber stacks
   Best for: visual recall

② Essential Blogs & Newsletters

Books give you structure; blogs give you freshness. These eight cover 90% of what a prepared candidate cites in 2026.

Chip Huyen

huyenchip.com — long-form essays on ML systems, the LLM stack, and agent evals; author of Designing Machine Learning Systems.

Eugene Yan

eugeneyan.com — pragmatic ML patterns from Amazon; especially strong on RAG, evaluation, and product ML.

Hamel Husain

hamel.dev — hands-on LLM evals, fine-tuning field notes, and practical agent debugging.

Anthropic Engineering

anthropic.com/news — Constitutional AI, the Responsible Scaling Policy, interpretability, tool use. Required reading before any Anthropic round.

OpenAI Engineering

openai.com/blog — model launches, system cards, the Preparedness framework, API best practices.

High Scalability

highscalability.com — a decade of "how X scaled" case studies across Netflix, Discord, Reddit, WhatsApp.

AWS / GCP Architecture

aws.amazon.com/blogs/architecture & cloud.google.com/blog — reference architectures used in production by Fortune 500 customers.

The Batch

deeplearning.ai/the-batch — Andrew Ng's weekly summary of ML/LLM news, with editorial commentary.

③ Courses

Pick one paid course; do not buy multiple. They overlap heavily, and the returns on a second course are low.

④ Interview Practice Platforms

Cross-reference what interviewers at OpenAI and Anthropic have actually been asking over the last three months. Signal-to-noise varies; use multiple platforms.

⑤ Key Papers to Memorise

You do not need to have read each one end-to-end, but you must know the core claim, the key numbers, and the one diagram each paper is famous for. Interviewers love "so what's the intuition behind X?" questions.

Dynamo (2007): Eventually consistent KV store; consistent hashing + quorums + vector clocks. Alex Xu Ch.6 is a direct descendant.
Bigtable (2006): LSM-tree-based wide-column store; direct ancestor of HBase, Cassandra, ScyllaDB.
Kafka (2011): Durable, partitioned, replayable log. The default answer to "how do your services talk?"
MapReduce (2004): The batch-processing abstraction that launched the entire big-data ecosystem.
Raft (2014): Understandable consensus. Know leader election, log replication, and the safety properties.
vLLM / PagedAttention (2023): Kwon et al. KV cache managed as paged memory; 2-4x throughput over HuggingFace. Foundational for LLM-serving interviews.
FlashAttention (2022): Dao et al. IO-aware attention kernel; linear memory instead of quadratic. Enables long-context training.
Speculative Decoding (2023): Leviathan et al. A draft-verify loop gives 2-3x decode speedup without quality loss.
Constitutional AI (2022): Bai et al. Self-critique against written principles; the foundation of Anthropic's alignment stack.
GPT-3 (2020): Brown et al. In-context / few-shot learning at 175B scale. The paper that changed the field.
GPT-4 Technical Report (2023): OpenAI. Predictable scaling, system-card structure, and the "model spec" approach.
Megatron-LM (2019): Shoeybi et al. The tensor-parallelism recipe still used everywhere in 2026.
ZeRO / FSDP (2019): Rajbhandari et al. Optimiser / gradient / parameter sharding; lets you fit 100B+ models on commodity clusters.
Chinchilla (2022): Hoffmann et al. Compute-optimal scaling: roughly 20 tokens per parameter. Overturned GPT-3 intuitions.
Scaling Laws (2020): Kaplan et al. Predictable loss vs compute/params/data; the planning tool for all foundation-model teams.
LoRA (2021): Hu et al. Low-rank adapters; the default fine-tuning method in 2024-2026.
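The Chinchilla and Scaling Laws entries are the ones most often probed with arithmetic. A minimal back-of-envelope sketch using the two standard approximations, D ≈ 20·N tokens (Chinchilla's rule of thumb) and C ≈ 6·N·D training FLOPs; the exact constants are rounded conventions, not precise figures from the papers:

```python
def chinchilla_plan(params_b: float) -> tuple[float, float]:
    """Compute-optimal training plan for a model with `params_b`
    billion parameters: tokens via D ~= 20 * N, FLOPs via C ~= 6 * N * D."""
    n = params_b * 1e9        # parameter count
    d = 20 * n                # compute-optimal training tokens
    c = 6 * n * d             # approximate training FLOPs
    return d / 1e12, c        # tokens in trillions, total FLOPs

# Chinchilla itself: 70B parameters -> ~1.4T tokens
tokens_t, flops = chinchilla_plan(70)
```

This reproduces the paper's headline pairing of a 70B model with about 1.4T tokens, and is the kind of one-line arithmetic interviewers expect you to do unprompted.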
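For Dynamo, interviewers usually want the consistent-hashing intuition rather than the full protocol. A minimal sketch of a hash ring with virtual nodes; the node names and vnode count are illustrative choices, not details from the paper:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable hash so placement is reproducible across runs.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Each physical node owns many virtual points on the ring, so
    adding or removing a node only remaps keys on that node's arcs."""

    def __init__(self, nodes, vnodes: int = 100):
        self._ring = sorted((_hash(f"{n}#{i}"), n)
                            for n in nodes for i in range(vnodes))
        self._points = [h for h, _ in self._ring]

    def node_for(self, key: str) -> str:
        # First virtual point clockwise of the key's hash (wrapping).
        i = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[i][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:42")  # deterministic for a fixed node set
```

Adding a fourth node moves only the keys whose nearest clockwise point now belongs to it; everything else stays put. That incremental-remapping property is exactly what Dynamo (and Alex Xu Ch.6) leans on.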
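Speculative decoding's draft-verify loop is easy to whiteboard. A greedy-only toy sketch: the real algorithm in Leviathan et al. verifies with rejection sampling over the two models' token distributions, and `draft` / `target` here are stand-in callables, not a real API:

```python
def speculative_step(draft, target, prefix, k: int = 4):
    """One greedy speculative-decoding step: the cheap `draft` model
    proposes k tokens sequentially; the expensive `target` model checks
    them (a single batched forward pass in practice), keeps the longest
    agreeing prefix, and contributes its own token at the first mismatch."""
    ctx = list(prefix)
    proposed = []
    for _ in range(k):                    # cheap sequential drafting
        proposed.append(draft(ctx + proposed))
    accepted = []
    for t in proposed:                    # verification
        want = target(ctx + accepted)
        accepted.append(want)             # target's token always wins
        if want != t:                     # mismatch: stop accepting drafts
            break
    return accepted
```

When the draft agrees, one expensive verification pass yields k tokens instead of one; that is the whole source of the 2-3x speedup, and quality is unchanged because every emitted token is the target's.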

⑥ GitHub Repos to Know

Skim the READMEs and scan one core file in each. Being able to say "I have actually looked at vLLM's scheduler" separates serious candidates from the rest.

vLLM

The PagedAttention + continuous-batching reference implementation. Read vllm/engine/llm_engine.py.

DeepSpeed

ZeRO / 3D-parallel training library. Study the ZeRO stage-3 docs.

Megatron-LM

NVIDIA's tensor-parallel reference for large LLMs; the canonical TP splits.

TensorRT-LLM

Production inference runtime on NVIDIA GPUs; fused kernels, quantisation, in-flight batching.

PyTorch FSDP

Fully Sharded Data Parallel inside PyTorch core. See torch/distributed/fsdp/.

Ray

The distributed task runtime powering many LLM training / serving stacks; Ray Serve for online serving.

llama.cpp

CPU and quantised inference reference. The go-to example when the interview turns to edge / on-device.

Anthropic Performance Take-home

github.com/anthropics/performance-takehome — Anthropic's public performance-engineering take-home; a direct window into their bar.