Real Interview Questions Arena

100 verified system design questions from OpenAI, Anthropic, Google and xAI interviews, collected from LeetCode, Blind, Exponent, PracHub, Glassdoor, Jointaro, GitHub, 小红书 (Xiaohongshu) and company engineering blogs. Every question links to its primary source. Click any card to open the detailed solution page with architecture diagrams, API, data model, trade-offs, and expected follow-ups.

O1 Design a Webhook Delivery Platform
OpenAI ★★★ Hard

Billions of events/day, 24h retry window, per-endpoint ordering, idempotency, DLQ, multi-tenant isolation, cost controls.
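
One way to reason about the 24h retry window is to write the retry schedule down. The sketch below assumes capped exponential backoff; the function name `retry_schedule` and the base, factor and cap values are illustrative and not taken from the source question.

```python
def retry_schedule(base_s=30.0, factor=2.0, cap_s=3600.0, window_s=24 * 3600.0):
    """Return retry offsets in seconds after the first failed delivery.

    Assumed policy: exponential backoff starting at base_s, multiplied by
    factor after each attempt, capped at cap_s, and stopped once the next
    retry would fall outside window_s. After that the event would go to
    the DLQ.
    """
    offsets = []
    elapsed = 0.0
    delay = base_s
    while True:
        elapsed += delay
        if elapsed > window_s:
            break
        offsets.append(elapsed)
        delay = min(delay * factor, cap_s)
    return offsets


if __name__ == "__main__":
    schedule = retry_schedule()
    print(len(schedule), "retries; first three at", schedule[:3])
```

A production scheduler would normally add random jitter to each delay so that retries to a recovering endpoint do not arrive in lockstep.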

O2 Design a Webhook Service (REST API)
OpenAI ★★★ Hard

REST resource semantics, cache invalidation, DB schema, queue retry semantics. Make the cache-aside vs write-through trade-offs explicit.

O3 Webhook Platform with External URL Lookup (24h)
OpenAI ★★ Hard

An external dependency (ServiceB) for URL lookup plus a 24h retry window forces you to design a robust state machine, a TTL config cache, and a circuit breaker.

O4 Design Slack
OpenAI ★★★ Hard

Real-time chat, channels, presence, offline delivery, fan-out. The "2-week MVP" framing is a scope-discipline trap.

O5 Design a CI/CD System (GitHub Actions)
OpenAI ★★★ Hard

Multi-tenant workflow engine, runner pool, log/artifact storage, lease-based scheduling. The interviewer stresses "reliability first".

O6 Design GitHub Actions from Scratch
OpenAI ★★ Hard

Builds on O5 with productization: YAML config parsing/versioning, event integration, secret management, permission tokens, audit.

O7 CI/CD with Linear Multi-Step Pipeline
OpenAI ★★ Medium-Hard

Minimum distributed-workflow subset: state-driven scheduling via CDC, step table, lease-based serialization.

O8 Search/Recommendation with LLMs
OpenAI ★★ Hard

Hybrid retrieval (BM25 + vector), reranker, where to insert the LLM (query rewrite vs summarize), offline/online eval.

O9 Design an In-Memory Database
OpenAI ★★★ Hard

SET/GET/DEL with TTL + range scans; evolve to WAL, snapshots, sharding, replication. Follow-ups add GROUP BY, ORDER BY.

O10 Fault-Tolerant Polite Web Crawler @10M RPS
OpenAI ★★ Hard

URL frontier, politeness scheduler, robots.txt cache, Bloom-filter dedup, canonicalization, content-hash for idempotent writes.
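
The Bloom-filter dedup step can be sketched in a few lines. This is a minimal in-memory version under stated assumptions: the class name `BloomFilter`, the bit-array size, and the double-hashing scheme built on SHA-256 are illustrative choices, not something the source prescribes.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: no false negatives, tunable false-positive rate."""

    def __init__(self, num_bits=1 << 20, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two 64-bit halves
        # of a single SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


if __name__ == "__main__":
    seen = BloomFilter()
    url = "https://example.com/page"
    if url not in seen:
        seen.add(url)  # first sighting: enqueue for fetching
    print(url in seen)
```

URLs should be canonicalized before they are added, otherwise trivially different spellings of the same page bypass the filter. Because false positives are possible, a crawler using this sketch would occasionally skip a page it has not actually seen.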

O11 Design the OpenAI Playground
OpenAI ★★★ Hard

Frontend wireframe, API layer, thread/message history schema, prompt versioning, streaming output, multi-model selector.

O12 Design ChatGPT for 100M Users
OpenAI ★★ Hard

End-to-end scaling: session persistence, GPU fleet management, regional routing, conversation storage, usage metering.

O13 NSFW / Safety Detection for ChatGPT Outputs
OpenAI ★★ Hard

Data collection pipeline, model choice (rule vs classifier vs LLM-judge), latency budget, feedback loop, red-team flywheel.

O14 GPU Credit / Quota Scheduling System
OpenAI ★★ Hard

Credit calculator with expiration, FIFO-consume-oldest semantics, prevent double-spend, fair-queue per tenant.
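
The credit semantics listed above (expiration, oldest-grant-first consumption, no double-spend) can be sketched as a small single-tenant ledger. Everything here is an assumption for illustration: the `CreditLedger` name, integer credits, caller-supplied timestamps, and the use of a request id as the idempotency key. A real system would persist this state and make `consume` transactional.

```python
from collections import deque


class CreditLedger:
    """In-memory credit ledger for one tenant (illustrative only)."""

    def __init__(self):
        self._grants = deque()  # [expires_at, remaining], oldest grant first
        self._charged = set()   # request ids that were already charged

    def grant(self, amount, expires_at):
        self._grants.append([expires_at, amount])

    def _live(self, now):
        return [g for g in self._grants if g[0] > now and g[1] > 0]

    def balance(self, now):
        return sum(g[1] for g in self._live(now))

    def consume(self, request_id, amount, now):
        """Charge `amount` credits; return False if the balance is too low."""
        if request_id in self._charged:
            return True  # replayed request: do not charge twice
        live = self._live(now)
        if sum(g[1] for g in live) < amount:
            return False  # all-or-nothing: never partially charge
        for g in live:  # oldest grant first
            take = min(g[1], amount)
            g[1] -= take
            amount -= take
            if amount == 0:
                break
        self._charged.add(request_id)
        # Drop expired and exhausted grants.
        self._grants = deque(g for g in self._grants if g[0] > now and g[1] > 0)
        return True


if __name__ == "__main__":
    ledger = CreditLedger()
    ledger.grant(10, expires_at=100)
    ledger.grant(5, expires_at=50)
    print(ledger.consume("req-1", 12, now=0), ledger.balance(now=0))
```

Consuming in grant order matches the "FIFO-consume-oldest" wording; a variant that drains the soonest-expiring grant first would waste fewer credits when grants have different lifetimes.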

O15 Streaming Token Response System
OpenAI ★★ Hard

SSE/WebSocket, low TTFT, in-stream moderation pipeline, backpressure, reconnect with cursor, server-to-client event model.

O16 LLM-powered Enterprise Search (RAG)
OpenAI ★★ Hard

RAG pipeline: ingestion → chunking → embedding → vector DB → hybrid retrieval → rerank → LLM with citations. Hallucination mitigation.

O17 Design a Rate Limiter
OpenAI ★★ Medium

Token bucket vs leaky bucket vs sliding window. Token-based billing for LLM APIs. Distributed coordination via Redis.
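
As a reference point for the comparison above, here is a minimal single-process token bucket. The class name `TokenBucket` and the injectable `clock` parameter are illustrative assumptions. The `cost` argument lets one request consume several units, which is one way to model limits counted in LLM tokens rather than in requests.

```python
import time


class TokenBucket:
    """Refills at `rate` units per second up to `capacity` units."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity  # start full, so an initial burst is allowed
        self.updated = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        # Lazily refill based on the time elapsed since the last call.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=5, capacity=10)
    print([bucket.allow() for _ in range(12)])
```

In a distributed deployment the same refill-and-deduct step would have to run atomically in a shared store, which is where the Redis coordination mentioned above comes in.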

O18 Design a Vector Database
OpenAI ★★ Hard

Store and search billions of embeddings. ANN algorithms (HNSW, IVF-PQ), sharding, hybrid filter queries, ingestion pipeline.

O19 Distributed ML Training Platform
OpenAI ★★ Staff-level

Orchestrate training across thousands of GPUs: DP/TP/PP, ZeRO, checkpointing, fault tolerance, job scheduler, bandwidth-aware routing.

O20 Design a URL Shortener (Shorten URL)
OpenAI ★★★ Medium

One of OpenAI's five SD-pool classics. Base62 vs hash-of-URL, redirect path cache, write/read ratio, analytics, custom aliases, expiration.
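
The Base62 option is easy to make concrete: a numeric ID (for example from a database sequence, which is an assumption here) is rendered in a 62-character alphabet to form the short code. The alphabet ordering and function names below are illustrative choices.

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
BASE = len(ALPHABET)  # 62


def encode(n):
    """Encode a non-negative integer ID as a Base62 string."""
    if n < 0:
        raise ValueError("ID must be non-negative")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, rem = divmod(n, BASE)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))


def decode(code):
    """Decode a Base62 string back to the integer ID."""
    n = 0
    for ch in code:
        n = n * BASE + ALPHABET.index(ch)  # raises ValueError on bad chars
    return n


if __name__ == "__main__":
    short = encode(123456789)
    print(short, decode(short))
```

Seven Base62 characters cover 62^7 (about 3.5 trillion) IDs. Sequential IDs are guessable, which is one reason the hash-of-URL alternative comes up in the trade-off discussion.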

O21 Design a Chat Room
OpenAI ★★ Medium-Hard

Narrower than Slack (O4): single/multi chat rooms, WebSocket + pub/sub, presence, message ordering, history pagination. No DMs, workspaces, or threads.

O22 Design a Toy Language Interpreter
OpenAI ★★ Hard

Language-runtime flavor: lexer → parser → AST → tree-walking evaluator. Sandbox, memory/time limits, stdlib surface area, REPL vs script mode, error UX.

O23 Enterprise GPT for a 20K-Employee Company (RAG + ACL)
OpenAI ★★★ Hard

"Strong Hire" real question. Four layers: ingestion/chunking → retriever → evaluator → generator. Traceable citations, per-doc ACLs, P95 < 2s, English/Chinese bilingual support, online learning from feedback.

A11 High-Concurrency LLM Inference Service
Anthropic ★★★★ Hard

Streaming tokens, prefill vs decode split, KV cache, continuous batching, tail latency, GPU memory management. The canonical Anthropic question.

A12 GPU Inference Request Batching
Anthropic ★★★ Hard

Flush policies (size/age/length-spread), head-of-line blocking mitigation, admission control, observability, overload handling.

A13 Inference Routing & Scheduling Layer
Anthropic ★★★ Hard

Priority queues, credit-based fairness (WFQ/DRR), result cache for deterministic requests (temp=0), heterogeneous hardware pools.

A14 Batch Inference API
Anthropic ★★★ Hard

POST job + poll for results, idempotency, partial batch failures, result pagination, cost-optimized off-peak scheduling.

A15 Multi-Model GPU Inference API
Anthropic ★★★ Hard

Control plane + data plane split, model registry, canary + rollback, A/B routing, autoscaling, warm/cold tiers.

A16 Low-Latency ML Inference API
Anthropic ★★★ Hard

SLOs (p95, availability, QPS), online feature store, rollout/canary, drift detection, degradation strategies.

A17 Review an Inference API Design for Scale
Anthropic ★★ Hard

Design-review rubric: fill in missing SLOs, find single points of failure, prioritize improvements (safety → efficiency → cost).

A18 Model Downloader & Artifact Distribution
Anthropic ★★ Medium-Hard

Manifest-driven releases, atomic symlink switch, rollback, thundering-herd avoidance, audit trail, integrity validation.

A19 Prompt Playground / Experiment Platform
Anthropic ★★ Medium

Prompt versioning, experiment management, side-by-side eval, prompt caching, collaboration/ACL, huge-context strategy.

A20 Performance Take-Home (Optimization)
Anthropic ★★ Hard

Optimize simulated machine cycles through benchmark-driven iteration. Includes an explicit warning that LLMs can "cheat" by modifying tests.

A21 Design Claude Chat Service
Anthropic ★★ Hard

Session management, streaming output, token-level billing, log aggregation, safety filter integration.

A22 P2P File Distribution (BitTorrent-style)
Anthropic ★★ Hard

Bandwidth-constrained distribution of large files (model weights, datasets) to thousands of machines. Peer discovery, tit-for-tat, chunk selection.

A23 Handle 100K RPS for LLM Token Generation
Anthropic ★★ Hard

Horizontal scaling for throughput, request routing across replicas, GPU pool sizing, load-based autoscaling.

A24 Design a Key-Value Store (Dynamo-style)
Anthropic ★★ Hard

Consistent hashing, quorum reads/writes, vector clocks for conflict resolution, Merkle trees for anti-entropy, gossip for membership.
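
Of the building blocks above, consistent hashing is the one most often sketched on the whiteboard. The version below is a minimal ring with virtual nodes; the `HashRing` name, the use of SHA-256, and the default of 100 virtual nodes per physical node are assumptions for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring: a key belongs to the next node clockwise."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._positions = []  # sorted ring positions
        self._owner = {}      # ring position -> physical node
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value):
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, node):
        for i in range(self.vnodes):
            pos = self._hash(f"{node}#{i}")
            self._owner[pos] = node
            bisect.insort(self._positions, pos)

    def remove(self, node):
        for i in range(self.vnodes):
            pos = self._hash(f"{node}#{i}")
            if self._owner.get(pos) == node:
                del self._owner[pos]
                self._positions.remove(pos)

    def lookup(self, key):
        if not self._positions:
            raise LookupError("ring is empty")
        # First position strictly after the key's hash, wrapping around.
        idx = bisect.bisect(self._positions, self._hash(key))
        return self._owner[self._positions[idx % len(self._positions)]]


if __name__ == "__main__":
    ring = HashRing(["node-a", "node-b", "node-c"])
    print(ring.lookup("user:42"))
```

The property that matters in the interview is that removing a node only moves the keys that node owned; every other key keeps its owner. A Dynamo-style store would extend `lookup` to return the next N distinct nodes as the replica set.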

A25 Design an Agentic AI System
Anthropic ★★ Hard

Agent loop (reason → plan → act), tool use via MCP, short/long-term memory, multi-agent coordination, sandbox and guardrails, infinite-loop prevention.

A26 Design a Web Crawler
Anthropic ★★ Hard

Distributed fetching, deduplication, multi-threaded/async fetching, rate control. Interviewers will follow up on scaling and robots.txt handling.

A27 Design a Banking App
Anthropic ★★ Medium-Hard

Traditional: transactional consistency, double-entry ledger, idempotent transfers, fraud detection, audit log, regulatory compliance.

A28 Distributed Search for 1B Documents @1M QPS
Anthropic ★★★ Hard

Sharding strategies, hot-spot avoidance, multi-level cache, LLM inference scaling, GPU memory optimization.

A29 Model Serving Platform for LLMs
Anthropic ★★★ Hard

Open-ended: clarify requirements → high-level architecture → safety/latency/reliability trade-offs. You must drive the conversation.

A30 Design Instagram (Feed Generation)
Anthropic ★★★ Hard

Feed generation is the crux: push vs pull vs hybrid. The interviewer will push on celebrities with millions of followers, where the expected answer is hybrid. Follow-ups probe DB read scaling.

O24 Design a Distributed Job Queue for ML Workloads
OpenAI ★★ Hard

Heterogeneous GPU jobs scheduled across a pool with priority, preemption, at-least-once delivery + idempotent handlers.

O25 Design OpenAI Batch Inference API
OpenAI ★★★ Hard

Async endpoint: upload JSONL, poll for completion. 24h SLA at a 50% discount. Scheduler fills idle GPU capacity.

O26 Design OpenAI Assistants / Threads API
OpenAI ★★ Hard

Server-side conversational state: persistent threads, tool calls, file search, SSE streaming, run lifecycle.

O27 Design a Fine-Tuning Platform
OpenAI ★★ Hard

Dataset upload → validation → scheduled training → model artifact → private-tag inference.

O28 Design a Feature-Flag Platform
OpenAI ★★ Medium

Targeting rules, percentage rollout, SDK, low-latency evaluation, audit trail, A/B analysis.

O29 Design a Realtime Voice Backend (OpenAI Realtime API)
OpenAI ★★ Hard

Bidirectional audio with interruption, fully streamed ASR + LLM + TTS, first audio in under 300 ms.

O30 Design an Evals Platform
OpenAI ★★ Medium

Author graders, run suites, compare across runs, catch regressions pre-rollout.

O31 Design Prompt / Model Cache
OpenAI ★★★ Medium

Detect shared prefixes, cache KV states, serve subsequent tokens more cheaply.

O32 Design a Content Moderation Pipeline
OpenAI ★★ Hard

Input + output moderation at the API boundary; classifier ensemble; block/flag/review tiers; appeals.

O33 Design an Autocomplete Service (Codex/Copilot-like)
OpenAI ★★ Hard

IDE-triggered completions: p95 < 200 ms, very high cancellation rate, context with repo retrieval, privacy tiers.

O34 Design Image Generation Serving (DALL-E)
OpenAI ★★ Hard

Text-to-image path, variable step counts, upscaling, NSFW filters, CDN for outputs.

O35 Design Usage Metering & Billing for LLM API
OpenAI ★★ Medium

Per-request token counts, aggregation by user/org, daily billing, rate-limit enforcement, audit-grade records.

O36 Design a Distributed Log / Trace Pipeline
OpenAI ★★ Medium

High-cardinality logs/traces from the GPU fleet, cost-controlled sampling, fast search, retention tiers.

O37 Design a Tool-Use Sandbox for Agents
OpenAI ★★ Hard

Execute untrusted code/tool calls produced by LLMs in isolation, with filesystem, network, and time limits.

O38 Design a Multi-Region API Gateway
OpenAI ★★ Medium

Global front door for API traffic: auth, rate limiting, routing to regional fleets, automatic failover.

A31 Design a Rate Limiter for the Claude API
Anthropic ★★★ Medium

Per-org RPM/TPM/TPD buckets, tier-based limits, token-level rather than only request-level accounting, priority passes for enterprise.

A32 Design Anthropic's Safety Pipeline
Anthropic ★★★ Hard

Input + output + tool-use moderation, policy taxonomy, ASL gating, red-team loop, eval feedback.

A33 Design Claude-for-Work RAG (Enterprise)
Anthropic ★★ Hard

Corporate data (Slack, GDrive, Notion) ingested with ACLs, retrieval respects per-user permissions, zero cross-tenant leakage.

A34 Design Conversation Memory for Claude
Anthropic ★★ Hard

Long-running conversations beyond the context window, user-specific memory, opt-in/out, privacy controls.

A35 Design the MCP Server Registry
Anthropic ★★ Medium

Directory of Model Context Protocol servers with discovery, versioning, signing, security review, per-user install scope.

A36 Design a Code Execution Sandbox for Claude
Anthropic ★★ Hard

Run agent-generated Python/Bash safely: network policy, filesystem policy, resource caps, artifact return.

A37 Design a Training Checkpoint Service
Anthropic ★★ Hard

At 10k+ GPUs, checkpoint every N steps without stalling training. Async sharded writes, resumable.

A38 Design Claude's Prompt Caching Service
Anthropic ★★★ Medium

Ephemeral KV cache keyed on prompt prefix; 5-minute TTL; explicit cache_control markers; 90% discount on hits.

A39 Design an Evals Platform for Alignment Research
Anthropic ★★ Medium

Runs Constitutional AI + safety evals, records red-team results, compares across checkpoints, blocks regressions.

A40 Design Anthropic's Billing Pipeline
Anthropic ★★ Medium

Per-second token metering, monthly invoicing, commit plans, credit card + invoice billing, dispute flow.

A41 Design a Red-Team Detection System
Anthropic ★★ Hard

Detect probing/jailbreak campaigns across users, cluster attack patterns, feed findings back to the safety team.

G1 Design Gemini API Serving
Google ★★ Hard

Google's Gemini API: multimodal (text + image + video + audio), 2M-token context, TPU fleet, global endpoints.

G2 Design Vertex AI Training Pipelines
Google ★★ Hard

Managed pipelines: data → feature → train → validate → deploy; DAG, artifacts, retries, multi-tenant.

G3 Design NotebookLM
Google ★★ Hard

Upload sources → chat and podcast generation grounded in those sources → high citation rate.

G4 Design Google Search AI Overviews
Google ★★ Hard

Selective LLM summary above classic results: decide when to trigger, retrieve diverse sources, cite links, sub-second p95.

G5 Design a TPU Cluster Scheduler
Google ★★ Hard

Borg-style scheduler for TPU pods: topology-aware, gang scheduling, interconnect locality, preemption.

G6 Design Google Web Crawler
Google ★★★ Hard

Crawl billions of pages/day, respect robots.txt, re-crawl freshness policy, politeness per host, dedup.

G7 Design Google Search Index (Inverted Index)
Google ★★★ Hard

Build and serve an inverted index over billions of docs: term posting lists, sharding, tiered storage.

G8 Design Google Search Suggestions (Typeahead)
Google ★★★ Medium

Top-K popular prefixes, personalized, real-time trend surfacing, safety filter.

G9 Design Google Spell-Check
Google ★★ Medium

Detect misspellings, propose corrections: edit distance + language model + context-aware rerank.
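
The edit-distance stage of that pipeline can be sketched with the standard Levenshtein dynamic program. The `suggest` helper, its vocabulary argument, and the distance threshold are illustrative assumptions; the language-model and context-aware rerank stages are deliberately left out.

```python
def edit_distance(a, b):
    """Levenshtein distance using a rolling row (O(len(b)) memory)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match for free
            ))
        prev = curr
    return prev[-1]


def suggest(word, vocabulary, max_distance=2):
    """Return candidates within max_distance, closest first."""
    scored = sorted((edit_distance(word, cand), cand) for cand in vocabulary)
    return [cand for dist, cand in scored if dist <= max_distance]


if __name__ == "__main__":
    print(suggest("helo", ["hello", "help", "world"]))
```

Scanning the whole vocabulary per query does not scale; a real service would generate candidates from an index first and only then score them. Ties such as "hello" and "help" here are what the language model and context rerank are meant to break.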

G10 Design YouTube View Count
Google ★★★ Medium

Count views accurately but cheaply at billions/day; anti-inflation; approximate hot-video counts.

G11 Design YouTube Video Upload & Transcode
Google ★★★ Hard

Resumable upload of GB-scale video; parallel transcode to a resolution ladder; CDN distribution; thumbnails.

G12 Design YouTube Recommendations
Google ★★ Hard

Two-stage recsys: candidate generation (two-tower/ANN) + ranking (deep model); cold-start, freshness, diversity.

G13 Design a Video CDN / Live Streaming
Google ★★ Hard

Origin → edge pull/push; HLS/DASH; live ingest with low-latency HLS; failover between PoPs.

G14 Design Google Maps Tile Service
Google ★★ Hard

Serve pre-rendered raster/vector tiles at zoom levels 0-22; edge caching; incremental updates; offline.

G15 Design Google Maps Routing & ETA
Google ★★ Hard

Shortest path on the road graph + real-time traffic; learned ETA; billions of queries/day.

G16 Design a Globally Consistent DB (Spanner-like)
Google ★★ Hard

Globally distributed SQL with external consistency via TrueTime and 2PC over Paxos groups.

G17 Design Bigtable
Google ★★ Hard

Petabyte-scale wide-column store: tablet-based sharding, GFS for persistence, Chubby for coordination.

G18 Design Gmail Backend
Google ★★ Hard

A billion users × 10k emails each; labels, search, spam filter, attachments, IMAP/SMTP compatibility.

G19 Design Google Docs Realtime Collaboration
Google ★★★ Hard

Conflict-free multi-user editing, presence, offline, history timeline.

G20 Design AdWords Bidding & Serving
Google ★★ Hard

Second-price auction per query, ad ranking with CTR prediction, budget pacing, advertiser reporting.
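
The interaction between second-price pricing and CTR-based ranking is easier to see with numbers. The sketch below assumes a single ad slot, ranks by `bid * ctr`, and charges the winner the smallest per-click price that would still keep it ahead of the runner-up. The `Bid` type, the reserve price, and this particular pricing rule are illustrative assumptions, not a description of Google's production auction.

```python
from dataclasses import dataclass


@dataclass
class Bid:
    advertiser: str
    bid: float  # maximum price per click
    ctr: float  # predicted click-through rate


def run_auction(bids, reserve=0.01):
    """Return (winner, price_per_click), or None if nobody is eligible."""
    eligible = [b for b in bids if b.bid > 0 and b.ctr > 0]
    if not eligible:
        return None
    ranked = sorted(eligible, key=lambda b: b.bid * b.ctr, reverse=True)
    winner = ranked[0]
    if len(ranked) > 1:
        runner_up = ranked[1]
        # Lowest price at which the winner's score still matches the
        # runner-up's score: price * winner.ctr == runner_up.bid * runner_up.ctr
        price = runner_up.bid * runner_up.ctr / winner.ctr
    else:
        price = reserve
    return winner.advertiser, max(price, reserve)


if __name__ == "__main__":
    bids = [Bid("A", 2.00, 0.05), Bid("B", 3.00, 0.02), Bid("C", 1.00, 0.04)]
    print(run_auction(bids))
```

In this example advertiser A wins with a lower bid than B because its predicted CTR is higher, and it pays less than its own bid. Budget pacing would sit in front of this function and decide which bids are eligible for a given query.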

G21 Design Google Pay Transactions
Google ★★ Hard

NFC/tap-to-pay + online checkout; tokenized PAN; fraud detection; reconciliation with issuers.

G22 Design Firebase Realtime Database / Pub-Sub
Google ★★ Medium

Hierarchical JSON tree with realtime listeners on mobile clients at scale; ACL rules; offline.

G23 Design Google Photos
Google ★★ Medium

Upload, dedup, face/object classification, search, albums, sharing.

G24 Design Android Push Notifications (FCM)
Google ★★ Medium

FCM: developers push to billions of devices; priority classes; battery-efficient delivery.

G25 Design Chrome Sync
Google ★★ Medium

Sync bookmarks/passwords/tabs across devices; end-to-end encryption; conflict resolution.

X1 Design Grok's Inference Serving Stack
xAI ★★ Hard

Serve Grok to 200M+ X users with sub-second first-token latency, while competing on cost.

X2 Design DeepSearch (Agentic Web Research)
xAI ★★ Hard

Grok's DeepSearch plans multi-step web queries, fetches and summarizes pages, and returns cited answers.

X3 Design the X Firehose Ingestion for Grok Training
xAI ★★ Hard

Ingest X's real-time post firehose for Grok training and real-time features.

X4 Design X For-You Re-ranking with Grok
xAI Medium

Use Grok to re-rank candidate posts for X's For-You timeline based on user intent signals.

X5 Design Training Orchestration for 100k+ GPU Colossus
xAI ★★ Hard

Orchestrate a training run across 100,000+ H100 GPUs in Memphis with fault tolerance and checkpointing.

X6 Design Grok Voice Mode
xAI Medium

Real-time voice conversations with Grok in the X mobile app.