Source cross-reference

This page synthesises Chip Huyen Ch.7 (Deployment), Ch.10 (Infra & MLOps) and Khang Pham Ch.2 (core ML primer). Read Chip Huyen Ch.7 on "Batch prediction vs online prediction" and Ch.10 on "ML platforms" as the companions to this note.

Why an ML platform exists

A team with one model and one engineer needs no platform. A team with 20 models, 40 features per model, 5 training frameworks and weekly deploys cannot survive without one. Chip Huyen Ch.10 frames the problem as removing the N×M explosion: N modelling teams × M plumbing concerns (feature freshness, online lookup, rollback, monitoring). A platform collapses the M concerns into one paved road.

In interviews, when the question says "design a recommender for 100M DAU", do not jump to models. First sketch the lifecycle: data → features → training → registry → serving → monitoring → retraining, and only then drill into the piece the interviewer cares about. Scoring rubrics at both OpenAI and Anthropic reward candidates who explicitly trace a feature from source log to a served prediction.

Feature store: offline vs online split

A feature store has two physical stores with one logical API: an offline store (Hive, S3+Parquet, BigQuery) sized in terabytes for training, and an online store (Redis, DynamoDB, RocksDB, ScyllaDB) sized in gigabytes for serving with p99 < 5 ms lookups. The two must stay consistent; any skew becomes an invisible prediction bug. Feast, Tecton and Databricks Feature Store are the common concretisations; roll-your-own is a valid answer if you justify it.

flowchart LR
  logs[Event logs / CDC] --> stream[Stream job: Flink/Spark Streaming]
  logs --> batch[Batch job: Spark/dbt]
  batch --> off[(Offline store<br/>Parquet/Iceberg)]
  stream --> on[(Online store<br/>Redis/Scylla)]
  stream -.replicate.-> off
  off --> train[Training job]
  on --> serve[Inference service]
  train --> reg[Model registry]
  reg --> serve
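The "two physical stores, one logical API" split can be sketched in a few lines. This is a minimal stand-in, not any feature store's real API: plain dicts play the roles of Redis and Parquet, and the `ttl_s` argument models the per-feature freshness budget.

```python
import time

class FeatureStore:
    """Sketch of the offline/online split behind one logical API.
    In production the online half would be Redis/Scylla and the
    offline half Parquet/BigQuery; dicts and lists stand in here."""

    def __init__(self):
        self._online = {}   # (entity_id, feature) -> (value, written_at): latest only
        self._offline = []  # append-only (entity_id, feature, value, event_time) log

    def write(self, entity_id, feature, value, event_time=None):
        now = time.time() if event_time is None else event_time
        self._online[(entity_id, feature)] = (value, now)        # serving path
        self._offline.append((entity_id, feature, value, now))   # training history

    def get_online(self, entity_id, feature, ttl_s=float("inf")):
        """Serving lookup: None on a miss, or when the value is staler
        than its freshness budget (e.g. 10 s for an activity counter)."""
        hit = self._online.get((entity_id, feature))
        if hit is None:
            return None
        value, written_at = hit
        return value if time.time() - written_at <= ttl_s else None
```

The same `write` feeds both stores, which is exactly how the consistency requirement is met: skew can only come from the replication path, not from two independent pipelines.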

Key design rules:

  • Point-in-time joins on the offline side to prevent label leakage; you must reconstruct what the online store would have returned at event time.
  • Feature versioning: adding a feature never mutates an existing ID; it creates feature_v2 to preserve reproducibility.
  • TTL and freshness budget per feature: a user-activity counter might have a 10 s freshness SLO, country-of-residence 24 h.
  • Typical scale: Uber's Michelangelo reports ~10,000 features, ~2,000 online features touched per request, p99 ~2 ms from a sharded Cassandra.
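The point-in-time join rule is the one most often fumbled in interviews, so here it is made concrete. A pure-Python sketch with illustrative tuple layouts; Feast and Tecton run the equivalent join at warehouse scale:

```python
def point_in_time_join(labels, feature_log):
    """For each (entity_id, label_time), pick the latest feature value with
    event_time <= label_time -- i.e. what the online store would have
    returned at event time. Anything later would be label leakage.

    labels:      list of (entity_id, label_time)
    feature_log: list of (entity_id, value, event_time)
    """
    out = []
    for entity_id, label_time in labels:
        candidates = [(t, v) for (e, v, t) in feature_log
                      if e == entity_id and t <= label_time]
        # max over (event_time, value) tuples = latest value before the label
        out.append(max(candidates)[1] if candidates else None)
    return out
```

Note the `None` case: if the feature did not exist yet at label time, the training row must see a null, not the current value.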

Training pipeline & model registry

Training is not a notebook; it is a DAG executed by Airflow, Kubeflow, Metaflow, or Flyte. The DAG must be idempotent, parameterised, and reproducible: same inputs + same code + same seeds ⇒ the same model, bit-for-bit up to floating-point tolerance. Outputs go to a registry (MLflow, Weights & Biases, SageMaker Model Registry, Vertex AI Model Registry) keyed by (model_name, version) and carrying lineage metadata: training dataset hash, feature schema version, code commit SHA, hyperparameters, offline metrics, and human sign-off.

Registry state machine: trained → staging → production → archived. Promotion from staging to production is gated by the CI/CD tests below, never by a human pressing a green button alone.
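A minimal sketch of that state machine with its CI gate; the dict-shaped entry and function names are illustrative, not any registry's real API:

```python
# Legal transitions for the registry state machine described above.
TRANSITIONS = {
    "trained":    {"staging"},
    "staging":    {"production", "archived"},
    "production": {"archived"},
    "archived":   set(),
}

def promote(entry, target, ci_passed=False):
    """entry is a registry record, e.g. {'state': 'trained', 'version': 3}.
    Raises on any move the state machine forbids, and refuses the
    staging -> production step unless the CI/CD gate has passed."""
    if target not in TRANSITIONS[entry["state"]]:
        raise ValueError(f"illegal transition {entry['state']} -> {target}")
    if target == "production" and not ci_passed:
        raise ValueError("promotion to production is gated on CI tests")
    entry["state"] = target
    return entry
```

Encoding the gate in code rather than process is the point: a human can still sign off, but the green button alone cannot ship.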

CI/CD for ML (tests that matter)

Traditional CI tests code. ML CI tests code + data + model. The five tests that catch 90% of regressions:

  1. Schema test: new feature batch matches the declared schema (types, ranges, nullability).
  2. Data drift test: PSI < 0.2 vs training distribution on every top-20 feature.
  3. Offline metric test: new model's AUC / NDCG / calibration is no worse than previous by more than a configured epsilon (e.g. 0.3%).
  4. Slice test: no protected subgroup (geo, language, new-user cohort) regresses by >1%.
  5. Behavioural test: a curated set of ~200 "golden" inputs produces expected outputs; equivalent to unit tests, but on model outputs.
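Test 2 is the one most worth being able to write from memory. A sketch of PSI with quantile bins; the 10-bin default and the 1e-6 floor that avoids log(0) are common conventions, not standards:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index of one feature: 'expected' is the training
    sample, 'actual' the new batch. Rule of thumb: < 0.1 no shift,
    0.1-0.2 moderate, > 0.2 investigate -- matching the gate in the text."""
    exp_sorted = sorted(expected)
    # Quantile-based edges so each bin holds ~1/bins of the expected mass.
    edges = [exp_sorted[int(len(exp_sorted) * i / bins)] for i in range(1, bins)]

    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            counts[sum(x >= e for e in edges)] += 1  # bin index of x
        return [max(c / len(sample), 1e-6) for c in counts]  # floor avoids log(0)

    p, q = fractions(expected), fractions(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

Quantile edges matter: equal-width bins on a skewed feature put most of the mass in one bin and hide real drift.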

Concrete numbers

A healthy ML platform at a mid-sized company ships 5–20 model versions per week to staging, 1–3 to production. Median time from commit to canary = 2 hours; full rollout with 48 h observation window = 2–3 days.

Serving: batch vs online vs streaming

Chip Huyen Ch.7 makes three shapes explicit; know when to pick each:

  • Batch prediction: compute daily/hourly, store into a KV; serve by lookup. Used when inputs change slowly (user LTV, lead score). p99 read < 10 ms. Drawback: stale predictions for new users.
  • Online (request-time): compute on the hot path. Used when inputs change fast (search ranking, ad CTR, LLM routing). p99 = 20–200 ms. Drawback: tail latency and cost risk.
  • Streaming: pre-compute on events as they arrive (Flink, Spark Streaming); store into online store. Used when you need "online-like" freshness but the model itself is too heavy for request-time (e.g. user embedding update every session).

Hybrid is the norm: candidate generation batch-precomputed, ranking online, feature updates streaming.
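The hybrid shape collapses to one serve path. A sketch where `batch_scores` and `online_model` are illustrative stand-ins for the precomputed KV store and the request-time model:

```python
def serve(user_id, features, batch_scores, online_model):
    """Hybrid serving: prefer the batch-precomputed score, fall back to
    request-time compute on a miss (e.g. a user created after last
    night's batch run -- the staleness drawback noted above).
    Returns (score, path) so the path mix can be monitored."""
    score = batch_scores.get(user_id)
    if score is not None:
        return score, "batch"               # cheap KV read, < 10 ms
    return online_model(features), "online"  # hot path, 20-200 ms budget
```

Logging which path served each request is worth the extra return value: a rising "online" fraction is an early warning that the batch job is late or its coverage is shrinking.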

Shadow, canary, A/B and interleaving

Chip Huyen Ch.9 catalogues the four online evaluation strategies; they are a hierarchy of risk/information trade-offs:

flowchart LR
  new[New model v2] --> shadow[1. Shadow<br/>0% user impact<br/>log only]
  shadow --> canary[2. Canary<br/>1-5% traffic<br/>abort on KPI regression]
  canary --> ab[3. A/B test<br/>50/50 · 1-2 weeks<br/>powered for MDE]
  ab --> rollout[4. Full rollout<br/>100%]
  ab -.-> interleave[Interleaving<br/>same-user pairs<br/>10x sample efficiency]

  1. Shadow: send the same request to v1 and v2, serve v1, log both. Catches crashes, latency regressions and gross output drift before any user sees v2.
  2. Canary: route 1–5% of traffic to v2 with automatic abort on pre-declared KPI guardrails (error rate, latency p99, business metric).
  3. A/B: 50/50 (or power-calculated) split for 1–2 weeks. Requires an MDE (minimum detectable effect) calculation; a common anti-pattern is reading the test early and declaring victory on noise.
  4. Interleaving (for ranked lists): serve a list that interleaves v1 and v2 results to the same user; about 10× more sample-efficient than user-split A/B for ranking problems (Chapelle & Li).
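The canary's automatic abort (step 2) reduces to a guardrail check run every evaluation window. The 1.5×/1.2×/0.98× multipliers below are illustrative pre-declared thresholds, not universal constants:

```python
def canary_guardrails(control, canary):
    """Return the breached guardrails for a canary at 1-5% traffic.
    Both arguments are metric dicts with error_rate, p99_ms and a
    business KPI (ctr here); thresholds must be declared before launch,
    never tuned after looking at the data."""
    breaches = []
    if canary["error_rate"] > control["error_rate"] * 1.5:
        breaches.append("error_rate")
    if canary["p99_ms"] > control["p99_ms"] * 1.2:
        breaches.append("p99_ms")
    if canary["ctr"] < control["ctr"] * 0.98:   # business-metric floor
        breaches.append("ctr")
    return breaches   # non-empty => abort, route 100% back to v1
```

The comparison is against the concurrent control, not a historical baseline, so a site-wide traffic anomaly hits both arms and does not trigger a false abort.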

Anti-pattern: "we deployed straight to 100%"

Skipping shadow + canary and going directly to A/B is the most common outage generator. A new embedding version with a silent OOV bug can null-out 0.5% of recommendations and cost days of debugging. Always spend the 48 h in shadow, regardless of urgency.

OpenAI vs Anthropic framing

OpenAI emphasis

OpenAI interviews treat the ML platform as the foundation for fast iteration: expect probing on feature-store freshness SLOs, online feature serving latency, and the CI/CD tests that keep a model-versions-per-week cadence safe. Bias toward online serving examples (search, ranking, ad).

Anthropic emphasis

Anthropic asks the same questions but from a safety-first angle: shadow mode is non-negotiable, rollback must be one click, every promotion must carry a human sign-off field in the registry, and offline eval suites must include harm-slice tests alongside accuracy. The registry is a compliance artefact, not just a version list.

Either way, the winning interview narrative is: define the lifecycle, pick the deploy shape justified by input velocity, bound the risk with shadow→canary→A/B, and wire monitoring to the registry so you can roll back in <5 min.
