Source cross-reference

Chip Huyen Ch.8 "Data Distribution Shifts and Monitoring" and Ch.9 "Continual Learning and Test in Production". The terminology (covariate / label / concept drift) is from the Chapelle 2009 tutorial; Chip Huyen is the practitioner synthesis you should cite in interviews.

The three drifts

A model learns a joint distribution P(X, Y). Production can violate the training assumption in three distinct ways, and the response for each differs:

  • Data (covariate) drift: P(X) changes while P(Y|X) stays stable. Example: traffic mix shifts to mobile after an app launch; features like screen-size distribution move. Fix: re-weight training data or retrain with fresh features.
  • Label drift: P(Y) changes. Example: fraud rate triples during a holiday. Often co-occurs with data drift. Fix: recalibrate, adjust thresholds, or retrain.
  • Concept drift: P(Y|X) changes — the underlying relationship shifts. Example: a spam classifier trained pre-LLM stops catching LLM-generated phishing; the same features now mean something different. Fix: retrain is mandatory; re-weighting cannot recover concept drift.
flowchart TB
  drift{"Which part of P(X,Y) changed?"}
  drift -- "P(X) only" --> d1["Covariate drift -> reweight / retrain"]
  drift -- "P(Y) only" --> d2["Label drift -> recalibrate / retrain"]
  drift -- "P(Y|X)" --> d3["Concept drift -> retrain mandatory"]

Statistical tests (PSI, KS, KL)

Knowing the three drifts matters only if you can measure them. Four workhorse tests:

  1. PSI (Population Stability Index) for binned univariate distributions: PSI = Σ (p_i − q_i) · ln(p_i / q_i). Rules of thumb: <0.1 stable, 0.1–0.25 moderate shift, >0.25 significant shift. PSI is the standard in banking; it is cheap, interpretable, and catches the common case.
  2. Kolmogorov-Smirnov (KS) test: non-parametric test on continuous distributions, uses max CDF gap. Good for small samples and heavy-tailed features. Report p-value with a Bonferroni correction if you are testing many features.
  3. KL divergence: KL(p||q) = Σ p log(p/q). Asymmetric; use Jensen-Shannon if you want symmetric. Strong sensitivity to tail differences; most useful when you already trust your binning.
  4. Chi-squared for categorical features.
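A minimal sketch of the PSI formula above in numpy. Quantile binning from the reference sample and the epsilon floor for empty bins are common implementation choices, not part of the formula itself:

```python
import numpy as np

def psi(expected, actual, n_bins=10, eps=1e-4):
    """PSI = sum((p_i - q_i) * ln(p_i / q_i)) over shared bins.

    Bins come from the reference sample's quantiles so each reference
    bin holds roughly equal mass; eps floors empty bins so the log stays
    finite."""
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf        # cover the whole real line
    p = np.histogram(expected, bins=edges)[0] / len(expected)
    q = np.histogram(actual, bins=edges)[0] / len(actual)
    p, q = np.clip(p, eps, None), np.clip(q, eps, None)
    return float(np.sum((p - q) * np.log(p / q)))

rng = np.random.default_rng(0)
ref = rng.normal(0, 1, 10_000)
print(psi(ref, rng.normal(0, 1, 10_000)))  # stable: well under 0.1
print(psi(ref, rng.normal(1, 1, 10_000)))  # 1-sigma shift: well over 0.25
```

The rule-of-thumb thresholds (<0.1, 0.1–0.25, >0.25) map directly onto these two calls.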

Concrete numbers

Monitor at two cadences: (a) hourly PSI on the top-20 features against a 7-day rolling reference; (b) daily KS on each continuous feature against the training distribution. Alerting rule: fire only after 3 consecutive intervals over threshold, targeting a false-positive rate under 2 alerts/week per model.
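The daily KS cadence with a Bonferroni correction can be sketched in pure numpy. The two-sample statistic is the max ECDF gap; the asymptotic critical value `c(a) * sqrt((n+m)/(n*m))` with `c(a) = sqrt(-ln(a/2)/2)` is the standard large-sample approximation, and the function names are mine:

```python
import numpy as np

def ks_stat(x, y):
    """Two-sample KS statistic: max gap between the two empirical CDFs."""
    both = np.concatenate([x, y])
    cdf_x = np.searchsorted(np.sort(x), both, side="right") / len(x)
    cdf_y = np.searchsorted(np.sort(y), both, side="right") / len(y)
    return np.max(np.abs(cdf_x - cdf_y))

def ks_flagged(train, live, alpha=0.05):
    """Flag features whose KS statistic exceeds the asymptotic critical
    value at a Bonferroni-corrected level (alpha / number of features).
    `train` and `live` map feature name -> 1-D sample array."""
    a = alpha / len(train)                   # Bonferroni: alpha / m tests
    flagged = {}
    for name, ref in train.items():
        x, y = np.asarray(ref), np.asarray(live[name])
        n, m = len(x), len(y)
        crit = np.sqrt(-np.log(a / 2) / 2) * np.sqrt((n + m) / (n * m))
        d = ks_stat(x, y)
        if d > crit:
            flagged[name] = float(d)
    return flagged

rng = np.random.default_rng(7)
train = {"age": rng.normal(40, 10, 4_000), "score": rng.uniform(0, 1, 4_000)}
live = {"age": rng.normal(48, 10, 4_000), "score": rng.uniform(0, 1, 4_000)}
print(ks_flagged(train, live))  # "age" shifted by ~0.8 sigma -> flagged
```

In production you would use `scipy.stats.ks_2samp` and compare p-values instead; the correction logic is the same.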

Monitoring stack & SLOs

A full monitoring stack has four layers:

  1. Feature monitoring: PSI/KS/chi-squared on every input feature, plus missingness rate and unique-value drift.
  2. Prediction monitoring: output distribution drift, score calibration (Brier, ECE), and "prediction entropy" (does the model collapse to one class?).
  3. Outcome monitoring: online business metric (CTR, conversion, revenue) with confidence intervals and per-slice breakdowns.
  4. System monitoring: latency p50/p95/p99, error rate, GPU utilisation, feature-store lookup failures. Shared with non-ML services.
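Layer 2's "prediction entropy" check is only a few lines; this sketch (function name mine) makes the collapse signal concrete:

```python
import numpy as np

def mean_prediction_entropy(probs):
    """Mean Shannon entropy (nats) over a batch of predicted class
    distributions. A trend toward 0 means the model is collapsing onto
    a single class even if accuracy labels have not arrived yet."""
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return float(np.mean(-np.sum(p * np.log(p), axis=1)))

healthy = [[0.5, 0.5], [0.6, 0.4]]      # spread predictions
collapsed = [[1.0, 0.0], [1.0, 0.0]]    # everything -> class 0
print(mean_prediction_entropy(healthy))    # near ln(2) ~= 0.69
print(mean_prediction_entropy(collapsed))  # near 0
```

Alerting on a sustained drop in this number is a labels-free early-warning signal, which matters given delayed feedback.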

Off-the-shelf tools: Arize AI and WhyLabs are SaaS with built-in drift dashboards and embeddings monitoring; Evidently is the dominant open-source library — it plugs into Airflow or a notebook and emits HTML/JSON drift reports. For custom stacks, Prometheus + Grafana for numerics plus Great Expectations for schema is a minimum viable combo. SageMaker Model Monitor and Vertex AI Model Monitoring are the default on their respective clouds.

Retraining triggers

Chip Huyen Ch.9 lists four trigger policies; interviewers love it when you enumerate the trade-offs:

  • Scheduled: every N days. Simple, predictable, wastes compute when data is stable, too slow for fast-moving domains.
  • Data-based: retrain when drift metric crosses threshold for K consecutive windows. Sensitive to metric choice.
  • Performance-based: retrain when online metric drops by ΔX vs rolling baseline. Requires labels — expensive or delayed in many domains.
  • Continual: micro-batches every few minutes, online learning. Only justified where domain shifts within hours (news feed, fraud).

A pragmatic default: weekly scheduled + drift-triggered early retrain + performance-triggered emergency retrain. Always pair with the shadow/canary discipline from the lifecycle page — a retrain is a new model and must go through the full rollout ladder.
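The pragmatic default above can be written down as a small decision rule. Everything here (class name, thresholds, metric names) is illustrative, not prescriptive:

```python
from dataclasses import dataclass

@dataclass
class RetrainPolicy:
    """Weekly schedule + drift-triggered early retrain +
    performance-triggered emergency retrain."""
    schedule_days: int = 7     # scheduled: every N days
    psi_threshold: float = 0.25
    k_windows: int = 3         # data-based: K consecutive windows over threshold
    perf_drop: float = 0.05    # performance-based: relative drop vs baseline

    def decide(self, days_since_retrain, psi_history, online_metric, baseline):
        # Emergency beats everything: online metric fell past the floor.
        if online_metric < baseline * (1 - self.perf_drop):
            return "emergency_retrain"
        # Early retrain on sustained drift, not a single noisy alert.
        recent = psi_history[-self.k_windows:]
        if len(recent) == self.k_windows and min(recent) > self.psi_threshold:
            return "early_retrain"
        # Otherwise fall back to the weekly schedule.
        if days_since_retrain >= self.schedule_days:
            return "scheduled_retrain"
        return "no_action"

policy = RetrainPolicy()
print(policy.decide(2, [0.30, 0.31, 0.40], online_metric=0.90, baseline=0.91))
# -> early_retrain: three consecutive windows over the PSI threshold
```

Whatever this returns, the output is a candidate model that still has to climb the shadow/canary rollout ladder.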

LLM-specific drift

Neither of Chip Huyen's books fully covers what OpenAI and Anthropic interviewers now ask. Three LLM-specific drifts:

  • Prompt drift: your prompt template includes a phrase a provider fine-tune begins handling differently (e.g. behaviour on "system: you are a helpful assistant" changes between GPT-4 minor versions). Detect by running a golden-prompt eval suite nightly and alerting on response distribution shift.
  • Output drift: same input, different outputs, due to upstream model updates or temperature creep. Measure via output embedding similarity (sentence-transformers) against a reference fixture set, alert if cosine drops below 0.85 on >5% of fixtures.
  • User-population drift: the kinds of questions users ask shift. Detect by clustering embeddings of live prompts weekly and tracking cluster mass. Rising new clusters = new use case = new evaluation needs.
flowchart LR
  prod[Live LLM traffic] --> emb[Embed prompts]
  emb --> cluster[Weekly HDBSCAN clustering]
  cluster --> compare{Compare vs last week}
  compare -- new cluster --> alert["Alert: new use case"]
  compare -- cluster mass shift --> eval[Trigger re-eval + maybe retrain]
  prod --> golden[Run golden prompt suite]
  golden --> sim[Output similarity vs fixtures]
  sim --> alert
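The output-drift rule (cosine < 0.85 on > 5% of fixtures) reduces to a few lines once you have embeddings. Producing those embeddings (e.g. with sentence-transformers) is out of scope here, and the function name is mine:

```python
import numpy as np

def output_drift(ref_embs, new_embs, cos_floor=0.85, max_frac=0.05):
    """True when the share of fixtures whose new-output embedding has
    cosine similarity below cos_floor to the reference output exceeds
    max_frac. Rows are per-fixture embedding vectors."""
    a = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
    b = new_embs / np.linalg.norm(new_embs, axis=1, keepdims=True)
    cos = np.sum(a * b, axis=1)            # row-wise cosine similarity
    return bool(np.mean(cos < cos_floor) > max_frac)

rng = np.random.default_rng(0)
fixtures = rng.normal(size=(40, 16))       # stand-in for real embeddings
print(output_drift(fixtures, fixtures.copy()))  # False: nothing moved
```

Run this nightly against the golden-prompt fixture set and page only on the combined condition, not on any single fixture's similarity.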

What trips up candidates

Anti-patterns

  • "We monitor accuracy" — in most domains labels arrive days or weeks late (delayed feedback). You cannot rely on accuracy as the primary alert signal; you need feature and prediction monitoring that fires before labels land.
  • "PSI on everything" — with 10,000 features, you will get thousands of alerts daily. Rank features by training-time importance and monitor only the top 20 aggressively.
  • "Retrain on every drift alert" — creates a feedback loop that amplifies noise. Gate retrains on sustained drift + offline improvement test.

OpenAI framing

Expect questions like "how would you know your routing model degraded?" — answer with prediction entropy and per-tenant slice metrics, since a single customer's workload can mask aggregate drift.

Anthropic framing

Drift monitoring becomes a safety signal: a shift in refusal rate or jailbreak-attempt-per-1k-prompts is as important as accuracy. Expect the discussion to include "what would you escalate to the on-call safety reviewer?" — answer with clear thresholds and a human-in-the-loop path.
