Source cross-reference

Chip Huyen Ch.8 "Data Distribution Shifts and Monitoring" and Ch.9 "Continual Learning and Test in Production". The terminology (covariate / label / concept drift) is from the Chapelle 2009 tutorial; Chip Huyen is the practitioner synthesis you should cite in interviews.

The three drifts

A model learns a joint distribution P(X, Y). Production can violate the training assumption in three distinct ways, and the response for each differs:

  • Data (covariate) drift: P(X) changes while P(Y|X) stays stable. Example: traffic mix shifts to mobile after an app launch; features like screen-size distribution move. Fix: re-weight training data or retrain with fresh features.
  • Label drift: P(Y) changes. Example: fraud rate triples during a holiday. Often co-occurs with data drift. Fix: recalibrate, adjust thresholds, or retrain.
  • Concept drift: P(Y|X) changes — the underlying relationship shifts. Example: a spam classifier trained pre-LLM stops catching LLM-generated phishing; the same features now mean something different. Fix: retrain is mandatory; re-weighting cannot recover concept drift.
flowchart TB
  drift{"Which part of P(X,Y) changed?"}
  drift -- "P(X) only" --> d1["Covariate drift -> reweight / retrain"]
  drift -- "P(Y) only" --> d2["Label drift -> recalibrate / retrain"]
  drift -- "P(Y|X)" --> d3["Concept drift -> retrain mandatory"]

Statistical tests (PSI, KS, KL)

Knowing the three drifts matters only if you can measure them. Four workhorse tests:

  1. PSI (Population Stability Index) for binned univariate distributions: PSI = Σ (p_i − q_i) · ln(p_i / q_i). Rules of thumb: <0.1 stable, 0.1–0.25 moderate shift, >0.25 significant shift. PSI is the standard in banking; it is cheap, interpretable, and catches the common case.
  2. Kolmogorov-Smirnov (KS) test: non-parametric test on continuous distributions, uses max CDF gap. Good for small samples and heavy-tailed features. Report p-value with a Bonferroni correction if you are testing many features.
  3. KL divergence: KL(p||q) = Σ p log(p/q). Asymmetric; use Jensen-Shannon if you want symmetric. Strong sensitivity to tail differences; most useful when you already trust your binning.
  4. Chi-squared for categorical features.
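A minimal sketch of the PSI formula above in numpy. Quantile binning from the reference sample and the epsilon floor for empty bins are common implementation choices, not part of the formula itself:

```python
import numpy as np

def psi(expected, actual, n_bins=10, eps=1e-4):
    """PSI = sum((p_i - q_i) * ln(p_i / q_i)) over shared bins.

    Bins come from the reference sample's quantiles so each reference
    bin holds roughly equal mass; eps floors empty bins so the log stays
    finite."""
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf        # cover the whole real line
    p = np.histogram(expected, bins=edges)[0] / len(expected)
    q = np.histogram(actual, bins=edges)[0] / len(actual)
    p, q = np.clip(p, eps, None), np.clip(q, eps, None)
    return float(np.sum((p - q) * np.log(p / q)))

rng = np.random.default_rng(0)
ref = rng.normal(0, 1, 10_000)
print(psi(ref, rng.normal(0, 1, 10_000)))  # stable: well under 0.1
print(psi(ref, rng.normal(1, 1, 10_000)))  # 1-sigma shift: well over 0.25
```

The rule-of-thumb thresholds (<0.1, 0.1–0.25, >0.25) map directly onto these two calls.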

Concrete numbers

Monitor at two cadences: (a) hourly PSI on the top-20 features against a 7-day rolling reference; (b) daily KS on each continuous feature against the training distribution. Alerting rule: fire only after 3 consecutive intervals over threshold, targeting a false-positive rate under 2 alerts/week per model.
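The daily KS cadence with a Bonferroni correction can be sketched in pure numpy. The two-sample statistic is the max ECDF gap; the asymptotic critical value `c(a) * sqrt((n+m)/(n*m))` with `c(a) = sqrt(-ln(a/2)/2)` is the standard large-sample approximation, and the function names are mine:

```python
import numpy as np

def ks_stat(x, y):
    """Two-sample KS statistic: max gap between the two empirical CDFs."""
    both = np.concatenate([x, y])
    cdf_x = np.searchsorted(np.sort(x), both, side="right") / len(x)
    cdf_y = np.searchsorted(np.sort(y), both, side="right") / len(y)
    return np.max(np.abs(cdf_x - cdf_y))

def ks_flagged(train, live, alpha=0.05):
    """Flag features whose KS statistic exceeds the asymptotic critical
    value at a Bonferroni-corrected level (alpha / number of features).
    `train` and `live` map feature name -> 1-D sample array."""
    a = alpha / len(train)                   # Bonferroni: alpha / m tests
    flagged = {}
    for name, ref in train.items():
        x, y = np.asarray(ref), np.asarray(live[name])
        n, m = len(x), len(y)
        crit = np.sqrt(-np.log(a / 2) / 2) * np.sqrt((n + m) / (n * m))
        d = ks_stat(x, y)
        if d > crit:
            flagged[name] = float(d)
    return flagged

rng = np.random.default_rng(7)
train = {"age": rng.normal(40, 10, 4_000), "score": rng.uniform(0, 1, 4_000)}
live = {"age": rng.normal(48, 10, 4_000), "score": rng.uniform(0, 1, 4_000)}
print(ks_flagged(train, live))  # "age" shifted by ~0.8 sigma -> flagged
```

In production you would use `scipy.stats.ks_2samp` and compare p-values instead; the correction logic is the same.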

Monitoring stack & SLOs

A full monitoring stack has four layers:

  1. Feature monitoring: PSI/KS/chi-squared on every input feature, plus missingness rate and unique-value drift.
  2. Prediction monitoring: output distribution drift, score calibration (Brier, ECE), and "prediction entropy" (does the model collapse to one class?).
  3. Outcome monitoring: online business metric (CTR, conversion, revenue) with confidence intervals and per-slice breakdowns.
  4. System monitoring: latency p50/p95/p99, error rate, GPU utilisation, feature-store lookup failures. Shared with non-ML services.
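Layer 2's "prediction entropy" check is only a few lines; this sketch (function name mine) makes the collapse signal concrete:

```python
import numpy as np

def mean_prediction_entropy(probs):
    """Mean Shannon entropy (nats) over a batch of predicted class
    distributions. A trend toward 0 means the model is collapsing onto
    a single class even if accuracy labels have not arrived yet."""
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return float(np.mean(-np.sum(p * np.log(p), axis=1)))

healthy = [[0.5, 0.5], [0.6, 0.4]]      # spread predictions
collapsed = [[1.0, 0.0], [1.0, 0.0]]    # everything -> class 0
print(mean_prediction_entropy(healthy))    # near ln(2) ~= 0.69
print(mean_prediction_entropy(collapsed))  # near 0
```

Alerting on a sustained drop in this number is a labels-free early-warning signal, which matters given delayed feedback.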

Off-the-shelf tools: Arize AI and WhyLabs are SaaS with built-in drift dashboards and embeddings monitoring; Evidently is the dominant open-source library — it plugs into Airflow or a notebook and emits HTML/JSON drift reports. For custom stacks, Prometheus + Grafana for numerics plus Great Expectations for schema is a minimum viable combo. SageMaker Model Monitor and Vertex AI Model Monitoring are the default on their respective clouds.

Retraining triggers

Chip Huyen Ch.9 lists four trigger policies; interviewers love it when you enumerate the trade-offs:

  • Scheduled: every N days. Simple, predictable, wastes compute when data is stable, too slow for fast-moving domains.
  • Data-based: retrain when drift metric crosses threshold for K consecutive windows. Sensitive to metric choice.
  • Performance-based: retrain when online metric drops by ΔX vs rolling baseline. Requires labels — expensive or delayed in many domains.
  • Continual: micro-batches every few minutes, online learning. Only justified where domain shifts within hours (news feed, fraud).

A pragmatic default: weekly scheduled + drift-triggered early retrain + performance-triggered emergency retrain. Always pair with the shadow/canary discipline from the lifecycle page — a retrain is a new model and must go through the full rollout ladder.
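The pragmatic default above can be written down as a small decision rule. Everything here (class name, thresholds, metric names) is illustrative, not prescriptive:

```python
from dataclasses import dataclass

@dataclass
class RetrainPolicy:
    """Weekly schedule + drift-triggered early retrain +
    performance-triggered emergency retrain."""
    schedule_days: int = 7     # scheduled: every N days
    psi_threshold: float = 0.25
    k_windows: int = 3         # data-based: K consecutive windows over threshold
    perf_drop: float = 0.05    # performance-based: relative drop vs baseline

    def decide(self, days_since_retrain, psi_history, online_metric, baseline):
        # Emergency beats everything: online metric fell past the floor.
        if online_metric < baseline * (1 - self.perf_drop):
            return "emergency_retrain"
        # Early retrain on sustained drift, not a single noisy alert.
        recent = psi_history[-self.k_windows:]
        if len(recent) == self.k_windows and min(recent) > self.psi_threshold:
            return "early_retrain"
        # Otherwise fall back to the weekly schedule.
        if days_since_retrain >= self.schedule_days:
            return "scheduled_retrain"
        return "no_action"

policy = RetrainPolicy()
print(policy.decide(2, [0.30, 0.31, 0.40], online_metric=0.90, baseline=0.91))
# -> early_retrain: three consecutive windows over the PSI threshold
```

Whatever this returns, the output is a candidate model that still has to climb the shadow/canary rollout ladder.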

LLM-specific drift

Neither of Chip Huyen's books fully covers what OpenAI and Anthropic interviewers now ask. Three LLM-specific drifts:

  • Prompt drift: your prompt template includes a phrase a provider fine-tune begins handling differently (e.g. behaviour on "system: you are a helpful assistant" changes between GPT-4 minor versions). Detect by running a golden-prompt eval suite nightly and alerting on response distribution shift.
  • Output drift: same input, different outputs, due to upstream model updates or temperature creep. Measure via output embedding similarity (sentence-transformers) against a reference fixture set, alert if cosine drops below 0.85 on >5% of fixtures.
  • User-population drift: the kinds of questions users ask shift. Detect by clustering embeddings of live prompts weekly and tracking cluster mass. Rising new clusters = new use case = new evaluation needs.
flowchart LR
  prod[Live LLM traffic] --> emb[Embed prompts]
  emb --> cluster[Weekly HDBSCAN clustering]
  cluster --> compare{Compare vs last week}
  compare -- new cluster --> alert["Alert: new use case"]
  compare -- cluster mass shift --> eval[Trigger re-eval + maybe retrain]
  prod --> golden[Run golden prompt suite]
  golden --> sim[Output similarity vs fixtures]
  sim --> alert
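The output-drift rule (cosine < 0.85 on > 5% of fixtures) reduces to a few lines once you have embeddings. Producing those embeddings (e.g. with sentence-transformers) is out of scope here, and the function name is mine:

```python
import numpy as np

def output_drift(ref_embs, new_embs, cos_floor=0.85, max_frac=0.05):
    """True when the share of fixtures whose new-output embedding has
    cosine similarity below cos_floor to the reference output exceeds
    max_frac. Rows are per-fixture embedding vectors."""
    a = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
    b = new_embs / np.linalg.norm(new_embs, axis=1, keepdims=True)
    cos = np.sum(a * b, axis=1)            # row-wise cosine similarity
    return bool(np.mean(cos < cos_floor) > max_frac)

rng = np.random.default_rng(0)
fixtures = rng.normal(size=(40, 16))       # stand-in for real embeddings
print(output_drift(fixtures, fixtures.copy()))  # False: nothing moved
```

Run this nightly against the golden-prompt fixture set and page only on the combined condition, not on any single fixture's similarity.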

What trips up candidates

Anti-patterns

  • "We monitor accuracy" — in most domains labels arrive days or weeks late (delayed feedback). You cannot rely on accuracy as the primary alert signal; you need feature and prediction monitoring that fires before labels land.
  • "PSI on everything" — with 10,000 features, you will get thousands of alerts daily. Rank features by training-time importance and monitor only the top 20 aggressively.
  • "Retrain on every drift alert" — creates a feedback loop that amplifies noise. Gate retrains on sustained drift + offline improvement test.

OpenAI framing

Expect questions like "how would you know your routing model degraded?" — answer with prediction entropy and per-tenant slice metrics, since a single customer's workload can mask aggregate drift.

Anthropic framing

Drift monitoring becomes a safety signal: a shift in refusal rate or jailbreak-attempt-per-1k-prompts is as important as accuracy. Expect the discussion to include "what would you escalate to the on-call safety reviewer?" — answer with clear thresholds and a human-in-the-loop path.
