Source cross-reference
Chip Huyen Ch.8 "Data Distribution Shifts and Monitoring" and Ch.9 "Continual Learning and Test in Production". The terminology (covariate / label / concept drift) is from the Chapelle 2009 tutorial; Chip Huyen is the practitioner synthesis you should cite in interviews.
The three drifts
A model learns a joint distribution P(X, Y). Production can violate the training assumption in three distinct ways, and the response for each differs:
- Data (covariate) drift: P(X) changes while P(Y|X) stays stable. Example: traffic mix shifts to mobile after an app launch; features like the screen-size distribution move. Fix: re-weight training data or retrain with fresh features.
- Label drift: P(Y) changes. Example: the fraud rate triples during a holiday. Often co-occurs with data drift. Fix: recalibrate, adjust thresholds, or retrain.
- Concept drift: P(Y|X) changes: the underlying relationship itself shifts. Example: a spam classifier trained pre-LLM stops catching LLM-generated phishing; the same features now mean something different. Fix: retraining is mandatory; re-weighting cannot recover concept drift.
```mermaid
flowchart TB
    drift{"Which part of P(X,Y) changed?"}
    drift -- "P(X) only" --> d1["Covariate drift: reweight / retrain"]
    drift -- "P(Y) only" --> d2["Label drift: recalibrate / retrain"]
    drift -- "P(Y|X)" --> d3["Concept drift: retrain mandatory"]
```
Statistical tests (PSI, KS, KL)
Knowing the three drifts matters only if you can measure them. Four workhorse tests:
- PSI (Population Stability Index), for binned univariate distributions: PSI = Σ (p_i − q_i) · ln(p_i / q_i). Rules of thumb: <0.1 stable, 0.1–0.25 moderate shift, >0.25 significant shift. PSI is the standard in banking; it is cheap, interpretable, and catches the common case.
- Kolmogorov-Smirnov (KS) test: non-parametric test on continuous distributions, based on the maximum CDF gap. Good for small samples and heavy-tailed features. Report the p-value with a Bonferroni correction if you are testing many features.
- KL divergence: KL(p||q) = Σ p log(p/q). Asymmetric; use Jensen-Shannon divergence if you need symmetry. Highly sensitive to tail differences; most useful when you already trust your binning.
- Chi-squared test for categorical features.
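The PSI formula above fits in a few lines of NumPy. A minimal sketch (the quantile-binning choice and the `eps` smoothing for empty bins are implementation assumptions, not part of the formula itself):

```python
import numpy as np

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a reference and a live sample.

    Bin edges come from the reference sample's quantiles, so each reference
    bin holds roughly 1/n_bins of the mass.
    """
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range live values
    p = np.histogram(expected, bins=edges)[0] / len(expected)
    q = np.histogram(actual, bins=edges)[0] / len(actual)
    p, q = p + eps, q + eps  # smoothing: avoid log(0) on empty bins
    return float(np.sum((p - q) * np.log(p / q)))
```

For the KS test, `scipy.stats.ks_2samp` gives the statistic and p-value directly; the PSI thresholds in the text (0.1 / 0.25) apply to the value this function returns.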
Concrete numbers
Monitor at two cadences: (a) hourly PSI on top-20 features with a 7-day rolling reference; (b) daily KS on each continuous feature with training distribution as reference. Alert SLO: 3 consecutive intervals over threshold. Typical false-positive rate target < 2 alerts/week per model.
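The "3 consecutive intervals over threshold" rule is a small stateful gate. A sketch, assuming hourly PSI values are fed in one at a time (class and parameter names are illustrative):

```python
class DriftAlertGate:
    """Fire only after k consecutive over-threshold windows.

    Suppresses one-off spikes; defaults mirror the SLO in the text
    (PSI > 0.25 for 3 consecutive intervals).
    """
    def __init__(self, threshold=0.25, k=3):
        self.threshold = threshold
        self.k = k
        self.streak = 0

    def observe(self, drift_value):
        # Reset the streak on any in-bounds window.
        self.streak = self.streak + 1 if drift_value > self.threshold else 0
        return self.streak >= self.k
```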
Monitoring stack & SLOs
A full monitoring stack has four layers:
- Feature monitoring: PSI/KS/chi-squared on every input feature, plus missingness rate and unique-value drift.
- Prediction monitoring: output distribution drift, score calibration (Brier, ECE), and "prediction entropy" (does the model collapse to one class?).
- Outcome monitoring: online business metric (CTR, conversion, revenue) with confidence intervals and per-slice breakdowns.
- System monitoring: latency p50/p95/p99, error rate, GPU utilisation, feature-store lookup failures. Shared with non-ML services.
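Of the prediction-monitoring metrics above, ECE is the one candidates most often hand-wave. A minimal sketch of the common equal-width-bin variant (binning scheme is an assumption; several ECE variants exist):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: mean |accuracy - confidence| over equal-width confidence bins,
    weighted by the fraction of predictions in each bin."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    bins = np.clip((probs * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            confidence = probs[mask].mean()
            accuracy = labels[mask].mean()
            ece += mask.mean() * abs(accuracy - confidence)
    return float(ece)
```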
Off-the-shelf tools: Arize AI and WhyLabs are SaaS with built-in drift dashboards and embeddings monitoring; Evidently is the dominant open-source library — it plugs into Airflow or a notebook and emits HTML/JSON drift reports. For custom stacks, Prometheus + Grafana for numerics plus Great Expectations for schema is a minimum viable combo. SageMaker Model Monitor and Vertex AI Model Monitoring are the default on their respective clouds.
Retraining triggers
Chip Huyen Ch.9 lists four trigger policies; interviews love when you enumerate trade-offs:
- Scheduled: every N days. Simple, predictable, wastes compute when data is stable, too slow for fast-moving domains.
- Data-based: retrain when drift metric crosses threshold for K consecutive windows. Sensitive to metric choice.
- Performance-based: retrain when online metric drops by ΔX vs rolling baseline. Requires labels — expensive or delayed in many domains.
- Continual: micro-batches every few minutes, online learning. Only justified where domain shifts within hours (news feed, fraud).
A pragmatic default: weekly scheduled + drift-triggered early retrain + performance-triggered emergency retrain. Always pair with the shadow/canary discipline from the lifecycle page — a retrain is a new model and must go through the full rollout ladder.
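The pragmatic default above is a priority-ordered decision: emergency beats early beats scheduled. A sketch, with all threshold values as illustrative assumptions:

```python
from datetime import timedelta

def should_retrain(last_trained, now, drift_sustained, perf_drop_pct,
                   schedule=timedelta(days=7), perf_emergency_pct=5.0):
    """Combined trigger policy: performance-based emergency retrain first,
    then drift-triggered early retrain, then the weekly schedule.
    Returns the trigger name, or None if no retrain is due."""
    if perf_drop_pct >= perf_emergency_pct:
        return "emergency"   # performance-triggered
    if drift_sustained:
        return "early"       # data/drift-triggered (sustained, not one spike)
    if now - last_trained >= schedule:
        return "scheduled"   # calendar-triggered
    return None
```

Whatever this returns, the new model still enters the rollout ladder (shadow, then canary) rather than going straight to full traffic.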
LLM-specific drift
Neither of Chip Huyen's books fully covers what OpenAI and Anthropic interviewers now ask about. Three LLM-specific drifts:
- Prompt drift: your prompt template includes a phrase a provider fine-tune begins handling differently (e.g. behaviour on "system: you are a helpful assistant" changes between GPT-4 minor versions). Detect by running a golden-prompt eval suite nightly and alerting on response distribution shift.
- Output drift: same input, different outputs, due to upstream model updates or temperature creep. Measure via output embedding similarity (sentence-transformers) against a reference fixture set, alert if cosine drops below 0.85 on >5% of fixtures.
- User-population drift: the kinds of questions users ask shift. Detect by clustering embeddings of live prompts weekly and tracking cluster mass. Rising new clusters = new use case = new evaluation needs.
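The output-drift check above reduces to row-wise cosine similarity over a fixture set. A sketch, assuming embeddings are produced elsewhere (e.g. by a sentence-transformers model); the 0.85 / 5% thresholds mirror the text:

```python
import numpy as np

def output_drift_alert(ref_embs, new_embs, cos_floor=0.85, frac_limit=0.05):
    """Alert when the share of fixtures whose new-output embedding drops
    below a cosine-similarity floor vs the reference exceeds frac_limit.

    Returns (alert, fraction_below_floor).
    """
    ref = np.asarray(ref_embs, dtype=float)
    new = np.asarray(new_embs, dtype=float)
    cos = np.sum(ref * new, axis=1) / (
        np.linalg.norm(ref, axis=1) * np.linalg.norm(new, axis=1))
    frac_low = float(np.mean(cos < cos_floor))
    return frac_low > frac_limit, frac_low
```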
```mermaid
flowchart LR
    prod["Live LLM traffic"] --> emb["Embed prompts"]
    emb --> cluster["Weekly HDBSCAN clustering"]
    cluster --> compare{"Compare vs last week"}
    compare -- "new cluster" --> alert["Alert: new use case"]
    compare -- "cluster mass shift" --> eval["Trigger re-eval + maybe retrain"]
    prod --> golden["Run golden prompt suite"]
    golden --> sim["Output similarity vs fixtures"]
    sim --> alert
```
What trips up candidates
Anti-patterns
- "We monitor accuracy" — in most domains labels arrive days or weeks late (delayed feedback). You cannot rely on accuracy as the primary alert signal; you need feature and prediction monitoring that fires before labels land.
- "PSI on everything" — with 10,000 features, you will get thousands of alerts daily. Rank features by training-time importance and monitor only the top 20 aggressively.
- "Retrain on every drift alert" — creates a feedback loop that amplifies noise. Gate retrains on sustained drift + offline improvement test.
OpenAI framing
Expect questions like "how would you know your routing model degraded?" — probe around prediction entropy and per-tenant slice metrics, since one customer's workload can mask aggregate drift.
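Prediction entropy, mentioned above, is just the Shannon entropy of the predicted-class distribution; class collapse drives it toward zero even while accuracy-by-proxy looks fine. A minimal sketch:

```python
import numpy as np

def prediction_entropy(class_counts):
    """Shannon entropy (bits) of the predicted-class distribution over a
    window. Uniform predictions maximise it; collapse to one class gives 0."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]  # drop empty classes; lim p->0 of p*log(p) is 0
    return float(-np.sum(p * np.log2(p)))
```

Computed per tenant rather than in aggregate, this is exactly the slice-level signal that stops one customer's workload from masking a collapse elsewhere.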
Anthropic framing
Drift monitoring becomes a safety signal: a shift in refusal rate or jailbreak-attempt-per-1k-prompts is as important as accuracy. Expect the discussion to include "what would you escalate to the on-call safety reviewer?" — answer with clear thresholds and a human-in-the-loop path.