A16 · Low-Latency ML Inference API
Verified source
Prompt: "Design a low-latency ML inference API…SLOs…feature retrieval…canary/rollbacks…drift detection" — PracHub, Onsite. Credibility B.
Three numbers you must have
- p95 latency (e.g. 50–150 ms, depending on the business).
- Availability SLO (99.9 / 99.99).
- QPS / throughput, for capacity math.
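The QPS number feeds directly into sizing. A minimal back-of-envelope sketch (the headroom factor and redundancy count are illustrative assumptions, not from the source):

```python
import math

def replicas_needed(peak_qps: float, per_replica_qps: float,
                    headroom: float = 0.6, redundancy: int = 1) -> int:
    """Capacity math sketch: run each replica at `headroom` of its measured
    capacity so p95 stays flat under bursts, then add `redundancy` spares
    to survive a zone loss or a rolling deploy."""
    base = math.ceil(peak_qps / (per_replica_qps * headroom))
    return base + redundancy

# e.g. 2000 QPS peak, each replica sustains 200 QPS at the p95 target:
# ceil(2000 / (200 * 0.6)) + 1 = 18 replicas
```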
Architecture
```mermaid
flowchart LR
  U[Product Service] --> API[Inference API]
  API --> FS[Feature Store]
  API --> MS[Model Server]
  MS --> API
  API --> MON[Metrics + Drift]
```
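The hot path in the diagram is feature lookup followed by model scoring, each under its own deadline so one slow dependency can't consume the whole latency budget. A sketch under assumed names (`get_features`, `score`, and the timeout values are illustrative stubs, not a real SDK):

```python
import asyncio

FEATURE_TIMEOUT_S = 0.015  # keep the online-store read well under the p95 budget
MODEL_TIMEOUT_S = 0.080    # the model server gets the remainder

async def get_features(user_id: str) -> dict:
    # Stub for the online feature-store read.
    return {"user_id": user_id, "ctr_7d": 0.12}

async def score(features: dict) -> float:
    # Stub for the model-server call.
    return 0.5

async def predict(user_id: str) -> float:
    # Each stage has its own deadline; a timeout here raises and should be
    # handled by the fallback logic described later in these notes.
    features = await asyncio.wait_for(get_features(user_id), FEATURE_TIMEOUT_S)
    return await asyncio.wait_for(score(features), MODEL_TIMEOUT_S)
```

Per-stage deadlines (rather than one global timeout) make it obvious which dependency blew the budget when p95 regresses.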
Feature store realism
- The online store must be low-latency. Training-serving skew is the #1 bug.
- Cache hot features with a TTL and a feature version.
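The "TTL + version" idea can be sketched as a tiny cache keyed by (feature name, version): bumping the version on a feature-pipeline change invalidates stale entries for free, and the TTL bounds staleness for hot keys. This is an illustrative sketch, not a real feature-store client:

```python
import time

class FeatureCache:
    """TTL cache keyed by (feature_name, version)."""

    def __init__(self, ttl_s: float = 60.0):
        self.ttl_s = ttl_s
        self._store: dict[tuple[str, int], tuple[float, object]] = {}

    def put(self, name: str, version: int, value: object) -> None:
        self._store[(name, version)] = (time.monotonic(), value)

    def get(self, name: str, version: int):
        """Return the cached value, or None if missing or past its TTL."""
        entry = self._store.get((name, version))
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > self.ttl_s:
            del self._store[(name, version)]  # expired: evict and miss
            return None
        return value
```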
Fallback (don't crash on partial failure)
- Feature store slow → serve default features / a smaller model.
- GPU pool starved → switch to CPU / a smaller model.
- Model anomaly → roll back to the previous version (always keep the prior version warm).
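The first two rungs of that ladder can be sketched as a handler that degrades instead of failing. The callables and exception types here are illustrative assumptions; in practice each branch would be a client call with its own timeout and error type:

```python
def predict_with_fallback(user_id, get_features, score_large, score_small,
                          default_features):
    """Degradation sketch: defaults on a slow feature store, smaller model
    when the primary (e.g. GPU-backed) model is unavailable. Dependencies
    are injected so each failure mode maps to one except branch."""
    try:
        features = get_features(user_id)
    except TimeoutError:
        features = default_features      # store slow → default features
    try:
        return score_large(features)     # primary model
    except RuntimeError:
        return score_small(features)     # pool starved → CPU / smaller model
```

Note the fallbacks compose: a request can hit both rungs (default features scored by the small model) and still return a usable answer.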