A16 · Low-Latency ML Inference API
Verified source
Prompt: "Design a low-latency ML inference API…SLOs…feature retrieval…canary/rollbacks…drift detection" — PracHub, Onsite. Credibility B.
Three numbers you must have
- p95 latency (e.g. 50–150 ms, depending on the business).
- Availability SLO (99.9 / 99.99).
- QPS / throughput, for capacity math.
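The QPS number feeds directly into sizing. A minimal back-of-envelope sketch (the headroom factor and redundancy count are illustrative assumptions, not from the source):

```python
import math

def replicas_needed(peak_qps: float, per_replica_qps: float,
                    headroom: float = 0.6, redundancy: int = 1) -> int:
    """Capacity math sketch: run each replica at `headroom` of its measured
    capacity so p95 stays flat under bursts, then add `redundancy` spares
    to survive a zone loss or a rolling deploy."""
    base = math.ceil(peak_qps / (per_replica_qps * headroom))
    return base + redundancy

# e.g. 2000 QPS peak, each replica sustains 200 QPS at the p95 target:
# ceil(2000 / (200 * 0.6)) + 1 = 18 replicas
```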
Architecture
```mermaid
flowchart LR
  U[Product Service] --> API[Inference API]
  API --> FS[Feature Store]
  API --> MS[Model Server]
  MS --> API
  API --> MON[Metrics + Drift]
```
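The hot path in the diagram is feature lookup followed by model scoring, each under its own deadline so one slow dependency can't consume the whole latency budget. A sketch under assumed names (`get_features`, `score`, and the timeout values are illustrative stubs, not a real SDK):

```python
import asyncio

FEATURE_TIMEOUT_S = 0.015  # keep the online-store read well under the p95 budget
MODEL_TIMEOUT_S = 0.080    # the model server gets the remainder

async def get_features(user_id: str) -> dict:
    # Stub for the online feature-store read.
    return {"user_id": user_id, "ctr_7d": 0.12}

async def score(features: dict) -> float:
    # Stub for the model-server call.
    return 0.5

async def predict(user_id: str) -> float:
    # Each stage has its own deadline; a timeout here raises and should be
    # handled by the fallback logic described later in these notes.
    features = await asyncio.wait_for(get_features(user_id), FEATURE_TIMEOUT_S)
    return await asyncio.wait_for(score(features), MODEL_TIMEOUT_S)
```

Per-stage deadlines (rather than one global timeout) make it obvious which dependency blew the budget when p95 regresses.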
Feature store realism
- The online store must be low-latency. Training-serving skew is the #1 bug.
- Cache hot features with a TTL and a feature version.
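The "TTL + version" idea can be sketched as a tiny cache keyed by (feature name, version): bumping the version on a feature-pipeline change invalidates stale entries for free, and the TTL bounds staleness for hot keys. This is an illustrative sketch, not a real feature-store client:

```python
import time

class FeatureCache:
    """TTL cache keyed by (feature_name, version)."""

    def __init__(self, ttl_s: float = 60.0):
        self.ttl_s = ttl_s
        self._store: dict[tuple[str, int], tuple[float, object]] = {}

    def put(self, name: str, version: int, value: object) -> None:
        self._store[(name, version)] = (time.monotonic(), value)

    def get(self, name: str, version: int):
        """Return the cached value, or None if missing or past its TTL."""
        entry = self._store.get((name, version))
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > self.ttl_s:
            del self._store[(name, version)]  # expired: evict and miss
            return None
        return value
```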
Fallback (don't crash on partial failure)
- Feature store slow → serve default features / a smaller model.
- GPU pool starved → switch to CPU / a smaller model.
- Model anomaly → roll back to the previous version (always keep the prior version warm).
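The first two rungs of that ladder can be sketched as a handler that degrades instead of failing. The callables and exception types here are illustrative assumptions; in practice each branch would be a client call with its own timeout and error type:

```python
def predict_with_fallback(user_id, get_features, score_large, score_small,
                          default_features):
    """Degradation sketch: defaults on a slow feature store, smaller model
    when the primary (e.g. GPU-backed) model is unavailable. Dependencies
    are injected so each failure mode maps to one except branch."""
    try:
        features = get_features(user_id)
    except TimeoutError:
        features = default_features      # store slow → default features
    try:
        return score_large(features)     # primary model
    except RuntimeError:
        return score_small(features)     # pool starved → CPU / smaller model
```

Note the fallbacks compose: a request can hit both rungs (default features scored by the small model) and still return a usable answer.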