Anthropic ★★ Frequent Hard Design Review SRE

A17 · Review an Inference API Design for Scale

Verified source

Prompt: "You are reviewing another engineer's design doc…critique…SLOs…autoscaling…circuit breakers…canary…audit logs…cost controls." — PracHub, Onsite. Credibility B.

Standard answer structure (use as a checklist)

Fill in missing SLOs and capacity assumptions first.
Find single points of failure and failure modes (GPU OOM, hot-swap failure, queue backlog, cross-AZ partition).
Prioritize changes: safety first (rate limiting / circuit breakers / rollback) → efficiency (batching / caching) → cost (SKU pools / valley-filling).
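
For the first checklist item, a quick way to turn a missing capacity assumption into a number is Little's law (in-flight requests = arrival rate × mean time in system). The sketch below is illustrative only: the function name, the headroom parameter, and the example figures are assumptions, not values from the design doc under review.

```python
import math


def replicas_needed(peak_rps: float, mean_latency_s: float,
                    concurrency_per_replica: int, headroom: float = 0.25) -> int:
    """Estimate the replica count needed to serve peak load.

    All parameters are hypothetical inputs a reviewer would ask the
    author to state explicitly.
    """
    # Little's law: average number of requests in flight at peak.
    in_flight = peak_rps * mean_latency_s
    # Keep some headroom so replicas are not planned at 100% utilisation.
    usable_slots = concurrency_per_replica * (1.0 - headroom)
    return math.ceil(in_flight / usable_slots)


# Example: 200 req/s at peak, 2 s mean latency, 16 concurrent requests
# per replica, 25% headroom.
print(replicas_needed(200, 2.0, 16, 0.25))
```

If the doc cannot supply these three inputs, that gap is itself the first review finding.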

Failure-mode catalog

Overload: admission control + token-level backpressure.
Bad release: canary + automated rollback on error-budget burn.
Noisy neighbor: per-tenant isolation, quota, priority queues.
Data corruption: immutable inputs, output signing, audit log.
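
As one possible reading of "admission control + token-level backpressure" combined with per-tenant quota, the sketch below keeps a token bucket per tenant, denominated in model tokens rather than requests. The class names, the cost formula (prompt tokens plus the requested output cap), and the rates are all assumptions made for illustration. The clock is passed in explicitly so the behaviour is easy to test.

```python
class TokenBudget:
    """Token bucket measured in model tokens, not requests."""

    def __init__(self, rate: float, capacity: float, now: float):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size, in tokens
        self.level = capacity     # start with a full bucket
        self.last = now

    def try_admit(self, cost: float, now: float) -> bool:
        # Refill for the elapsed time, capped at the burst capacity.
        self.level = min(self.capacity,
                         self.level + (now - self.last) * self.rate)
        self.last = now
        if cost <= self.level:
            self.level -= cost
            return True
        return False  # caller should reject (e.g. HTTP 429) or queue


class AdmissionController:
    """One budget per tenant, so a heavy tenant cannot starve the others."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.budgets = {}

    def admit(self, tenant: str, prompt_tokens: int,
              max_output_tokens: int, now: float) -> bool:
        if tenant not in self.budgets:
            self.budgets[tenant] = TokenBudget(self.rate, self.capacity, now)
        # Assumed cost model: charge the worst case up front.
        cost = prompt_tokens + max_output_tokens
        return self.budgets[tenant].try_admit(cost, now)
```

Charging the worst-case output length at admission time is a conservative choice; a real system might refund unused tokens once generation finishes.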

Tip

Anchor each recommendation to an SLO change (e.g., "this canary policy cuts the blast radius of a bad release from 100% to 1%"). Advice without SLO framing loses points.
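
To make the SLO framing concrete, the arithmetic behind the canary example can be sketched as follows. The 99.9% SLO, the 30-day window, and the 50% failure rate of the bad release are assumed figures chosen for illustration; the functions are hypothetical helpers, not part of any library.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than 'allowed' the error budget is being spent."""
    error_budget = 1.0 - slo_target
    return error_rate / error_budget


def hours_to_exhaust(error_rate: float, slo_target: float,
                     window_hours: float = 30 * 24) -> float:
    """Hours until the whole window's error budget is gone at this rate."""
    return window_hours / burn_rate(error_rate, slo_target)


def overall_error_rate(traffic_share: float, failure_rate: float) -> float:
    """Fleet-wide error rate when only `traffic_share` runs the bad release."""
    return traffic_share * failure_rate


SLO = 0.999          # assumed availability target
BAD_RELEASE = 0.5    # assumed: the bad release fails half of its requests

for share in (1.0, 0.01):  # full rollout vs. 1% canary
    rate = overall_error_rate(share, BAD_RELEASE)
    print(f"share={share:.0%} "
          f"burn_rate={burn_rate(rate, SLO):.1f} "
          f"hours_to_exhaust={hours_to_exhaust(rate, SLO):.2f}")
```

Under these assumptions, a full rollout exhausts the 30-day budget in under two hours, while the 1% canary leaves roughly six days, which is the kind of before/after statement the tip asks for.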


Related study-guide topics