A38 · Design Claude's Prompt Caching Service
Verified source
Anthropic launched prompt caching in 2024 (docs). Credibility: A.
Architecture
flowchart LR
    Req --> P[Prefix parser w/ cache_control]
    P --> H[Hash prefix]
    H --> ROUTE[Consistent hash to GPU]
    ROUTE --> GPU[GPU - KV cache pool]
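The "Hash prefix" and "Consistent hash to GPU" stages can be sketched as a small hash ring. This is only an illustration under stated assumptions: the names `HashRing`, `prefix_hash`, and the `gpu-N` node labels are hypothetical, and the use of SHA-256 and 64 virtual nodes per GPU is an arbitrary choice, not something the source specifies.

```python
import bisect
import hashlib


def _point(key: str) -> int:
    """Map a string to a position on the ring (first 8 bytes of SHA-256)."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


def prefix_hash(prefix_text: str) -> str:
    """Hash of the cacheable prefix; used as the routing key (assumed scheme)."""
    return hashlib.sha256(prefix_text.encode("utf-8")).hexdigest()


class HashRing:
    """Consistent-hash ring: each node owns several virtual points on the ring."""

    def __init__(self, nodes, vnodes=64):
        self._ring = sorted(
            (_point(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [p for p, _ in self._ring]

    def route(self, key: str) -> str:
        """Return the node owning the first ring point after the key's position."""
        idx = bisect.bisect_right(self._points, _point(key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["gpu-0", "gpu-1", "gpu-2"])
target = ring.route(prefix_hash("You are a helpful assistant."))
```

The point of consistent hashing here is that requests sharing a prefix land on the same GPU, and removing one GPU only remaps the prefixes that GPU owned, so the other KV cache pools stay warm.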
Key decisions
- **Explicit markers**: the user annotates which parts to cache; this avoids mistaken automatic deduplication.
- **Prefix-aligned**: the cache covers a contiguous leading segment of the prompt, so differences after that segment do not invalidate it.
- **5-min TTL**: matches the typical gap between turns in a session and avoids holding GPU memory for stale prompts.
- **Sticky routing**: identical to O31 (OpenAI prompt cache).
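The first three decisions can be combined in a minimal sketch. Everything here is an assumption made for illustration: the block shape (`{"text": ..., "cache": True}`), the names `cacheable_prefix` and `PrefixCache`, the opaque `kv_handle` standing in for GPU-resident KV state, and in particular the choice to refresh the TTL on every hit, which the notes above do not state.

```python
import hashlib
import time

TTL_SECONDS = 5 * 60  # the 5-minute TTL from the notes above


def cacheable_prefix(blocks):
    """Return the text from the start through the last explicitly marked block.

    Blocks are dicts like {"text": str, "cache": bool}; returns None when
    nothing is marked, because caching is opt-in.
    """
    last = max((i for i, b in enumerate(blocks) if b.get("cache")), default=-1)
    if last < 0:
        return None
    return "".join(b["text"] for b in blocks[: last + 1])


class PrefixCache:
    """In-memory stand-in for a per-GPU KV cache index with expiry."""

    def __init__(self, ttl=TTL_SECONDS, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._entries = {}  # prefix hash -> (kv_handle, expires_at)

    @staticmethod
    def _key(prefix):
        return hashlib.sha256(prefix.encode("utf-8")).hexdigest()

    def store(self, blocks, kv_handle):
        """Record KV state for the marked prefix; returns False if nothing is marked."""
        prefix = cacheable_prefix(blocks)
        if prefix is None:
            return False
        self._entries[self._key(prefix)] = (kv_handle, self._clock() + self._ttl)
        return True

    def lookup(self, blocks):
        """Return the cached handle for the marked prefix, or None on miss or expiry."""
        prefix = cacheable_prefix(blocks)
        if prefix is None:
            return None
        key = self._key(prefix)
        entry = self._entries.get(key)
        now = self._clock()
        if entry is None or entry[1] <= now:
            self._entries.pop(key, None)  # evict stale state
            return None
        # Assumption: a hit extends the entry's lifetime.
        self._entries[key] = (entry[0], now + self._ttl)
        return entry[0]
```

Because the key is derived only from the marked leading segment, two requests that share a system prompt but differ in the final user turn resolve to the same entry.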
Follow-ups
- Granularity? Minimum 1024 tokens; shorter prefixes are not cached.
- User changes the cached prefix slightly? Cache miss; the cache must be re-warmed.
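Both follow-up answers fall out of how the cache key is derived. The sketch below assumes the key is an exact hash over the prefix's token IDs; `cache_key` and `MIN_CACHEABLE_TOKENS` are hypothetical names, and token IDs are assumed to be non-negative integers that fit in 32 bits.

```python
import hashlib

MIN_CACHEABLE_TOKENS = 1024  # minimum granularity from the notes above


def cache_key(prefix_tokens):
    """Return a key for the prefix, or None if it is too short to cache."""
    if len(prefix_tokens) < MIN_CACHEABLE_TOKENS:
        return None
    digest = hashlib.sha256()
    for token_id in prefix_tokens:
        # Fixed-width encoding so token boundaries cannot be confused.
        digest.update(token_id.to_bytes(4, "big"))
    return digest.hexdigest()


base = list(range(1024))
edited = base.copy()
edited[500] += 1  # a single-token change inside the prefix

short_key = cache_key(base[:1023])  # below the minimum: not cacheable
base_key = cache_key(base)
edited_key = cache_key(edited)  # differs from base_key, so this is a miss
```

With an exact hash there is no partial credit: any edit inside the marked prefix yields a different key, and the new prefix has to be computed and stored again.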