OpenAI ★★★ · Frequent · Medium · KV Cache · Prefix · Hit Rate

O31 · Design Prompt / Model Cache

Verified source

OpenAI prompt caching launched in October 2024 with a 50% discount on cached input tokens (docs). Asked at onsites thereafter. Credibility: A.

Architecture

flowchart LR
  Req --> H[Prefix Hasher - first N tokens]
  H --> R[Router by prefix hash]
  R --> Node1[GPU Node - KV cache pool]
  R --> Node2[GPU Node - KV cache pool]
  Node1 --> Gen
  Node2 --> Gen
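The hash-then-route path in the diagram can be sketched as follows. This is a minimal illustration, not OpenAI's implementation: the block size, virtual-node count, and all names are assumptions.

```python
import hashlib
from bisect import bisect_right

BLOCK = 16            # KV-cache block size in tokens (illustrative)
PREFIX_TOKENS = 1024  # hash only the first N tokens, per the design above

def prefix_key(token_ids: list[int]) -> str:
    """Hash the first PREFIX_TOKENS token ids, truncated to a block boundary."""
    n = min(len(token_ids), PREFIX_TOKENS)
    n -= n % BLOCK  # align to a block boundary so a partial block never splits a key
    data = b"".join(t.to_bytes(4, "little") for t in token_ids[:n])
    return hashlib.sha256(data).hexdigest()

class ConsistentHashRouter:
    """Maps a prefix key to a GPU node via a hash ring with virtual nodes."""
    def __init__(self, nodes: list[str], vnodes: int = 64):
        self.ring = sorted(
            (int(hashlib.sha256(f"{n}#{i}".encode()).hexdigest(), 16), n)
            for n in nodes for i in range(vnodes)
        )
        self.points = [p for p, _ in self.ring]

    def route(self, key: str) -> str:
        h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
        i = bisect_right(self.points, h) % len(self.ring)
        return self.ring[i][1]
```

Because the key is block-aligned, two requests that differ only inside the final partial block hash to the same key and land on the same node; adding or removing a node remaps only a fraction of keys.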

Key decisions

  • **Tokenise before hashing**: the cache key is the hash of the first 1024 tokens, aligned to a block boundary.
  • **Consistent-hash routing** keeps the same prefix on the same GPU, maximising KV-cache reuse.
  • **Eviction = LRU with pinning** for hot system prompts; the pool is sized to hold the top-K prefixes.
  • **Billing**: cache hits are discounted and surfaced in response headers.
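The eviction decision above can be sketched as an LRU map that skips a pinned set. A minimal sketch under assumed names; real pools evict at KV-block granularity, not whole entries.

```python
from collections import OrderedDict

class PinnedLRU:
    """LRU cache that never evicts pinned entries (e.g. hot system prompts)."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.cache: OrderedDict[str, object] = OrderedDict()
        self.pinned: set[str] = set()

    def get(self, key):
        if key not in self.cache:
            return None
        self.cache.move_to_end(key)  # mark as most recently used
        return self.cache[key]

    def put(self, key, value, pin: bool = False):
        self.cache[key] = value
        self.cache.move_to_end(key)
        if pin:
            self.pinned.add(key)
        # evict least-recently-used unpinned entries once over capacity
        while len(self.cache) > self.capacity:
            victim = next((k for k in self.cache if k not in self.pinned), None)
            if victim is None:
                break  # everything is pinned; tolerate temporary overflow
            del self.cache[victim]
```

Pinning trades pool headroom for a guaranteed hit rate on the prompts that dominate traffic.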

Follow-ups

  • Side-channel attack via shared prefixes? Use a tenant-isolated prefix namespace.
  • Consistency across model versions? Include a model-version hash in the cache key.
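Both follow-ups reduce to widening the cache key; a hypothetical sketch (field names and delimiter are assumptions):

```python
import hashlib

def cache_key(tenant_id: str, model_version: str, prefix_hash: str) -> str:
    """Namespace the prefix hash by tenant (side-channel isolation) and by
    model version (cross-version consistency), per the follow-ups above."""
    payload = f"{tenant_id}|{model_version}|{prefix_hash}".encode()
    return hashlib.sha256(payload).hexdigest()
```

Two tenants sending an identical prefix, or one tenant spanning a model rollout, can never observe each other's hits, because their keys differ.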

Related study-guide topics