OpenAI ★★★ · Frequent · Medium · KV Cache · Prefix · Hit Rate

O31 · Design Prompt / Model Cache

Verified source

OpenAI prompt caching launched in October 2024 with a 50% discount on cached input tokens (docs). Asked at onsites thereafter. Credibility: A.

Architecture

flowchart LR
  Req --> H[Prefix Hasher - first N tokens]
  H --> R[Router by prefix hash]
  R --> Node1[GPU Node - KV cache pool]
  R --> Node2[GPU Node - KV cache pool]
  Node1 --> Gen
  Node2 --> Gen
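The hash-then-route path in the diagram can be sketched as follows. This is a minimal illustration, not OpenAI's implementation: the block size, virtual-node count, and all names are assumptions.

```python
import hashlib
from bisect import bisect_right

BLOCK = 16            # KV-cache block size in tokens (illustrative)
PREFIX_TOKENS = 1024  # hash only the first N tokens, per the design above

def prefix_key(token_ids: list[int]) -> str:
    """Hash the first PREFIX_TOKENS token ids, truncated to a block boundary."""
    n = min(len(token_ids), PREFIX_TOKENS)
    n -= n % BLOCK  # align to a block boundary so a partial block never splits a key
    data = b"".join(t.to_bytes(4, "little") for t in token_ids[:n])
    return hashlib.sha256(data).hexdigest()

class ConsistentHashRouter:
    """Maps a prefix key to a GPU node via a hash ring with virtual nodes."""
    def __init__(self, nodes: list[str], vnodes: int = 64):
        self.ring = sorted(
            (int(hashlib.sha256(f"{n}#{i}".encode()).hexdigest(), 16), n)
            for n in nodes for i in range(vnodes)
        )
        self.points = [p for p, _ in self.ring]

    def route(self, key: str) -> str:
        h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
        i = bisect_right(self.points, h) % len(self.ring)
        return self.ring[i][1]
```

Because the key is block-aligned, two requests that differ only inside the final partial block hash to the same key and land on the same node; adding or removing a node remaps only a fraction of keys.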

Key decisions

  • **Tokenise before hashing**: the cache key is the hash of the first 1024 tokens, aligned to a block boundary.
  • **Consistent-hash routing** keeps the same prefix on the same GPU, maximising KV-cache reuse.
  • **Eviction = LRU with pinning** for hot system prompts; the pool is sized to hold the top-K prefixes.
  • **Billing**: cache hits are discounted and surfaced in response headers.
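The eviction decision above can be sketched as an LRU map that skips a pinned set. A minimal sketch under assumed names; real pools evict at KV-block granularity, not whole entries.

```python
from collections import OrderedDict

class PinnedLRU:
    """LRU cache that never evicts pinned entries (e.g. hot system prompts)."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.cache: OrderedDict[str, object] = OrderedDict()
        self.pinned: set[str] = set()

    def get(self, key):
        if key not in self.cache:
            return None
        self.cache.move_to_end(key)  # mark as most recently used
        return self.cache[key]

    def put(self, key, value, pin: bool = False):
        self.cache[key] = value
        self.cache.move_to_end(key)
        if pin:
            self.pinned.add(key)
        # evict least-recently-used unpinned entries once over capacity
        while len(self.cache) > self.capacity:
            victim = next((k for k in self.cache if k not in self.pinned), None)
            if victim is None:
                break  # everything is pinned; tolerate temporary overflow
            del self.cache[victim]
```

Pinning trades pool headroom for a guaranteed hit rate on the prompts that dominate traffic.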

Follow-ups

  • Side-channel attack via shared prefixes? Use a tenant-isolated prefix namespace.
  • Consistency across model versions? Include a model-version hash in the cache key.
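Both follow-ups reduce to widening the cache key; a hypothetical sketch (field names and delimiter are assumptions):

```python
import hashlib

def cache_key(tenant_id: str, model_version: str, prefix_hash: str) -> str:
    """Namespace the prefix hash by tenant (side-channel isolation) and by
    model version (cross-version consistency), per the follow-ups above."""
    payload = f"{tenant_id}|{model_version}|{prefix_hash}".encode()
    return hashlib.sha256(payload).hexdigest()
```

Two tenants sending an identical prefix, or one tenant spanning a model rollout, can never observe each other's hits, because their keys differ.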

Related study-guide topics