Anthropic ★★★ · Frequent · Hard · WFQ · Result Cache

A13 · Inference Routing & Scheduling Layer

Verified source

Prompt: "routing layer…prioritization…dynamic batching…query result cache…credit-based fairness" — PracHub, Onsite. Credibility B.

Three things to nail

  1. Multi-tenant priority: different tenants / request classes have different SLOs and quotas.
  2. Heterogeneous routing: GPU / CPU pools, different model versions, different hardware pools.
  3. Determinism & caching: at temperature=0, results can be cached (only for reproducible inputs).

Architecture

```mermaid
flowchart LR
  API[Front API] --> RT[Router]
  RT --> PQ[Priority Queues]
  PQ --> BT[Batcher]
  BT --> GPU[GPU Pool]
  RT --> CPU[CPU Pool]
  RT --> Cache[(Result Cache)]
```
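The dispatch logic in the diagram can be sketched as follows. This is a minimal illustration, not the actual system: the cache is checked first, short CPU-eligible requests bypass the GPU queue, and everything else lands in a priority queue feeding the batcher. The `CPU_MAX_TOKENS` cutoff and request-dict fields are assumptions.

```python
import heapq
import itertools

class Router:
    """Toy router mirroring the flowchart: cache check -> CPU pool for
    small requests -> priority queue for the GPU batcher."""

    CPU_MAX_TOKENS = 64  # assumed cutoff for routing to the CPU pool

    def __init__(self):
        self.cache = {}   # cache_key -> cached response
        self.queue = []   # min-heap of (priority, seq, request)
        self._seq = itertools.count()  # tie-breaker keeps FIFO order

    def route(self, request):
        key = request.get("cache_key")
        if key is not None and key in self.cache:
            return ("cache", self.cache[key])          # cache hit, no compute
        if request["est_tokens"] <= self.CPU_MAX_TOKENS:
            return ("cpu", request)                    # small: CPU pool
        # lower priority number = served first; seq breaks ties FIFO
        heapq.heappush(self.queue, (request["priority"], next(self._seq), request))
        return ("queued", None)

    def next_for_batcher(self):
        """Pop the highest-priority request for the GPU batcher."""
        return heapq.heappop(self.queue)[2] if self.queue else None
```

A real router would also consider model-version affinity and pool health, but the three-way split (cache / CPU / queued-for-GPU) is the skeleton of the answer.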

Credit-based fairness (standard answer)

  • Weighted Fair Queueing / Deficit Round Robin: prevent a large tenant from monopolizing the GPU pool.
  • Burn credits by estimated token cost, not request count — LLM cost scales with tokens, not with requests.
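The two bullets above combine naturally: Deficit Round Robin, with the deficit counter denominated in estimated tokens rather than requests. A minimal sketch (class and field names are illustrative, not from the source):

```python
from collections import deque

class DrrScheduler:
    """Deficit Round Robin across tenants; credits are burned by
    estimated token cost, not request count."""

    def __init__(self, weights):
        # weights: tenant -> quantum (token credits added each round)
        self.weights = dict(weights)
        self.deficit = {t: 0 for t in weights}
        self.queues = {t: deque() for t in weights}

    def submit(self, tenant, request, est_tokens):
        self.queues[tenant].append((request, est_tokens))

    def next_batch(self, max_requests=8):
        """One DRR round: top up each tenant's deficit by its quantum,
        then drain requests whose token cost fits within the deficit."""
        batch = []
        for tenant, q in self.queues.items():
            if not q:
                continue
            self.deficit[tenant] += self.weights[tenant]
            while q and q[0][1] <= self.deficit[tenant] and len(batch) < max_requests:
                request, cost = q.popleft()
                self.deficit[tenant] -= cost
                batch.append(request)
            if not q:
                self.deficit[tenant] = 0  # idle tenants don't bank credit
        return batch
```

With equal weights, a tenant submitting many 900-token requests drains its quantum after one request per round, while a tenant with 100-token requests still gets served — exactly the monopolization guard the bullet describes.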

Cache admission & eviction

  • Cache only reproducible requests: temperature=0, pinned model version, fixed system prompt.
  • Key: hash(model_version, prompt_prefix, user_input, tool_state).
  • Eviction: LRU/LFU plus cost awareness (a very long response may not be worth its cache footprint).
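A sketch of the key scheme and a cost-aware LRU, under assumed details: SHA-256 over the fields listed above, a byte budget, and an admission filter that rejects responses larger than a fraction of the budget (the "giant response" case).

```python
import hashlib
from collections import OrderedDict

def cache_key(model_version, prompt_prefix, user_input, tool_state):
    """Deterministic key over everything that affects the output.
    The separator-joined serialization is illustrative."""
    blob = "\x1f".join([model_version, prompt_prefix, user_input, tool_state])
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

class CostAwareLRU:
    """LRU with a byte budget plus admission control: entries above
    max_entry_fraction of the budget are never cached."""

    def __init__(self, capacity_bytes, max_entry_fraction=0.1):
        self.capacity = capacity_bytes
        self.max_entry = int(capacity_bytes * max_entry_fraction)
        self.used = 0
        self.entries = OrderedDict()  # key -> (value, size); front = LRU

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key][0]

    def put(self, key, value):
        size = len(value.encode("utf-8"))
        if size > self.max_entry:
            return False  # admission control: giant response, not worth it
        while self.used + size > self.capacity and self.entries:
            _, (_, old_size) = self.entries.popitem(last=False)  # evict LRU
            self.used -= old_size
        self.entries[key] = (value, size)
        self.used += size
        return True
```

Note that the admission check runs before eviction, so an oversized response never evicts useful entries just to be rejected later.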

Related study-guide topics