A13 · Inference Routing & Scheduling Layer
Verified source
Prompt: "routing layer…prioritization…dynamic batching…query result cache…credit-based fairness" — PracHub, Onsite. Credibility B.
Three things to nail (get these across and you're halfway there)
- Multi-tenant priority: different tenants / request classes have different SLOs & quotas.
- Heterogeneous routing: route across GPU / CPU pools, model versions, and hardware generations.
- Determinism & caching: cache results only at temperature=0, and only for reproducible inputs.
Architecture
```mermaid
flowchart LR
    API[Front API] --> RT[Router]
    RT --> PQ[Priority Queues]
    PQ --> BT[Batcher]
    BT --> GPU[GPU Pool]
    RT --> CPU[CPU Pool]
    RT --> Cache[(Result Cache)]
```
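One way to read the diagram as code. This is a minimal sketch under assumptions of my own (the `Router` class, the `-small`-suffix routing rule, and the queue layout are all hypothetical), not any real serving framework's API:

```python
from dataclasses import dataclass, field
import hashlib
import heapq
import json

@dataclass(order=True)
class Request:
    priority: int                       # lower = more urgent (from tenant SLO class)
    payload: dict = field(compare=False)

class Router:
    """Front-of-stack routing: result-cache check, then heterogeneous pool choice."""

    def __init__(self):
        self.cache: dict = {}           # result cache, keyed by request hash
        self.gpu_queue: list = []       # priority heap feeding the batcher
        self.cpu_queue: list = []       # pool for models cheap enough for CPU

    def cache_key(self, req: dict) -> str:
        # Canonical JSON so logically equal requests hash identically.
        blob = json.dumps(req, sort_keys=True)
        return hashlib.sha256(blob.encode()).hexdigest()

    def route(self, req: dict, priority: int) -> str:
        # 1. Deterministic requests may be served straight from the cache.
        if req.get("temperature") == 0:
            key = self.cache_key(req)
            if key in self.cache:
                return "cache-hit:" + key[:8]
        # 2. Heterogeneous routing: assume small models can run on the CPU pool.
        target = self.cpu_queue if req.get("model", "").endswith("-small") else self.gpu_queue
        heapq.heappush(target, Request(priority, req))
        return "enqueued"
```

The batcher would then pop from the priority heaps; the heap ordering is what gives request classes with tighter SLOs a head start.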
Credit-based fairness (standard answer)
- Weighted Fair Queueing / Deficit Round Robin: prevent a large tenant from monopolizing the GPUs.
- Credits are burned against estimated token cost, not request count (LLM cost scales with tokens, not request volume).
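The two bullets combine naturally: a Deficit Round Robin loop whose deficit counter is denominated in tokens, so each dispatched request burns its estimated token cost. A sketch with assumed names (`quantum_tokens` as the per-tenant weight, `est_tokens` per request), not a specific scheduler implementation:

```python
from collections import deque

def drr_schedule(tenants: dict, quantum_tokens: dict, rounds: int) -> list:
    """Deficit Round Robin over per-tenant queues of (request_id, est_tokens).

    Each round a tenant's deficit grows by its quantum (its weight, in tokens);
    it may dispatch requests while the deficit covers their token cost. A tenant
    flooding many cheap requests still cannot exceed its token budget per round.
    """
    deficits = {t: 0 for t in tenants}
    dispatched = []
    for _ in range(rounds):
        for tenant, queue in tenants.items():
            if not queue:
                deficits[tenant] = 0          # idle tenants don't bank credit
                continue
            deficits[tenant] += quantum_tokens[tenant]
            while queue and queue[0][1] <= deficits[tenant]:
                req_id, est = queue.popleft()
                deficits[tenant] -= est       # burn credits by token estimate
                dispatched.append(req_id)
    return dispatched
```

With equal quanta, a tenant submitting three 80-token requests and a tenant submitting one 50-token request get comparable token throughput per round, which is the fairness property the interview answer should name.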
Cache admission & eviction
- Cache only reproducible requests: temperature=0, fixed model version, fixed system prompt.
- Key: hash(model_version, prompt_prefix, user_input, tool_state).
- Eviction: LRU/LFU plus cost-awareness (a giant response may not be worth caching).
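The key formula and the cost-aware eviction bullet can be sketched together. Assumptions of mine: SHA-256 over canonical JSON for the key, and a simple "reject oversized entries" admission rule standing in for full cost-aware scoring:

```python
from collections import OrderedDict
import hashlib
import json

def cache_key(model_version: str, prompt_prefix: str, user_input: str, tool_state: dict) -> str:
    """hash(model_version, prompt_prefix, user_input, tool_state) from the bullet
    above; sort_keys canonicalizes tool_state so equal states hash equally."""
    blob = json.dumps([model_version, prompt_prefix, user_input, tool_state], sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

class CostAwareCache:
    """LRU with an admission guard: entries larger than max_entry_bytes are not
    admitted at all (a giant response may not be worth its cache footprint)."""

    def __init__(self, capacity: int, max_entry_bytes: int):
        self.capacity = capacity
        self.max_entry_bytes = max_entry_bytes
        self.store = OrderedDict()

    def put(self, key: str, value: str) -> bool:
        if len(value.encode()) > self.max_entry_bytes:
            return False                        # admission control: too big to cache
        self.store[key] = value
        self.store.move_to_end(key)
        while len(self.store) > self.capacity:
            self.store.popitem(last=False)      # evict least-recently-used entry
        return True

    def get(self, key: str):
        if key in self.store:
            self.store.move_to_end(key)         # refresh recency on hit
            return self.store[key]
        return None
```

A production version would replace the byte cutoff with a score such as recompute-cost / bytes, but the interview point is the same: eviction must weigh size against the GPU time a hit saves.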