A13 · Inference Routing & Scheduling Layer
Verified source
Prompt: "routing layer…prioritization…dynamic batching…query result cache…credit-based fairness" — PracHub, Onsite. Credibility B.
Three things to nail (get these across and you're halfway there)
- Multi-tenant priority: different tenants / request classes have different SLOs & quotas.
- Heterogeneous routing: route across GPU / CPU pools, model versions, and hardware generations.
- Determinism & caching: cache results only at temperature=0, and only for reproducible inputs.
Architecture
```mermaid
flowchart LR
    API[Front API] --> RT[Router]
    RT --> PQ[Priority Queues]
    PQ --> BT[Batcher]
    BT --> GPU[GPU Pool]
    RT --> CPU[CPU Pool]
    RT --> Cache[(Result Cache)]
```
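One way to read the diagram as code. This is a minimal sketch under assumptions of my own (the `Router` class, the `-small`-suffix routing rule, and the queue layout are all hypothetical), not any real serving framework's API:

```python
from dataclasses import dataclass, field
import hashlib
import heapq
import json

@dataclass(order=True)
class Request:
    priority: int                       # lower = more urgent (from tenant SLO class)
    payload: dict = field(compare=False)

class Router:
    """Front-of-stack routing: result-cache check, then heterogeneous pool choice."""

    def __init__(self):
        self.cache: dict = {}           # result cache, keyed by request hash
        self.gpu_queue: list = []       # priority heap feeding the batcher
        self.cpu_queue: list = []       # pool for models cheap enough for CPU

    def cache_key(self, req: dict) -> str:
        # Canonical JSON so logically equal requests hash identically.
        blob = json.dumps(req, sort_keys=True)
        return hashlib.sha256(blob.encode()).hexdigest()

    def route(self, req: dict, priority: int) -> str:
        # 1. Deterministic requests may be served straight from the cache.
        if req.get("temperature") == 0:
            key = self.cache_key(req)
            if key in self.cache:
                return "cache-hit:" + key[:8]
        # 2. Heterogeneous routing: assume small models can run on the CPU pool.
        target = self.cpu_queue if req.get("model", "").endswith("-small") else self.gpu_queue
        heapq.heappush(target, Request(priority, req))
        return "enqueued"
```

The batcher would then pop from the priority heaps; the heap ordering is what gives request classes with tighter SLOs a head start.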
Credit-based fairness (standard answer)
- Weighted Fair Queueing / Deficit Round Robin: prevent a large tenant from monopolizing the GPUs.
- Credits are burned against estimated token cost, not request count (LLM cost scales with tokens, not request volume).
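The two bullets combine naturally: a Deficit Round Robin loop whose deficit counter is denominated in tokens, so each dispatched request burns its estimated token cost. A sketch with assumed names (`quantum_tokens` as the per-tenant weight, `est_tokens` per request), not a specific scheduler implementation:

```python
from collections import deque

def drr_schedule(tenants: dict, quantum_tokens: dict, rounds: int) -> list:
    """Deficit Round Robin over per-tenant queues of (request_id, est_tokens).

    Each round a tenant's deficit grows by its quantum (its weight, in tokens);
    it may dispatch requests while the deficit covers their token cost. A tenant
    flooding many cheap requests still cannot exceed its token budget per round.
    """
    deficits = {t: 0 for t in tenants}
    dispatched = []
    for _ in range(rounds):
        for tenant, queue in tenants.items():
            if not queue:
                deficits[tenant] = 0          # idle tenants don't bank credit
                continue
            deficits[tenant] += quantum_tokens[tenant]
            while queue and queue[0][1] <= deficits[tenant]:
                req_id, est = queue.popleft()
                deficits[tenant] -= est       # burn credits by token estimate
                dispatched.append(req_id)
    return dispatched
```

With equal quanta, a tenant submitting three 80-token requests and a tenant submitting one 50-token request get comparable token throughput per round, which is the fairness property the interview answer should name.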
Cache admission & eviction
- Cache only reproducible requests: temperature=0, fixed model version, fixed system prompt.
- Key: hash(model_version, prompt_prefix, user_input, tool_state).
- Eviction: LRU/LFU plus cost-awareness (a giant response may not be worth caching).
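The key formula and the cost-aware eviction bullet can be sketched together. Assumptions of mine: SHA-256 over canonical JSON for the key, and a simple "reject oversized entries" admission rule standing in for full cost-aware scoring:

```python
from collections import OrderedDict
import hashlib
import json

def cache_key(model_version: str, prompt_prefix: str, user_input: str, tool_state: dict) -> str:
    """hash(model_version, prompt_prefix, user_input, tool_state) from the bullet
    above; sort_keys canonicalizes tool_state so equal states hash equally."""
    blob = json.dumps([model_version, prompt_prefix, user_input, tool_state], sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

class CostAwareCache:
    """LRU with an admission guard: entries larger than max_entry_bytes are not
    admitted at all (a giant response may not be worth its cache footprint)."""

    def __init__(self, capacity: int, max_entry_bytes: int):
        self.capacity = capacity
        self.max_entry_bytes = max_entry_bytes
        self.store = OrderedDict()

    def put(self, key: str, value: str) -> bool:
        if len(value.encode()) > self.max_entry_bytes:
            return False                        # admission control: too big to cache
        self.store[key] = value
        self.store.move_to_end(key)
        while len(self.store) > self.capacity:
            self.store.popitem(last=False)      # evict least-recently-used entry
        return True

    def get(self, key: str):
        if key in self.store:
            self.store.move_to_end(key)         # refresh recency on hit
            return self.store[key]
        return None
```

A production version would replace the byte cutoff with a score such as recompute-cost / bytes, but the interview point is the same: eviction must weigh size against the GPU time a hit saves.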