## G1 · Design Gemini API Serving

### Verified source
Gemini is a public Google product (ai.google.dev). Asked at Google L5/L6 AI-team onsites. Credibility A.
### Architecture

```mermaid
flowchart LR
  Client --> GW[Global GW - auth, rate limit]
  GW --> ROUTE[Region router]
  ROUTE --> TPU1[TPU pod - US]
  ROUTE --> TPU2[TPU pod - EU]
  ROUTE --> CACHE[Prefix cache]
  TPU1 --> MM[Multimodal preproc]
  MM --> MODEL[Gemini model]
```
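The gateway-to-region-router hop above can be sketched as follows. This is a minimal illustration, not Google's implementation; the names (`route_request`, `REGION_PODS`) and the fallback-to-US policy are assumptions for the example.

```python
# Illustrative region router: honor data residency, else fall back.
REGION_PODS = {"us": "tpu-pod-us", "eu": "tpu-pod-eu", "asia": "tpu-pod-asia"}

def route_request(user_region: str, residency_required: bool) -> str:
    """Pick a TPU pod; a residency-bound request must stay in its region."""
    pod = REGION_PODS.get(user_region)
    if pod is None:
        if residency_required:
            # No pod can satisfy residency for this region: fail loudly.
            raise ValueError(f"no pod satisfies residency in {user_region!r}")
        pod = REGION_PODS["us"]  # assumed default fallback region
    return pod
```

A real router would also weigh pod load and model availability; this only captures the residency constraint.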
### Key decisions

- **TPU pods over GPU**: Google's first-party hardware; higher FLOPs/$ for large-context models.
- **Long-context sharding**: a 2 M-token context requires ring attention / chunk-wise KV sharding across TPU cores.
- **Multimodal preprocessing**: images/video are tokenised at the edge (Vertex AI pipeline) to amortise TPU time.
- **Region affinity for data residency** (EU, US, Asia), same as O38.
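The chunk-wise KV sharding decision above can be sketched with a simple placement function: tokens are cut into fixed-size chunks and assigned round-robin to cores, the layout a ring-attention pass would then walk. `shard_kv_chunks` and the round-robin policy are illustrative assumptions, not the actual Gemini sharder.

```python
# Hedged sketch: place KV-cache chunks of a long context across TPU cores.
def shard_kv_chunks(num_tokens: int, num_cores: int, chunk_size: int = 128):
    """Return {core_id: [(start, end), ...]} covering [0, num_tokens)."""
    placement = {c: [] for c in range(num_cores)}
    for i, start in enumerate(range(0, num_tokens, chunk_size)):
        end = min(start + chunk_size, num_tokens)
        placement[i % num_cores].append((start, end))  # round-robin ring order
    return placement
```

At 2 M tokens and 128-token chunks this yields ~16 K chunks, which is why the per-core chunk lists (and the ring schedule over them) matter more than any single attention kernel.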
### Follow-ups

- How do you bound latency for a 2 M-token prompt? Context caching (reuse the KV cache of a shared prefix across requests) plus disaggregated prefill (separate prefill and decode fleets).
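The context-caching answer can be sketched as a keyed KV store: hash the shared prompt prefix, and on a hit prefill only the new suffix. The `PrefixCache` class and its string "handle" are stand-ins for real KV-cache blocks, purely for illustration.

```python
import hashlib

class PrefixCache:
    """Toy context cache: digest of the prefix -> opaque KV-cache handle."""

    def __init__(self):
        self._kv = {}

    def _digest(self, tokens) -> str:
        return hashlib.sha256(str(tuple(tokens)).encode("utf-8")).hexdigest()

    def lookup(self, prefix_tokens, suffix_tokens):
        """Return (kv_handle, tokens_still_needing_prefill)."""
        key = self._digest(prefix_tokens)
        if key in self._kv:
            # Hit: the prefix KV is already materialised; prefill suffix only.
            return self._kv[key], len(suffix_tokens)
        self._kv[key] = f"kv-{key[:8]}"  # stand-in for real KV blocks
        return self._kv[key], len(prefix_tokens) + len(suffix_tokens)
```

With a 2 M-token shared prefix and a short user suffix, a hit turns a multi-million-token prefill into a few-hundred-token one, which is the latency win the follow-up is after.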