Google ★★ Frequent · Hard · LLM Serving · TPU · Multimodal

G1 · Design Gemini API Serving

Verified source

Gemini is a public Google product (ai.google.dev). Asked at Google L5/L6 AI-team onsites. Credibility: A.

Architecture

flowchart LR
  Client --> GW[Global GW - auth, rate limit]
  GW --> ROUTE[Region router]
  ROUTE --> TPU1[TPU pod - US]
  ROUTE --> TPU2[TPU pod - EU]
  ROUTE --> CACHE[Prefix cache]
  TPU1 --> MM[Multimodal preproc]
  MM --> MODEL[Gemini model]

Key decisions

  • **TPU pods over GPU**: Google's first-party hardware; higher FLOPs/$ for large-context models.
  • **Long-context sharding**: a 2 M-token context requires ring attention / chunk-wise KV-cache sharding across TPU cores.
  • **Multimodal preprocessing**: images/video are tokenised at the edge (Vertex AI pipeline) to amortise TPU time.
  • **Region affinity for data residency** (EU, US, Asia), same as O38.
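The chunk-wise KV sharding above works because softmax attention can be computed incrementally over K/V chunks with the online-softmax trick (the same identity ring attention relies on). Below is a minimal single-query, scalar-dimension sketch under those assumptions; `attend_chunked` is an illustrative name, and real kernels operate on vectors/tiles, not scalars.

```python
import math

def attend_chunked(q, k_chunks, v_chunks):
    """Exact single-query attention where K/V arrive chunk by chunk
    (as if sharded across TPU cores). A running max plus rescaled
    partial sums reproduces the full softmax without ever holding
    the whole score vector."""
    m = float("-inf")  # running max of scores seen so far
    denom = 0.0        # running softmax denominator
    out = 0.0          # running weighted sum of (scalar) values
    for ks, vs in zip(k_chunks, v_chunks):
        for k, v in zip(ks, vs):
            s = q * k                  # toy 1-D dot product
            m_new = max(m, s)
            scale = math.exp(m - m_new)        # rescale old partials
            denom = denom * scale + math.exp(s - m_new)
            out = out * scale + math.exp(s - m_new) * v
            m = m_new
    return out / denom
```

Because each chunk only contributes a rescaled partial sum, chunks can live on different cores and be reduced in a ring: each core processes its local K/V shard and passes the (max, denom, out) triple along.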

Follow-ups

  • **Latency for a 2 M-token prompt?** Context caching + disaggregated prefill.
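Context caching hinges on one lookup: find the longest already-prefilled token prefix of a new prompt, so prefill only runs on the uncached suffix. A hedged sketch, assuming a hypothetical `PrefixCache` keyed on exact token sequences (production systems hash fixed-size token blocks instead of scanning every prefix length):

```python
class PrefixCache:
    """Toy prefix cache: maps a token-sequence prefix to an opaque
    handle for its prefilled KV cache."""

    def __init__(self):
        self._kv = {}  # tuple(tokens) -> KV-cache handle

    def put(self, tokens, handle):
        self._kv[tuple(tokens)] = handle

    def longest_prefix(self, tokens):
        """Return (handle, n_cached) for the longest cached prefix,
        or (None, 0) on a miss; prefill then covers tokens[n_cached:]."""
        for n in range(len(tokens), 0, -1):  # longest first
            handle = self._kv.get(tuple(tokens[:n]))
            if handle is not None:
                return handle, n
        return None, 0

cache = PrefixCache()
cache.put([1, 2, 3], "kv-A")          # e.g. a shared system prompt
handle, n = cache.longest_prefix([1, 2, 3, 4, 5])
# Only tokens [4, 5] still need prefill; [1, 2, 3] reuse "kv-A".
```

For a 2 M-token prompt this is the difference between prefilling millions of tokens and prefilling only the new suffix, which is why it pairs naturally with disaggregated prefill: the cache hit shrinks the work the prefill pool must do before decode starts.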

Related study-guide topics