## G1 · Design Gemini API Serving

### Verified source
Gemini is a public Google product (ai.google.dev). Asked at Google L5/L6 AI-team onsites. Credibility A.
### Architecture

```mermaid
flowchart LR
  Client --> GW[Global GW - auth, rate limit]
  GW --> ROUTE[Region router]
  ROUTE --> TPU1[TPU pod - US]
  ROUTE --> TPU2[TPU pod - EU]
  ROUTE --> CACHE[Prefix cache]
  TPU1 --> MM[Multimodal preproc]
  MM --> MODEL[Gemini model]
```
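The gateway-to-region-router hop above can be sketched as follows. This is a minimal illustration, not Google's implementation; the names (`route_request`, `REGION_PODS`) and the fallback-to-US policy are assumptions for the example.

```python
# Illustrative region router: honor data residency, else fall back.
REGION_PODS = {"us": "tpu-pod-us", "eu": "tpu-pod-eu", "asia": "tpu-pod-asia"}

def route_request(user_region: str, residency_required: bool) -> str:
    """Pick a TPU pod; a residency-bound request must stay in its region."""
    pod = REGION_PODS.get(user_region)
    if pod is None:
        if residency_required:
            # No pod can satisfy residency for this region: fail loudly.
            raise ValueError(f"no pod satisfies residency in {user_region!r}")
        pod = REGION_PODS["us"]  # assumed default fallback region
    return pod
```

A real router would also weigh pod load and model availability; this only captures the residency constraint.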
### Key decisions

- **TPU pods over GPU**: Google's first-party hardware; higher FLOPs/$ for large-context models.
- **Long-context sharding**: a 2 M-token context requires ring attention / chunk-wise KV sharding across TPU cores.
- **Multimodal preprocessing**: images/video are tokenised at the edge (Vertex AI pipeline) to amortise TPU time.
- **Region affinity for data residency** (EU, US, Asia), same as O38.
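The chunk-wise KV sharding decision above can be sketched with a simple placement function: tokens are cut into fixed-size chunks and assigned round-robin to cores, the layout a ring-attention pass would then walk. `shard_kv_chunks` and the round-robin policy are illustrative assumptions, not the actual Gemini sharder.

```python
# Hedged sketch: place KV-cache chunks of a long context across TPU cores.
def shard_kv_chunks(num_tokens: int, num_cores: int, chunk_size: int = 128):
    """Return {core_id: [(start, end), ...]} covering [0, num_tokens)."""
    placement = {c: [] for c in range(num_cores)}
    for i, start in enumerate(range(0, num_tokens, chunk_size)):
        end = min(start + chunk_size, num_tokens)
        placement[i % num_cores].append((start, end))  # round-robin ring order
    return placement
```

At 2 M tokens and 128-token chunks this yields ~16 K chunks, which is why the per-core chunk lists (and the ring schedule over them) matter more than any single attention kernel.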
### Follow-ups

- How do you bound latency for a 2 M-token prompt? Context caching (reuse the KV cache of a shared prefix across requests) plus disaggregated prefill (separate prefill and decode fleets).
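The context-caching answer can be sketched as a keyed KV store: hash the shared prompt prefix, and on a hit prefill only the new suffix. The `PrefixCache` class and its string "handle" are stand-ins for real KV-cache blocks, purely for illustration.

```python
import hashlib

class PrefixCache:
    """Toy context cache: digest of the prefix -> opaque KV-cache handle."""

    def __init__(self):
        self._kv = {}

    def _digest(self, tokens) -> str:
        return hashlib.sha256(str(tuple(tokens)).encode("utf-8")).hexdigest()

    def lookup(self, prefix_tokens, suffix_tokens):
        """Return (kv_handle, tokens_still_needing_prefill)."""
        key = self._digest(prefix_tokens)
        if key in self._kv:
            # Hit: the prefix KV is already materialised; prefill suffix only.
            return self._kv[key], len(suffix_tokens)
        self._kv[key] = f"kv-{key[:8]}"  # stand-in for real KV blocks
        return self._kv[key], len(prefix_tokens) + len(suffix_tokens)
```

With a 2 M-token shared prefix and a short user suffix, a hit turns a multi-million-token prefill into a few-hundred-token one, which is the latency win the follow-up is after.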