OpenAI ★★ Frequent · Hard · Scale · GPU Fleet · Regional

O12 · Design ChatGPT for 100M Users

Verified source

Prompt: "Design ChatGPT to handle 100M users." — Medium / 1Point3Acres reports (2024–2025). Credibility C.

Capacity envelope

  • 100M MAU → ~10M DAU × ~5 msg/user/day = 50M msg/day ≈ ~600 msg/sec avg, ~3–5K/sec peak.
  • Each msg ≈ 500 tokens in + 500 tokens out → ~600 msg/sec × 1K tokens ≈ 600K tokens/sec sustained; divide by per-GPU decode throughput to size the fleet.
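The two bullets above can be checked with back-of-envelope arithmetic. All constants below (daily-active ratio, peak multiplier, per-GPU throughput) are illustrative assumptions, not measured figures:

```python
# Back-of-envelope capacity math for the envelope above.
MAU = 100_000_000
DAU = MAU // 10                  # assume ~10% daily-active ratio
MSGS_PER_USER_PER_DAY = 5
TOKENS_PER_MSG = 1_000           # ~500 in + ~500 out

msgs_per_day = DAU * MSGS_PER_USER_PER_DAY           # 50M msg/day
avg_msgs_per_sec = msgs_per_day / 86_400             # ~579 msg/sec
peak_msgs_per_sec = avg_msgs_per_sec * 6             # assume ~6x diurnal peak

tokens_per_sec = avg_msgs_per_sec * TOKENS_PER_MSG   # ~579K tokens/sec sustained

# Hypothetical per-GPU decode throughput with batching:
TOKENS_PER_SEC_PER_GPU = 2_500
gpus_needed = tokens_per_sec / TOKENS_PER_SEC_PER_GPU

print(f"{avg_msgs_per_sec:.0f} msg/s avg, {peak_msgs_per_sec:.0f} msg/s peak")
print(f"{tokens_per_sec/1e3:.0f}K tokens/s -> ~{gpus_needed:.0f} GPUs sustained")
```

The sustained figure is the baseline; peak sizing and redundancy multiply it further, which is why the autoscaling signal below matters.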

Architecture

flowchart LR
  U[Users] --> CDN[Edge / CDN]
  CDN --> ALB[Regional LB]
  ALB --> GW[API Gateway]
  GW --> SESS[Session Service]
  SESS --> CONV[(Conversation Store)]
  GW --> ROUTE[Model Router]
  ROUTE --> INF[Inference Cluster]
  INF --> BATCH[Batcher] --> GPU[GPU Fleet]
  GW --> METER[Usage Metering]
  GW --> SAFE[Safety Pipeline]
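The request path in the diagram can be sketched as a single handler. Everything here is a minimal in-memory stand-in for the real services (names like `SessionStore` and `handle` are hypothetical, not OpenAI's API):

```python
from dataclasses import dataclass

@dataclass
class ChatRequest:
    user_id: str
    conversation_id: str
    text: str

class SessionStore:
    """In-memory stand-in for the Conversation Store."""
    def __init__(self):
        self._convs = {}
    def load(self, conv_id):
        # Return a copy so callers can't mutate stored history.
        return list(self._convs.setdefault(conv_id, []))
    def append(self, conv_id, msg):
        self._convs.setdefault(conv_id, []).append(msg)

def handle(req, store, is_safe, pick_model, run_inference):
    """Gateway -> session -> safety -> router -> batched inference."""
    history = store.load(req.conversation_id)     # Conversation Store
    if not is_safe(req.text):                     # Safety Pipeline
        return "[blocked by safety pipeline]"
    model = pick_model(req.user_id)               # Model Router
    store.append(req.conversation_id, req.text)
    return run_inference(model, history + [req.text])  # Batcher -> GPU fleet
```

In production each callable is a network hop; keeping safety and routing in the gateway tier (as in the diagram) keeps the GPU fleet stateless and easy to scale.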

Key decisions

  • Regional routing by user home region (latency + data residency).
  • Session stickiness to one region, with async cross-region replication of conversation metadata.
  • Capacity planning per GPU family: A100/H100 pools; autoscale on tokens/sec, not CPU utilization.
  • Degraded modes: when saturated, serve a smaller/cheaper model or cached prompt prefixes.
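The last two decisions, scaling on tokens/sec and falling back under saturation, can be sketched as pure functions. Thresholds (70% target utilization, 90% saturation cutoff) and throughput numbers are illustrative assumptions:

```python
import math

def desired_replicas(observed_tokens_per_sec: float,
                     per_gpu_capacity: float = 2_500.0,
                     target_utilization: float = 0.7,
                     min_replicas: int = 2) -> int:
    """Size the GPU pool so each GPU runs at ~70% of its token throughput."""
    needed = observed_tokens_per_sec / (per_gpu_capacity * target_utilization)
    return max(min_replicas, math.ceil(needed))

def pick_model(observed_tokens_per_sec: float, fleet_capacity: float) -> str:
    """Degraded mode: route to a smaller model when the fleet nears saturation."""
    if observed_tokens_per_sec > 0.9 * fleet_capacity:
        return "small-model"   # cheaper fallback under load
    return "large-model"
```

Keying both decisions off the same tokens/sec signal keeps autoscaling and degradation consistent: the fleet grows first, and the model downgrade only triggers while new capacity is still warming up.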

Where cost hides

$/token dominates, not compute-hours. Answer: batching + KV caching + prompt caching reduce $/token; split free/paid tiers by rate limits and model availability.
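A quick sanity check on the $/token framing, with an assumed GPU-hour price and batched throughput (neither is an OpenAI figure):

```python
# Illustrative $/token arithmetic; all prices and rates are assumptions.
GPU_HOUR_COST = 2.50             # $/GPU-hour, assumed
TOKENS_PER_SEC_PER_GPU = 2_500   # with continuous batching, assumed

cost_per_token = GPU_HOUR_COST / (TOKENS_PER_SEC_PER_GPU * 3600)

# Prompt caching: cached prefix tokens skip prefill work entirely.
cache_hit_rate = 0.30            # assumed fraction of tokens served from cache
effective_cost = cost_per_token * (1 - cache_hit_rate)

print(f"${cost_per_token * 1e6:.2f} per 1M tokens raw, "
      f"${effective_cost * 1e6:.2f} with {cache_hit_rate:.0%} prefix cache hits")
```

The point of the exercise: throughput improvements (batching, KV cache reuse) show up directly in the denominator of $/token, which is why they dominate the cost discussion rather than raw machine-hours.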

Related study-guide topics