OpenAI ★★ Frequent Medium Logs Tracing Sampling

O36 · Design a Distributed Log / Trace Pipeline

Verified source

Classic observability problem adapted at OpenAI onsites (一亩三分地, 2025). Credibility B.

Architecture

flowchart LR
  APP[Apps / agents] --> COL[OTEL Collector]
  COL --> BUS[(Kafka)]
  BUS --> HOT[(Hot store - 7d)]
  BUS --> COLD[(Cold S3/Parquet - 1y)]
  HOT --> SEARCH[Search UI]
  COLD --> LAKE[Lakehouse queries]
  BUS --> SAMP[Adaptive sampler]

Key decisions

  • **Tail-based sampling** for traces: decide after the whole trace has been seen, keep traces that contain an error or are slow, and drop normal ones. Most trace volume is normal traffic, so this cuts storage and indexing cost sharply.
  • **Tiered retention**: 7 days hot (searchable), 1 year cold (Parquet on object storage).
  • **Schema-aware columnar storage** for low-cost slicing and dicing by fields such as GPU_id and model_id.
  • **Cardinality guards**: enforce a cap on unique values per label; reject metrics that carry user_id as a label.
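The tail-based sampling decision above can be sketched as follows. This is a minimal illustration, not a production sampler: the `Span` fields, the `slow_ms` threshold, and the optional `baseline_rate` (a small fraction of normal traces to keep, defaulting to 0 so that normal traces are dropped as described) are all assumptions made for the example.

```python
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool = False


class TailSampler:
    """Buffers spans per trace and decides keep/drop once the trace is finished."""

    def __init__(self, slow_ms=1000.0, baseline_rate=0.0, rng=None):
        self.slow_ms = slow_ms              # hypothetical latency threshold
        self.baseline_rate = baseline_rate  # assumed: fraction of normal traces kept
        self.rng = rng or random.Random()
        self.buffers = defaultdict(list)    # trace_id -> spans seen so far

    def add(self, span):
        self.buffers[span.trace_id].append(span)

    def finish(self, trace_id):
        """Call when the trace is considered complete (for example after an
        idle timeout). Returns the spans to keep, or [] if the trace is dropped."""
        spans = self.buffers.pop(trace_id, [])
        if not spans:
            return []
        has_error = any(s.is_error for s in spans)
        is_slow = max(s.duration_ms for s in spans) >= self.slow_ms
        if has_error or is_slow:
            return spans
        if self.rng.random() < self.baseline_rate:
            return spans
        return []


sampler = TailSampler(slow_ms=500.0)
sampler.add(Span("a", 20.0))
sampler.add(Span("a", 30.0, is_error=True))  # error trace: kept
sampler.add(Span("b", 900.0))                # slow trace: kept
sampler.add(Span("c", 10.0))                 # normal trace: dropped
kept = {t: sampler.finish(t) for t in ("a", "b", "c")}
```

One design consequence worth stating in an interview: the decision needs every span of a trace in one place, so spans are typically routed by trace ID (for example by using it as the Kafka partition key) before they reach the sampler.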

Follow-ups

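  • Delivery semantics? At-least-once delivery plus idempotent sinks, so a redelivered record does not create a duplicate.
  • Query latency on 1 year of data? Partitioned Parquet plus predicate pushdown, so a query reads only the partitions and row groups that can match its filters.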
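To make the at-least-once plus idempotent sink answer concrete, here is a small sketch. It assumes each record carries a stable `event_id` assigned by the producer (a hypothetical field name), and it uses an in-memory dict as a stand-in for a keyed upsert in a real store.

```python
class IdempotentSink:
    """Writes are keyed by event_id, so replaying a record overwrites
    the same row instead of appending a duplicate."""

    def __init__(self):
        self.rows = {}  # event_id -> record; stand-in for a keyed upsert

    def write(self, record):
        event_id = record["event_id"]  # assumed: stable ID set by the producer
        is_new = event_id not in self.rows
        self.rows[event_id] = record
        return is_new


def consume(batch, sink):
    """Writes a batch and returns how many records were new.
    In a real consumer the offset commit would happen only after this
    returns, so a crash before the commit replays the batch."""
    return sum(1 for record in batch if sink.write(record))


sink = IdempotentSink()
batch = [
    {"event_id": "e1", "msg": "start"},
    {"event_id": "e2", "msg": "done"},
]
first = consume(batch, sink)   # first delivery
replay = consume(batch, sink)  # redelivery after a simulated crash
```

After the replay the sink still holds two rows, which is the property that lets the pipeline tolerate redelivery without exactly-once machinery.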

Related study-guide topics