X3 · Design the X Firehose Ingestion for Grok Training
Verified source
xAI has exclusive access to X's full firehose (stated publicly by Elon Musk). This question mirrors classic Twitter streaming-ingestion interview questions. Credibility: B.
Problem
X produces roughly 500M posts/day (not counting likes, reposts, or views). xAI needs to ingest this firehose for (1) the Grok training corpus and (2) real-time features in Grok answers. Design the pipeline end to end.
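As a sanity check on scale, the 500M posts/day figure translates to a modest per-second rate (a back-of-envelope sketch; the average post size and the 3x diurnal peak factor are assumptions, not from the prompt):

```python
# Back-of-envelope sizing for the firehose.
POSTS_PER_DAY = 500_000_000
SECONDS_PER_DAY = 86_400
AVG_POST_BYTES = 2_000   # assumed: text + metadata after JSON encoding
PEAK_FACTOR = 3          # assumed diurnal peak over the daily average

avg_posts_per_sec = POSTS_PER_DAY / SECONDS_PER_DAY
peak_posts_per_sec = avg_posts_per_sec * PEAK_FACTOR
avg_mb_per_sec = avg_posts_per_sec * AVG_POST_BYTES / 1e6

print(f"avg ~{avg_posts_per_sec:,.0f} posts/s, peak ~{peak_posts_per_sec:,.0f} posts/s")
print(f"avg ingest ~{avg_mb_per_sec:.0f} MB/s")
```

Roughly 5.8K posts/s on average; well within a single Kafka cluster's capacity, so the hard parts of the design are durability, enrichment, and compliance rather than raw throughput.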
Architecture
```mermaid
flowchart LR
    X[X services] --> K[Kafka firehose]
    K --> EN[Enrichment]
    EN --> F1[Training sink: Parquet/S3]
    EN --> F2[Real-time index]
    EN --> F3[Safety filter queue]
    F2 --> RS[Realtime serving for Grok]
    F3 --> MOD[Moderation + opt-out removal]
```
Key decisions
- Kafka, partitioned by user_id for per-user ordering; replication factor 3 across data centers.
- Enrichment adds language detection, topic classification, and an author-credibility score, implemented as a Flink/Spark streaming job.
- The training sink writes hourly Parquet files partitioned by date/language, feeding the nightly training-data refresh.
- Respect X users' data opt-outs via tombstones pushed through the same pipeline.
- PII is scrubbed before any data leaves the secure X VPC for xAI training clusters.
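A minimal sketch of why keying by user_id gives per-user ordering: every record with the same key hashes to the same partition, and a partition is consumed in order. (Kafka's real default partitioner hashes keys with murmur2; md5 here is a stand-in for illustration, and the partition count is an assumption.)

```python
import hashlib

NUM_PARTITIONS = 64  # assumed partition count for the firehose topic

def partition_for(user_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministically map a record key to a partition.

    Kafka's default partitioner uses murmur2 on the key bytes;
    md5 is used here only as an illustrative stand-in.
    """
    digest = hashlib.md5(user_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Every post keyed by the same user lands on the same partition,
# so that user's posts are totally ordered for any single consumer.
assert partition_for("user:42") == partition_for("user:42")
```

The trade-off to mention in an interview: very active accounts become hot partitions, so you may need key salting for the heaviest producers at the cost of per-user ordering.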
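One way the opt-out tombstones could flow through the pipeline (a sketch, assuming a tombstone is a record carrying the user's key with a null value, as in Kafka compacted topics; the record shapes and helper names are illustrative):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Record:
    key: str               # user_id
    value: Optional[dict]  # None marks a tombstone (opt-out)

def apply_stream(records):
    """Fold the firehose into retained training rows, honoring tombstones.

    A tombstone purges everything already retained for that user and
    blocks any of their posts that arrive afterwards.
    """
    retained: dict = {}
    opted_out: set = set()
    for rec in records:
        if rec.value is None:            # tombstone: purge + block
            retained.pop(rec.key, None)
            opted_out.add(rec.key)
        elif rec.key not in opted_out:
            retained.setdefault(rec.key, []).append(rec.value)
    return retained

stream = [
    Record("u1", {"post": "hello"}),
    Record("u2", {"post": "hi"}),
    Record("u1", None),                  # u1 opts out
    Record("u1", {"post": "late"}),      # arrives after opt-out, dropped
]
print(apply_stream(stream))  # → {'u2': [{'post': 'hi'}]}
```

In the real system the purge step also has to reach already-written Parquet files and any trained checkpoints' data manifests, which is why the tombstone rides the same pipeline as the posts.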
Follow-ups
- Backfill: how do you reprocess six months of history when a new enrichment field is added?
- Dedup: retweets and quote-posts duplicate content; how do you dedup at training time?
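For the backfill follow-up, one common pattern is to run the new enrichment as a batch job over the historical daily partitions, writing to a new schema-versioned path so the live stream is untouched (a sketch; the enrichment field, storage callbacks, and dates are all illustrative assumptions):

```python
from datetime import date, timedelta

def new_enrichment(post: dict) -> dict:
    """Hypothetical new field: a crude post-length bucket."""
    n = len(post.get("text", ""))
    post["length_bucket"] = "short" if n < 80 else "long"
    return post

def backfill(read_partition, write_partition, start: date, end: date):
    """Replay each daily partition through the new enrichment.

    read_partition/write_partition abstract the Parquet I/O so the
    same loop works against S3 or local storage; output goes to a
    new versioned path, never in place.
    """
    day = start
    while day <= end:
        rows = [new_enrichment(r) for r in read_partition(day)]
        write_partition(day, rows)
        day += timedelta(days=1)

# In-memory stand-ins for the Parquet reader/writer:
store = {date(2024, 1, 1): [{"text": "hi"}],
         date(2024, 1, 2): [{"text": "x" * 100}]}
out = {}
backfill(lambda d: store[d],
         lambda d, rows: out.__setitem__(d, rows),
         date(2024, 1, 1), date(2024, 1, 2))
print(out[date(2024, 1, 2)][0]["length_bucket"])  # → long
```

Because daily partitions are independent, the replay parallelizes trivially across days, and the versioned output path lets you cut consumers over atomically once the backfill completes.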
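For the dedup follow-up, a common baseline is normalizing text and hashing it, keeping the first occurrence (a sketch; this catches exact and trivially-varied duplicates only, and near-duplicate detection such as MinHash/LSH would be needed on top):

```python
import hashlib

def normalize(text: str) -> str:
    """Cheap normalization so trivial variants collide:
    lowercase, collapse whitespace, strip an 'RT @user:' retweet prefix."""
    text = text.strip().lower()
    if text.startswith("rt @"):
        text = text.split(":", 1)[-1].strip()
    return " ".join(text.split())

def dedup(posts):
    """Keep the first occurrence of each normalized post."""
    seen: set = set()
    out = []
    for post in posts:
        h = hashlib.sha256(normalize(post).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(post)
    return out

posts = ["Grok is live!", "RT @xai: Grok is live!", "grok  is live!"]
print(dedup(posts))  # → ['Grok is live!']
```

At 500M posts/day the `seen` set does not fit in one process, so in practice this runs as a distributed hash-based shuffle (group by hash, keep one per group) during corpus construction.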