The log as universal primitive
Kafka's central insight (Kreps, Narkhede & Rao 2011; popularized by Kleppmann's Ch.11): an append-only log is a general-purpose primitive that unifies messaging, storage, and state replication. Producers append; consumers read at their own offsets; the log retains records for a configured window.
- Partitioned: one topic = N partitions; each partition is an ordered log with a single leader (backed by replication).
- Immutable: once written, never rewritten. Enables replay, time-travel debugging, and multiple consumers.
- Bounded retention: 7 days typical; tiered storage (Confluent, Redpanda) extends to months in object storage at 10× lower $/GB.
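The log-plus-offsets model above can be sketched in a few lines. This is a toy in-memory version (class and method names are illustrative, not any real client API): producers append, and each consumer group advances its own offset independently, so the same records can feed a warehouse sink and a dashboard without interfering.

```python
from collections import defaultdict

class Log:
    """Append-only log: producers append, each consumer group tracks its own offset."""
    def __init__(self):
        self.records = []                  # immutable once appended
        self.offsets = defaultdict(int)    # consumer group -> next offset to read

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1       # offset of the new record

    def poll(self, group, max_records=10):
        start = self.offsets[group]
        batch = self.records[start:start + max_records]
        self.offsets[group] += len(batch)  # commit after read (at-least-once in real life)
        return batch

log = Log()
for e in ["click:a", "click:b", "click:c"]:
    log.append(e)

print(log.poll("warehouse", 2))   # ['click:a', 'click:b']
print(log.poll("dashboard"))      # ['click:a', 'click:b', 'click:c'] - independent offset
```

Note that committing the offset only after the batch is durably processed is what makes real consumers at-least-once rather than at-most-once.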
Source cross-reference
DDIA Ch.10 covers batch (MapReduce, Dataflow), Ch.11 covers streams (logs, exactly-once, event time). Both must be read together.
```mermaid
flowchart LR
    P1[Producer A] --> T[["Topic: clicks<br/>partition 0"]]
    P2[Producer B] --> T2[["Topic: clicks<br/>partition 1"]]
    T --> C1["Consumer group X<br/>offset=1234"]
    T --> C2["Consumer group Y<br/>offset=98"]
    T2 --> C1
    T2 --> C2
    C1 --> S1[("Sink: warehouse")]
    C2 --> S2[("Real-time dashboard")]
```
Concrete throughput: a single Kafka broker on an r6i.4xlarge with NVMe can sustain on the order of 1 GB/s in and 1 GB/s out across all its partitions. A LinkedIn-scale cluster (Kafka's original deployment) handles 7 trillion messages/day.
Batch processing and MapReduce
MapReduce (Dean & Ghemawat 2004) is the grandparent of modern dataflow: map a user-provided function over input partitions, shuffle by key, reduce per key. It reads bounded input, writes bounded output, and assumes any task can be retried.
Key properties:
- Deterministic: same input → same output. Enables retry on task failure without polluting results.
- Materialize intermediate state to disk/HDFS. Slow but robust.
- Latency: minutes to hours. Never a user-facing service.
Modern successors — Spark, Flink batch mode, BigQuery — trade disk materialization for in-memory shuffle, cutting latency ~5×. For LLM training data curation or daily user analytics, you want Spark/BigQuery, not raw MapReduce.
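The map → shuffle → reduce dataflow above fits in a few lines; this word-count sketch is illustrative only (in a real cluster the shuffle is hash-partitioned over the network and intermediates hit disk):

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    # user-provided map: emit (key, value) pairs
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # group all values by key (in a cluster: hash-partitioned across reducers)
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups):
    # user-provided reduce: fold the values for each key
    return {k: sum(vs) for k, vs in groups.items()}

docs = ["the log is the log", "the stream"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(d) for d in docs)))
print(counts)  # {'the': 3, 'log': 2, 'is': 1, 'stream': 1}
```

Because map and reduce are deterministic and side-effect-free, any failed task can simply be re-run on the same input partition.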
Stream processing primitives
Three operator families:
- Stateless transforms: filter, map. Trivial.
- Windowed aggregations: count per 1-minute tumbling window, per 5-minute sliding window, per session. State = window → running aggregate.
- Stream-stream joins: join two event streams within a time bound. State = buffered side × join-key. Most memory-hungry.
State must be durable (survive operator crash) and recoverable from the log. Flink stores state in RocksDB + periodic checkpoint to S3. Kafka Streams uses a local RocksDB + changelog topic in Kafka itself.
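A windowed aggregation reduces to a state map keyed by (key, window), as described above. A minimal tumbling-window count (all names illustrative; in Flink this state dict would live in RocksDB and survive via checkpoints):

```python
from collections import defaultdict

WINDOW_MS = 60_000  # 1-minute tumbling windows

def window_start(event_time_ms):
    # tumbling windows partition the time axis: each event falls in exactly one
    return event_time_ms - (event_time_ms % WINDOW_MS)

# state: (key, window start) -> running count
state = defaultdict(int)

events = [("user1", 5_000), ("user2", 59_000), ("user1", 61_000)]
for key, ts in events:
    state[(key, window_start(ts))] += 1

print(dict(state))
# {('user1', 0): 1, ('user2', 0): 1, ('user1', 60000): 1}
```

A sliding window would assign each event to several overlapping windows, and a session window would merge windows whose events fall within a gap timeout.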
Event time vs processing time, watermarks
Event time = when it happened in the real world; processing time = when your system sees it. A mobile client goes offline on a plane, sends events 6 hours later — event time is from 6 hours ago, processing time is now.
Correct streaming must aggregate by event time. But event time arrivals are out of order: how long do you wait? Enter watermarks (Akidau et al. 2015 Dataflow paper): a watermark W means "no more events with timestamp < W are expected." Windows close when the watermark passes their end.
- Watermark too aggressive → late events are dropped, results are wrong.
- Watermark too conservative → windows close slowly, latency climbs.
- In practice: heuristic watermark (a percentile of observed lag) + allowed lateness (emit an updated result when late events arrive) + a dead-letter queue for the truly stale.
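A toy version of this watermark logic, under stated assumptions (fixed allowed lag instead of a percentile heuristic, no allowed-lateness updates, names illustrative): the watermark trails the maximum observed event time, windows are emitted once the watermark passes their end, and anything behind the watermark is dead-lettered.

```python
WINDOW_MS = 60_000    # 1-minute tumbling windows
MAX_LAG_MS = 10_000   # assumption: events arrive at most 10 s out of order

watermark = 0
open_windows = {}     # window start -> running count
late = []             # dead-letter for events behind the watermark

def on_event(ts):
    """Process one event, timestamped in event time (ms)."""
    global watermark
    if ts < watermark:
        late.append(ts)                 # truly stale: dead-letter it
        return
    start = ts - ts % WINDOW_MS
    open_windows[start] = open_windows.get(start, 0) + 1
    # advance the watermark: max event time seen, minus the allowed lag
    watermark = max(watermark, ts - MAX_LAG_MS)
    # emit and close every window that ends at or before the watermark
    for s in [s for s in open_windows if s + WINDOW_MS <= watermark]:
        print(f"window [{s}, {s + WINDOW_MS}) count={open_windows.pop(s)}")

# 40_000 and 5_000 arrive after the watermark has passed them
for ts in [1_000, 55_000, 72_000, 40_000, 130_000, 5_000]:
    on_event(ts)
```

Running this drops the two stale events into `late`, which demonstrates the first bullet above: an aggressive watermark trades correctness for latency.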
Exactly-once (and why it is a lie)
End-to-end "exactly-once" is achievable only if every side effect is idempotent or transactional with the offset commit. DDIA's framing: exactly-once = at-least-once delivery + idempotent processing.
Mechanisms:
- Transactional offsets: Kafka 0.11+ supports transactions that atomically commit output records and consumer offsets. Works for Kafka-to-Kafka.
- Sink idempotency: every write carries a unique record ID and the sink dedupes. Examples: S3 conditional writes, DynamoDB ConditionExpression, Postgres ON CONFLICT DO NOTHING.
- Two-phase commit to an external sink: Flink's TwoPhaseCommitSinkFunction. Expensive, but works for JDBC and custom sinks.
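The sink-idempotency mechanism reduces to "remember which record IDs you have applied." A minimal sketch (class name illustrative; a real sink would persist the seen-set transactionally with the data, e.g. a unique column plus ON CONFLICT DO NOTHING):

```python
class IdempotentSink:
    """Dedupe by record ID, so at-least-once delivery yields exactly-once *effects*."""
    def __init__(self):
        self.rows = []
        self.seen = set()   # in a real sink: a unique index, persisted with the rows

    def write(self, record_id, payload):
        if record_id in self.seen:
            return False     # duplicate redelivery after a crash: no effect
        self.seen.add(record_id)
        self.rows.append(payload)
        return True

sink = IdempotentSink()
sink.write("req-1", {"tokens": 10})
sink.write("req-2", {"tokens": 7})
sink.write("req-1", {"tokens": 10})   # redelivered after a consumer restart
print(len(sink.rows))  # 2 - the duplicate had no effect
```

The crucial detail is that the dedupe state and the data must commit atomically; keeping `seen` in process memory, as here, would lose exactly-once on a crash.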
Anti-pattern
Claiming "exactly-once" when your sink is a REST endpoint with no idempotency key. You have at-most-once (if you commit offsets before the HTTP call) or at-least-once (if after); you do not have exactly-once. Be honest with the interviewer.
Lambda vs Kappa architectures
Lambda (Marz 2011): run a batch job and a streaming job in parallel over the same data; the batch layer is authoritative, the stream gives low-latency approximate results; a serving layer merges. Pro: each layer tuned independently. Con: two codebases, two bug surfaces, two on-call rotations.
Kappa (Kreps 2014): only a streaming layer; to recompute history, replay the log from the beginning. Pro: one codebase. Con: replay time for a year of events can be long; state-heavy jobs eat RAM.
| Aspect | Lambda | Kappa |
|---|---|---|
| Codebases | 2 (batch + stream) | 1 |
| Backfill | Re-run batch job | Replay from log |
| State size | Lower per-job | Higher (all state in stream) |
| Typical use | Legacy analytics | Modern Flink/Kafka stacks |
When to pick what
- Daily business reports, ML feature tables over last 30 days: pure batch (Spark on S3 or BigQuery). Simplest, cheapest, correct.
- Real-time dashboards, alerting, sub-minute personalization: streaming (Flink / Kafka Streams) with event-time windows.
- Correct long-tail + fresh short-tail (fraud detection, abuse moderation): Kappa with generous late-event reprocessing, idempotent sink.
- Training-data pipelines for LLMs: batch is typical; use streaming only for continuous evals or live safety signals.
OpenAI-specific
Usage metering and token billing are a classic Kappa use case: Kafka topic of every inference event, Flink job aggregates tokens per (org, model, minute), results land in a ledger DB keyed by idempotent (request_id, event_kind). Replay the topic to rebuild ledgers after a bug.
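The metering pipeline above can be sketched end to end: dedupe on the (request_id, event_kind) key, aggregate tokens per (org, model, minute), and note that replaying the topic (even with duplicates) rebuilds the identical ledger. Event schema and field names here are assumptions for illustration.

```python
from collections import defaultdict

# a durable topic of inference events (assumed schema, for illustration)
topic = [
    {"request_id": "r1", "event_kind": "completion", "org": "acme", "model": "m1", "tokens": 120, "minute": 0},
    {"request_id": "r2", "event_kind": "completion", "org": "acme", "model": "m1", "tokens": 30,  "minute": 0},
    {"request_id": "r1", "event_kind": "completion", "org": "acme", "model": "m1", "tokens": 120, "minute": 0},  # duplicate
]

def build_ledger(events):
    seen = set()                   # idempotency key: (request_id, event_kind)
    ledger = defaultdict(int)      # (org, model, minute) -> token total
    for e in events:
        key = (e["request_id"], e["event_kind"])
        if key in seen:
            continue               # duplicate delivery: billed exactly once
        seen.add(key)
        ledger[(e["org"], e["model"], e["minute"])] += e["tokens"]
    return dict(ledger)

# replaying the topic always rebuilds the same ledger - the Kappa recovery story
assert build_ledger(topic) == build_ledger(topic + topic)
print(build_ledger(topic))  # {('acme', 'm1', 0): 150}
```

Because the aggregation is a pure, idempotent fold over the log, fixing a bug means redeploying the job and replaying from offset 0.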
Anthropic-specific
Safety signal pipelines (refusal rates, jailbreak attempts, policy hits) need both low latency and exact historical replay. Anthropic answers: stream processing for alerting + batch re-aggregation nightly for the safety team's gold dashboards. Mention event-time alignment across the two paths.