The log as universal primitive

Kafka's central insight (Kreps et al. 2011, popularized by Kleppmann's DDIA Ch.11): an append-only log is a general-purpose primitive that unifies messaging, storage, and state replication. Producers append; consumers read at their own offsets; the log retains records for a configured window.

  • Partitioned: one topic = N partitions; each partition is an ordered log with a single leader (backed by replication).
  • Immutable: once written, never rewritten. Enables replay, time-travel debugging, and multiple consumers.
  • Bounded retention: a 7-day default is typical; tiered storage (Confluent, Redpanda) extends retention to months in object storage at roughly 10× lower $/GB.
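The model above — one append-only sequence, many independent readers — fits in a few lines. A toy single-partition log with per-group offsets (a sketch of the abstraction, not Kafka's actual implementation):

```python
# One partition: an append-only list plus per-consumer-group offsets.
log: list[bytes] = []
offsets: dict[str, int] = {}   # consumer group -> next offset to read

def produce(record: bytes) -> int:
    log.append(record)          # append-only: records are never rewritten
    return len(log) - 1         # the record's offset

def poll(group: str, max_records: int = 10) -> list[bytes]:
    start = offsets.get(group, 0)
    batch = log[start:start + max_records]
    offsets[group] = start + len(batch)   # each group tracks its own position
    return batch

produce(b"click-1"); produce(b"click-2")
assert poll("X") == [b"click-1", b"click-2"]
assert poll("Y") == [b"click-1", b"click-2"]  # independent replay per group
assert poll("X") == []                        # X is caught up
```

Resetting `offsets["Y"] = 0` is all "replay from the beginning" means — which is why immutability buys time-travel debugging for free.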

Source cross-reference

DDIA Ch.10 covers batch (MapReduce, Dataflow), Ch.11 covers streams (logs, exactly-once, event time). Both must be read together.

flowchart LR
    P1[Producer A] --> T[[Topic: clicks<br/>partition 0]]
    P2[Producer B] --> T2[[Topic: clicks<br/>partition 1]]
    T --> C1[Consumer group X<br/>offset=1234]
    T --> C2[Consumer group Y<br/>offset=98]
    T2 --> C1
    T2 --> C2
    C1 --> S1[(Sink: warehouse)]
    C2 --> S2[(Real-time dashboard)]

Concrete throughput: a single Kafka broker on an r6i.4xlarge with NVMe can sustain on the order of 1 GB/s in plus 1 GB/s out across all its partitions. A LinkedIn-scale cluster (Kafka's original home) handles some 7 trillion messages/day.

Batch processing and MapReduce

MapReduce (Dean & Ghemawat 2004) is the grandparent of modern dataflow: map a user-provided function over input partitions, shuffle by key, reduce per key. It reads bounded input, writes bounded output, and assumes any task can be retried.

Key properties:

  • Deterministic: same input → same output. Enables retry on task failure without polluting results.
  • Materialize intermediate state to disk/HDFS. Slow but robust.
  • Latency: minutes to hours. Never a user-facing service.
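The map/shuffle/reduce phases above can be sketched in-memory (a toy word count, not the distributed framework):

```python
from collections import defaultdict

def map_fn(line):
    # map: emit (word, 1) for each word in one input record
    for word in line.split():
        yield word.lower(), 1

def shuffle(pairs):
    # shuffle: group all values by key, as the framework does between phases
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_fn(key, values):
    # reduce: per-key aggregation; deterministic, so safe to retry
    return key, sum(values)

lines = ["the log the log", "append only log"]
pairs = (p for line in lines for p in map_fn(line))
result = dict(reduce_fn(k, vs) for k, vs in shuffle(pairs).items())
# result == {"the": 2, "log": 3, "append": 1, "only": 1}
```

Determinism is what makes the retry story work: re-running `reduce_fn` on the same grouped values can never change the answer.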

Modern successors — Spark, Flink's batch mode, BigQuery — keep intermediate results in memory instead of materializing every stage to disk, cutting latency roughly 5×. For LLM training-data curation or daily user analytics, you want Spark/BigQuery, not raw MapReduce.

Stream processing primitives

Three operator families:

  1. Stateless transforms: filter, map. Trivial.
  2. Windowed aggregations: count per 1-minute tumbling window, per 5-minute sliding window, per session. State = window → running aggregate.
  3. Stream-stream joins: join two event streams within a time bound. State = buffered side × join-key. Most memory-hungry.
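Family 2 is worth being able to write on a whiteboard. A minimal tumbling-window count, with state as a plain window-start → aggregate map (window size illustrative):

```python
from collections import defaultdict

WINDOW_MS = 60_000  # 1-minute tumbling windows

# state: window start time -> running count (family 2 above)
counts: dict[int, int] = defaultdict(int)

def on_event(event_time_ms: int) -> None:
    # tumbling windows partition time: each event lands in exactly one window
    window_start = event_time_ms - (event_time_ms % WINDOW_MS)
    counts[window_start] += 1

for t in [5_000, 30_000, 61_000, 125_000]:
    on_event(t)
# counts == {0: 2, 60_000: 1, 120_000: 1}
```

A sliding window differs only in that each event updates every window it overlaps; a session window keys state by (user, gap-separated burst) instead of fixed boundaries.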

State must be durable (survive an operator crash) and recoverable from the log. Flink stores state in RocksDB with periodic checkpoints to S3; Kafka Streams uses a local RocksDB instance plus a changelog topic in Kafka itself.

Event time vs processing time, watermarks

Event time = when it happened in the real world; processing time = when your system sees it. A mobile client goes offline on a plane, sends events 6 hours later — event time is from 6 hours ago, processing time is now.

Correct streaming must aggregate by event time. But event time arrivals are out of order: how long do you wait? Enter watermarks (Akidau et al. 2015 Dataflow paper): a watermark W means "no more events with timestamp < W are expected." Windows close when the watermark passes their end.

  • Watermark too aggressive → late events are dropped, results are wrong.
  • Watermark too conservative → windows close slowly, latency shoots up.
  • In practice: heuristic watermark (percentile of observed lag) + allowed lateness (emit an update when late events arrive) + dead letter for the truly stale.
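A heuristic watermark of the simplest kind — max observed event time minus a fixed lag bound — is enough to show the mechanics (lag bound and window size are illustrative values, not from the text):

```python
LAG_MS = 10_000      # assumed lag bound: "events arrive at most 10 s late"
WINDOW_MS = 60_000

max_event_time = 0
open_windows = {0, 60_000, 120_000}  # window start times with buffered state

def advance(event_time_ms: int) -> list[int]:
    """Observe one event; return the windows the new watermark closes."""
    global max_event_time
    max_event_time = max(max_event_time, event_time_ms)
    watermark = max_event_time - LAG_MS      # "no events with ts < watermark expected"
    closed = [w for w in open_windows if w + WINDOW_MS <= watermark]
    for w in closed:
        open_windows.discard(w)              # emit the final aggregate for w here
    return sorted(closed)

advance(65_000)                              # watermark 55_000: nothing closes yet
assert advance(130_000) == [0, 60_000]       # watermark 120_000 closes the first two
```

An event with timestamp 50_000 arriving now would be late; allowed lateness means keeping the closed window's state around a bit longer and emitting an updated result instead of dropping it.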

Exactly-once (and why it is a lie)

End-to-end "exactly-once" is achievable only if every side effect is idempotent or transactional with the offset commit. DDIA's framing: exactly-once = at-least-once delivery + idempotent processing.

Mechanisms:

  • Transactional offsets: Kafka 0.11+ supports transactions that atomically commit output records and consumer offsets. Works for Kafka-to-Kafka.
  • Sink idempotency: writes include a unique record ID; the sink dedupes. S3 conditional writes, DynamoDB ConditionExpression, Postgres ON CONFLICT DO NOTHING.
  • Two-phase commit to an external sink: Flink's TwoPhaseCommitSinkFunction — expensive, but works for JDBC and custom sinks.
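The sink-idempotency mechanism is the one to demo. A sketch using sqlite3 as a stand-in for Postgres (same `ON CONFLICT DO NOTHING` upsert syntax; schema is illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (record_id TEXT PRIMARY KEY, payload TEXT)")

def write(record_id: str, payload: str) -> None:
    # ON CONFLICT DO NOTHING: duplicate deliveries (at-least-once) dedupe here,
    # turning at-least-once delivery into effectively-once results
    db.execute(
        "INSERT INTO events VALUES (?, ?) ON CONFLICT(record_id) DO NOTHING",
        (record_id, payload),
    )

write("req-1", "a")
write("req-1", "a")  # redelivery after a consumer retry: no duplicate row
count = db.execute("SELECT COUNT(*) FROM events").fetchone()[0]
assert count == 1
```

The unique `record_id` must come from the producer (or be derived deterministically from the event), not generated at write time — otherwise each retry mints a "new" record and the dedupe is vacuous.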

Anti-pattern

Claiming "exactly-once" when your sink is a REST endpoint with no idempotency key. You have at-most-once (if you commit offsets before the HTTP call) or at-least-once (if after); you do not have exactly-once. Be honest with the interviewer.

Lambda vs Kappa architectures

Lambda (Marz 2011): run a batch job and a streaming job in parallel over the same data; the batch layer is authoritative, the stream gives low-latency approximate results; a serving layer merges. Pro: each layer tuned independently. Con: two codebases, two bug surfaces, two on-call rotations.

Kappa (Kreps 2014): only a streaming layer; to recompute history, replay the log from the beginning. Pro: one codebase. Con: replay time for a year of events can be long; state-heavy jobs eat RAM.

Aspect         Lambda               Kappa
Codebases      2 (batch + stream)   1
Backfill       Re-run batch job     Replay from log
State size     Lower per-job        Higher (all state in stream)
Typical use    Legacy analytics     Modern Flink/Kafka stacks

When to pick what

  1. Daily business reports, ML feature tables over last 30 days: pure batch (Spark on S3 or BigQuery). Simplest, cheapest, correct.
  2. Real-time dashboards, alerting, sub-minute personalization: streaming (Flink / Kafka Streams) with event-time windows.
  3. Correct long-tail + fresh short-tail (fraud detection, abuse moderation): Kappa with generous late-event reprocessing, idempotent sink.
  4. Training-data pipelines for LLMs: batch is typical; use streaming only for continuous evals or live safety signals.

OpenAI-specific

Usage metering and token billing are a classic Kappa use case: Kafka topic of every inference event, Flink job aggregates tokens per (org, model, minute), results land in a ledger DB keyed by idempotent (request_id, event_kind). Replay the topic to rebuild ledgers after a bug.
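The core of that Flink job fits in a dictionary. A sketch of the aggregation (field names are illustrative, not an actual schema): dedupe on (request_id, event_kind), then accumulate tokens per (org, model, minute).

```python
from collections import defaultdict

seen: set[tuple[str, str]] = set()              # (request_id, event_kind) dedupe state
tokens: dict[tuple[str, str, int], int] = defaultdict(int)  # (org, model, minute) -> tokens

def on_inference_event(org: str, model: str, ts_ms: int,
                       request_id: str, event_kind: str, n_tokens: int) -> None:
    key = (request_id, event_kind)
    if key in seen:          # replayed or duplicate event: idempotent, no double-billing
        return
    seen.add(key)
    tokens[(org, model, ts_ms // 60_000)] += n_tokens

on_inference_event("acme", "m1", 5_000, "r1", "completion", 100)
on_inference_event("acme", "m1", 5_000, "r1", "completion", 100)  # replay: ignored
on_inference_event("acme", "m1", 70_000, "r2", "completion", 50)
assert tokens[("acme", "m1", 0)] == 100
```

The rebuild-after-a-bug story is then: truncate the ledger, reset the consumer offset to 0, and let the same idempotent aggregation replay the topic.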

Anthropic-specific

Safety signal pipelines (refusal rates, jailbreak attempts, policy hits) need both low latency and exact historical replay. The answer here: stream processing for alerting, plus a nightly batch re-aggregation for the safety team's gold dashboards. Mention event-time alignment across the two paths.
