The log as universal primitive

Kafka's central insight (Kreps et al. 2011, popularized by Kleppmann's DDIA Ch.11): an append-only log is a general-purpose primitive that unifies messaging, storage, and state replication. Producers append; consumers read at their own offsets; the log retains records for a configured window.

  • Partitioned: one topic = N partitions; each partition is an ordered log with a single leader (backed by replication).
  • Immutable: once written, never rewritten. Enables replay, time-travel debugging, and multiple consumers.
  • Bounded retention: a 7-day default is typical; tiered storage (Confluent, Redpanda) extends retention to months in object storage at roughly 10× lower $/GB.
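The model above — one append-only sequence, many independent readers — fits in a few lines. A toy single-partition log with per-group offsets (a sketch of the abstraction, not Kafka's actual implementation):

```python
# One partition: an append-only list plus per-consumer-group offsets.
log: list[bytes] = []
offsets: dict[str, int] = {}   # consumer group -> next offset to read

def produce(record: bytes) -> int:
    log.append(record)          # append-only: records are never rewritten
    return len(log) - 1         # the record's offset

def poll(group: str, max_records: int = 10) -> list[bytes]:
    start = offsets.get(group, 0)
    batch = log[start:start + max_records]
    offsets[group] = start + len(batch)   # each group tracks its own position
    return batch

produce(b"click-1"); produce(b"click-2")
assert poll("X") == [b"click-1", b"click-2"]
assert poll("Y") == [b"click-1", b"click-2"]  # independent replay per group
assert poll("X") == []                        # X is caught up
```

Resetting `offsets["Y"] = 0` is all "replay from the beginning" means — which is why immutability buys time-travel debugging for free.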

Source cross-reference

DDIA Ch.10 covers batch (MapReduce, Dataflow), Ch.11 covers streams (logs, exactly-once, event time). Both must be read together.

flowchart LR
    P1[Producer A] --> T[[Topic: clicks<br/>partition 0]]
    P2[Producer B] --> T2[[Topic: clicks<br/>partition 1]]
    T --> C1[Consumer group X<br/>offset=1234]
    T --> C2[Consumer group Y<br/>offset=98]
    T2 --> C1
    T2 --> C2
    C1 --> S1[(Sink: warehouse)]
    C2 --> S2[(Real-time dashboard)]

Concrete throughput: a single Kafka broker on an r6i.4xlarge with NVMe can sustain on the order of 1 GB/s in plus 1 GB/s out across all its partitions. A LinkedIn-scale cluster (Kafka's original home) handles some 7 trillion messages/day.

Batch processing and MapReduce

MapReduce (Dean & Ghemawat 2004) is the grandparent of modern dataflow: map a user-provided function over input partitions, shuffle by key, reduce per key. It reads bounded input, writes bounded output, and assumes any task can be retried.

Key properties:

  • Deterministic: same input → same output. Enables retry on task failure without polluting results.
  • Materialize intermediate state to disk/HDFS. Slow but robust.
  • Latency: minutes to hours. Never a user-facing service.
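The map/shuffle/reduce phases above can be sketched in-memory (a toy word count, not the distributed framework):

```python
from collections import defaultdict

def map_fn(line):
    # map: emit (word, 1) for each word in one input record
    for word in line.split():
        yield word.lower(), 1

def shuffle(pairs):
    # shuffle: group all values by key, as the framework does between phases
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_fn(key, values):
    # reduce: per-key aggregation; deterministic, so safe to retry
    return key, sum(values)

lines = ["the log the log", "append only log"]
pairs = (p for line in lines for p in map_fn(line))
result = dict(reduce_fn(k, vs) for k, vs in shuffle(pairs).items())
# result == {"the": 2, "log": 3, "append": 1, "only": 1}
```

Determinism is what makes the retry story work: re-running `reduce_fn` on the same grouped values can never change the answer.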

Modern successors — Spark, Flink's batch mode, BigQuery — keep intermediate results in memory instead of materializing every stage to disk, cutting latency roughly 5×. For LLM training-data curation or daily user analytics, you want Spark/BigQuery, not raw MapReduce.

Stream processing primitives

Three operator families:

  1. Stateless transforms: filter, map. Trivial.
  2. Windowed aggregations: count per 1-minute tumbling window, per 5-minute sliding window, per session. State = window → running aggregate.
  3. Stream-stream joins: join two event streams within a time bound. State = buffered side × join-key. Most memory-hungry.
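Family 2 is worth being able to write on a whiteboard. A minimal tumbling-window count, with state as a plain window-start → aggregate map (window size illustrative):

```python
from collections import defaultdict

WINDOW_MS = 60_000  # 1-minute tumbling windows

# state: window start time -> running count (family 2 above)
counts: dict[int, int] = defaultdict(int)

def on_event(event_time_ms: int) -> None:
    # tumbling windows partition time: each event lands in exactly one window
    window_start = event_time_ms - (event_time_ms % WINDOW_MS)
    counts[window_start] += 1

for t in [5_000, 30_000, 61_000, 125_000]:
    on_event(t)
# counts == {0: 2, 60_000: 1, 120_000: 1}
```

A sliding window differs only in that each event updates every window it overlaps; a session window keys state by (user, gap-separated burst) instead of fixed boundaries.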

State must be durable (survive an operator crash) and recoverable from the log. Flink stores state in RocksDB with periodic checkpoints to S3; Kafka Streams uses a local RocksDB instance plus a changelog topic in Kafka itself.

Event time vs processing time, watermarks

Event time = when it happened in the real world; processing time = when your system sees it. A mobile client goes offline on a plane, sends events 6 hours later — event time is from 6 hours ago, processing time is now.

Correct streaming must aggregate by event time. But event time arrivals are out of order: how long do you wait? Enter watermarks (Akidau et al. 2015 Dataflow paper): a watermark W means "no more events with timestamp < W are expected." Windows close when the watermark passes their end.

  • Watermark too aggressive → late events are dropped, results are wrong.
  • Watermark too conservative → windows close slowly, latency shoots up.
  • In practice: heuristic watermark (percentile of observed lag) + allowed lateness (emit an update when late events arrive) + dead letter for the truly stale.
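A heuristic watermark of the simplest kind — max observed event time minus a fixed lag bound — is enough to show the mechanics (lag bound and window size are illustrative values, not from the text):

```python
LAG_MS = 10_000      # assumed lag bound: "events arrive at most 10 s late"
WINDOW_MS = 60_000

max_event_time = 0
open_windows = {0, 60_000, 120_000}  # window start times with buffered state

def advance(event_time_ms: int) -> list[int]:
    """Observe one event; return the windows the new watermark closes."""
    global max_event_time
    max_event_time = max(max_event_time, event_time_ms)
    watermark = max_event_time - LAG_MS      # "no events with ts < watermark expected"
    closed = [w for w in open_windows if w + WINDOW_MS <= watermark]
    for w in closed:
        open_windows.discard(w)              # emit the final aggregate for w here
    return sorted(closed)

advance(65_000)                              # watermark 55_000: nothing closes yet
assert advance(130_000) == [0, 60_000]       # watermark 120_000 closes the first two
```

An event with timestamp 50_000 arriving now would be late; allowed lateness means keeping the closed window's state around a bit longer and emitting an updated result instead of dropping it.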

Exactly-once (and why it is a lie)

End-to-end "exactly-once" is achievable only if every side effect is idempotent or transactional with the offset commit. DDIA's framing: exactly-once = at-least-once delivery + idempotent processing.

Mechanisms:

  • Transactional offsets: Kafka 0.11+ supports transactions that atomically commit output records and consumer offsets. Works for Kafka-to-Kafka.
  • Sink idempotency: writes include a unique record ID; the sink dedupes. S3 conditional writes, DynamoDB ConditionExpression, Postgres ON CONFLICT DO NOTHING.
  • Two-phase commit to an external sink: Flink's TwoPhaseCommitSinkFunction — expensive, but works for JDBC and custom sinks.
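The sink-idempotency mechanism is the one to demo. A sketch using sqlite3 as a stand-in for Postgres (same `ON CONFLICT DO NOTHING` upsert syntax; schema is illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (record_id TEXT PRIMARY KEY, payload TEXT)")

def write(record_id: str, payload: str) -> None:
    # ON CONFLICT DO NOTHING: duplicate deliveries (at-least-once) dedupe here,
    # turning at-least-once delivery into effectively-once results
    db.execute(
        "INSERT INTO events VALUES (?, ?) ON CONFLICT(record_id) DO NOTHING",
        (record_id, payload),
    )

write("req-1", "a")
write("req-1", "a")  # redelivery after a consumer retry: no duplicate row
count = db.execute("SELECT COUNT(*) FROM events").fetchone()[0]
assert count == 1
```

The unique `record_id` must come from the producer (or be derived deterministically from the event), not generated at write time — otherwise each retry mints a "new" record and the dedupe is vacuous.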

Anti-pattern

Claiming "exactly-once" when your sink is a REST endpoint with no idempotency key. You have at-most-once (if you commit offsets before the HTTP call) or at-least-once (if after); you do not have exactly-once. Be honest with the interviewer.

Lambda vs Kappa architectures

Lambda (Marz 2011): run a batch job and a streaming job in parallel over the same data; the batch layer is authoritative, the stream gives low-latency approximate results; a serving layer merges. Pro: each layer tuned independently. Con: two codebases, two bug surfaces, two on-call rotations.

Kappa (Kreps 2014): only a streaming layer; to recompute history, replay the log from the beginning. Pro: one codebase. Con: replay time for a year of events can be long; state-heavy jobs eat RAM.

Aspect         Lambda               Kappa
Codebases      2 (batch + stream)   1
Backfill       Re-run batch job     Replay from log
State size     Lower per-job        Higher (all state in stream)
Typical use    Legacy analytics     Modern Flink/Kafka stacks

When to pick what

  1. Daily business reports, ML feature tables over last 30 days: pure batch (Spark on S3 or BigQuery). Simplest, cheapest, correct.
  2. Real-time dashboards, alerting, sub-minute personalization: streaming (Flink / Kafka Streams) with event-time windows.
  3. Correct long-tail + fresh short-tail (fraud detection, abuse moderation): Kappa with generous late-event reprocessing, idempotent sink.
  4. Training-data pipelines for LLMs: batch is typical; use streaming only for continuous evals or live safety signals.

OpenAI-specific

Usage metering and token billing are a classic Kappa use case: Kafka topic of every inference event, Flink job aggregates tokens per (org, model, minute), results land in a ledger DB keyed by idempotent (request_id, event_kind). Replay the topic to rebuild ledgers after a bug.
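The core of that Flink job fits in a dictionary. A sketch of the aggregation (field names are illustrative, not an actual schema): dedupe on (request_id, event_kind), then accumulate tokens per (org, model, minute).

```python
from collections import defaultdict

seen: set[tuple[str, str]] = set()              # (request_id, event_kind) dedupe state
tokens: dict[tuple[str, str, int], int] = defaultdict(int)  # (org, model, minute) -> tokens

def on_inference_event(org: str, model: str, ts_ms: int,
                       request_id: str, event_kind: str, n_tokens: int) -> None:
    key = (request_id, event_kind)
    if key in seen:          # replayed or duplicate event: idempotent, no double-billing
        return
    seen.add(key)
    tokens[(org, model, ts_ms // 60_000)] += n_tokens

on_inference_event("acme", "m1", 5_000, "r1", "completion", 100)
on_inference_event("acme", "m1", 5_000, "r1", "completion", 100)  # replay: ignored
on_inference_event("acme", "m1", 70_000, "r2", "completion", 50)
assert tokens[("acme", "m1", 0)] == 100
```

The rebuild-after-a-bug story is then: truncate the ledger, reset the consumer offset to 0, and let the same idempotent aggregation replay the topic.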

Anthropic-specific

Safety signal pipelines (refusal rates, jailbreak attempts, policy hits) need both low latency and exact historical replay. The answer here: stream processing for alerting, plus a nightly batch re-aggregation for the safety team's gold dashboards. Mention event-time alignment across the two paths.
