Source cross-reference

Primary: Khang Pham Ch.2 (core ML / embeddings), Ch.3 (YouTube video recs), Ch.4 (feed ranking), Ch.7 (Airbnb similar listings), Ch.8 (LinkedIn personalized search). Cross-reference with Chip Huyen Ch.6 for offline eval metrics (NDCG, MAP).

The three-stage funnel

Every large-scale recommender at Google, Meta, LinkedIn, Pinterest, and Airbnb uses the same shape: billions of items -> thousands -> hundreds -> tens across three stages. The funnel exists because no single model can be both cheap enough to score a billion items and smart enough to rank ten of them.

flowchart LR
  corpus[("Corpus<br/>10^9 items")] --> cg["Candidate generation<br/>many sources in parallel<br/>two-tower · collab filtering · popularity"]
  cg --> merge["Merge & dedup<br/>~1,000 items"]
  merge --> rank["Ranking<br/>heavy model<br/>wide&deep / DLRM / transformer"]
  rank --> rerank["Re-rank<br/>diversity, business rules<br/>LLM-assisted"]
  rerank --> ui["UI · top 10-50"]

Per-stage latency budget for a 200 ms end-to-end feed: retrieval ≤ 40 ms, ranking ≤ 80 ms, re-rank ≤ 20 ms, network + render ≤ 60 ms. Nearly every successful design you will see in an interview fits this pattern.

Candidate generation (retrieval)

Goal: from 10^9 items, return O(10^3) that are worth ranking. Three families:

  1. Two-tower / dual encoder: one tower encodes the user (history, demographics), one encodes the item; train with sampled-softmax / in-batch negatives so that <user, item> ≈ score. Serve by ANN search (ScaNN, HNSW, FAISS) on the item side. This is the YouTube recs backbone (Ch.3).
  2. Collaborative filtering / matrix factorisation: older but still strong for cold-start-free domains. Ch.2 covers the primer.
  3. Rule-based / heuristic: popular-in-your-country, recency, "friends' recent posts". Cheap safety net; LinkedIn's feed (Ch.4) ensembles 10+ such generators.
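The in-batch-negatives trick from family 1 can be sketched numerically: within a batch, each user's own item is the positive and the other batch items double as free negatives. A minimal NumPy sketch (batch size, embedding dim, and temperature are illustrative assumptions):

```python
import numpy as np

def in_batch_softmax_loss(user_emb, item_emb, temperature=0.05):
    """In-batch sampled softmax: user i's positive is item row i;
    the other B-1 items in the batch serve as free negatives."""
    # L2-normalise so <user, item> is cosine similarity
    u = user_emb / np.linalg.norm(user_emb, axis=1, keepdims=True)
    v = item_emb / np.linalg.norm(item_emb, axis=1, keepdims=True)
    logits = u @ v.T / temperature               # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # diagonal = positives

rng = np.random.default_rng(0)
B, d = 32, 64                                    # illustrative batch / dim
users = rng.normal(size=(B, d))
items = rng.normal(size=(B, d))
loss = in_batch_softmax_loss(users, items)
```

Perfectly aligned towers drive the loss toward zero; random towers sit near log(B). A real system trains both towers jointly and adds the hard negatives discussed below.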

Key design choices:

  • Negative sampling: random negatives are too easy; hard negatives (same session, almost clicked) dominate modern two-tower. In-batch negatives are free.
  • Embedding dim: 64-256 is typical. Larger embeddings buy accuracy at the cost of latency and storage.
  • ANN index: HNSW beats flat (brute-force) search at 10^6+ items; quantised ScaNN beats HNSW at 10^8+ with a ~4x memory cut.
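For intuition on what the ANN index is approximating: exact maximum-inner-product retrieval is the correctness baseline, and HNSW/ScaNN trade a little recall for 10-100x lower cost. A NumPy sketch of the exact baseline (index size and dims are illustrative, far below the 10^8 regime):

```python
import numpy as np

def exact_top_k(query, item_index, k=1000):
    """Exact maximum-inner-product retrieval: the correctness baseline
    that an ANN index (HNSW, ScaNN) approximates at far lower cost."""
    scores = item_index @ query               # (N,) dot products
    top = np.argpartition(-scores, k)[:k]     # O(N) partial selection
    return top[np.argsort(-scores[top])]      # sort only the top k

rng = np.random.default_rng(1)
index = rng.normal(size=(100_000, 64))        # stand-in for a real item index
q = rng.normal(size=64)                       # user-tower output
candidates = exact_top_k(q, index, k=1000)
```

Running this exact scan on a held-out query set and measuring recall@k of the ANN index against it is the standard way to tune HNSW's M/efSearch knobs.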

Ranking models

Ranking scores each of the ~1,000 candidates with a heavy model that can consume hundreds of features.

  • Wide & Deep (Cheng 2016, Google Play): wide linear arm memorises sparse crosses, deep MLP generalises. Still the simplest respectable baseline. Ch.5 shows the ad-CTR version.
  • DLRM (Meta, 2019): embedding tables for categoricals -> feature-interaction layer (dot products) -> MLP. Workhorse of Meta ranking; the open-source code is a great reference. Memory is dominated by embedding tables (often >100 GB).
  • DeepFM / DCN-V2: explicit cross-feature networks; replace the wide arm.
  • Transformer rankers (SASRec, BST): sequence-aware, handle long user history natively; Alibaba and Pinterest have public results.
  • Multi-task heads: one tower predicts click, watch-time, share, report; combine with a learned or hand-tuned scalarisation. Ch.3's YouTube design uses this explicitly.
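The multi-task scalarisation in the last bullet can be sketched as a weighted blend of head outputs. The weights, head names, and log-squash are illustrative assumptions, not any published system's values:

```python
import numpy as np

def combine_heads(p_click, expected_watch_s, p_share, p_report,
                  w=(1.0, 0.8, 2.0, -5.0)):
    """Hand-tuned scalarisation of multi-task heads into one ranking score.
    Watch-time enters in seconds, so it is log-squashed to keep one head
    from dominating; report probability is a negative objective."""
    return (w[0] * p_click
            + w[1] * np.log1p(expected_watch_s)  # diminishing returns
            + w[2] * p_share
            + w[3] * p_report)                   # penalise likely reports

score = combine_heads(p_click=0.12, expected_watch_s=45.0,
                      p_share=0.01, p_report=0.001)
```

In practice the weights are either swept via A/B tests or learned; the key interview point is that the scalarisation, not any single head, defines what "ranking" optimises.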

Loss functions in one line

Pointwise (log-loss) for CTR; pairwise (BPR / LambdaRank) when you have implicit comparisons; listwise (softmax-CE / LambdaMART) when NDCG is the target. Airbnb's Aerosolve (Ch.7) uses pairwise hinge loss.
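A NumPy sketch of the pairwise case: BPR pushes each positive (clicked) item above a sampled negative from the same user via -log sigmoid of the score gap (toy scores below are illustrative):

```python
import numpy as np

def bpr_loss(pos_scores, neg_scores):
    """Bayesian Personalised Ranking: mean of -log sigmoid(s_pos - s_neg),
    i.e. push each clicked item above an unclicked one from the same user."""
    diff = pos_scores - neg_scores
    return np.mean(np.log1p(np.exp(-diff)))   # stable form of -log sigmoid

# toy example: positives already outscore their paired negatives
pos = np.array([2.0, 1.5, 3.0])
neg = np.array([0.5, 1.0, 0.2])
loss = bpr_loss(pos, neg)
```

Note the loss only sees score *differences*, which is exactly why pairwise objectives fit implicit-feedback data where absolute click probabilities are unreliable.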

Re-ranking & business rules

The last stage applies constraints the ranker cannot see inside its objective:

  • Diversity: MMR (maximal marginal relevance), DPP (determinantal point process), per-category caps (e.g. "no more than 3 videos from the same creator in top 10"). Facebook's feed explicitly enforces creator diversity.
  • Freshness: time-decay boosts for new items; counteracts popularity bias.
  • Business rules: policy filters (e.g. must be available in user's country), ad-slot blending, creator ecosystem rules.
  • Calibration: rescale scores so they reflect probability, important when scores blend across systems (ads + organic).
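The MMR bullet above is simple enough to implement in a few lines: greedily pick the item maximising lam·relevance minus (1-lam)·similarity-to-already-picked. A minimal sketch over a precomputed cosine-similarity matrix (sizes and lam are illustrative):

```python
import numpy as np

def mmr(relevance, sim, k=10, lam=0.7):
    """Maximal marginal relevance re-rank: lam trades relevance
    against redundancy with items already selected."""
    picked, remaining = [], set(range(len(relevance)))
    while remaining and len(picked) < k:
        def mmr_score(i):
            redundancy = max((sim[i, j] for j in picked), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        picked.append(best)
        remaining.discard(best)
    return picked

rng = np.random.default_rng(2)
emb = rng.normal(size=(50, 16))               # toy item embeddings
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
sim = emb @ emb.T                             # cosine similarity matrix
rel = rng.uniform(size=50)                    # ranker scores
order = mmr(rel, sim, k=10)
```

The greedy loop is O(k·n) with the pairwise similarities precomputed, which is why MMR is cheap enough for the re-rank stage's ~20 ms budget; DPPs buy better set-level diversity at higher cost.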

Case patterns: YouTube, feed, Airbnb

Three canonical shapes you should be able to draw in 2 minutes:

  • YouTube (Ch.3): two-tower retrieval from billions of videos with user watch-history + context; ranking is a multi-task model optimising click + watch-time + satisfaction survey; hard negatives are sampled from the same session. 20+ retrieval sources merged.
  • Feed ranking — Facebook/LinkedIn (Ch.4, Ch.8): candidates from "friends' posts", "pages you follow", "groups", "sponsored" streams; ranker is a DLRM-style multi-task model predicting click, dwell, like, share, hide. LinkedIn layers a second-pass ranking after re-rank to optimise recruiter value.
  • Airbnb search (Ch.7): user location + query geohash + listing features; a two-tower model produces listing embeddings trained on "user viewed A then booked B" pairs; the ranker was gradient-boosted trees for years and is now wide & deep. Seasonality and host-cancellation risk are explicit features.

LLM-assisted ranking

Interview-relevant since 2024:

  • LLM as a feature: precompute item descriptions via an LLM, embed with a sentence encoder; feed as extra features to the ranker. No inference-path LLM call -> cheap and safe.
  • LLM as a re-ranker: give a top-20 shortlist plus user context to an LLM, have it output a permutation or score. 200-500 ms extra latency; only justified for high-value surfaces (e.g. flagship recommendations home).
  • LLM-generated candidates: "user is planning a trip to Kyoto, suggest experiences" -> LLM proposes 10 concepts -> resolved to real items via retrieval. Used by Airbnb and Spotify in 2024 launches.

Anti-pattern: put an LLM on every query

An LLM call at 400 ms p50 / 2 s p99 is infeasible inside a 100 ms feed budget at scale. Use LLMs only (a) offline to enrich items, (b) for the very top slots, (c) behind a caching layer with an aggressive TTL. Always keep a non-LLM fallback path.
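Rule (c) plus the fallback path can be sketched as a thin wrapper: serve a cached LLM ordering when fresh, otherwise attempt the call and fall back to the ranker's order on any failure. The class and `llm_fn` hook are hypothetical illustrations, not a real API:

```python
import time

class CachedReranker:
    """LLM re-ranker behind a TTL cache with a non-LLM fallback.
    Hypothetical sketch: `llm_fn` stands in for the real LLM call."""

    def __init__(self, llm_fn, ttl_s=300.0):
        self.llm_fn, self.ttl_s = llm_fn, ttl_s
        self.cache = {}  # cache key -> (expires_at, ordering)

    def rerank(self, key, candidates):
        now = time.monotonic()
        hit = self.cache.get(key)
        if hit and hit[0] > now:
            return hit[1]                      # fresh cached LLM ordering
        try:
            order = self.llm_fn(candidates)    # may time out or fail
            self.cache[key] = (now + self.ttl_s, order)
            return order
        except Exception:
            return candidates                  # fallback: keep ranker order

# happy path: the (toy) LLM reverses the shortlist
r = CachedReranker(llm_fn=lambda c: list(reversed(c)))
out = r.rerank("user-42", [1, 2, 3])

# failure path: the call blows the budget, so the ranker order survives
def flaky(c):
    raise TimeoutError("p99 blew the budget")

fallback = CachedReranker(llm_fn=flaky).rerank("user-7", [1, 2, 3])
```

In production the `try` would carry an explicit deadline and the cache key would hash the user segment plus shortlist, so one LLM call amortises across many similar requests within the TTL.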

Numbers to remember

  • Retrieval output: 500-2,000 candidates.
  • Ranking input: up to 1,000, output: top ~50.
  • Typical two-tower embedding: 64-256 dims; item index 10^8 items -> ANN ~10 ms p99 with HNSW (M=32, efSearch=128).
  • Feed end-to-end latency budget: 100-200 ms; ranking model ≤ 80 ms on CPU with INT8 quantisation, or ~20 ms on a T4.
  • Offline metrics: NDCG@10 ≥ 0.4 is acceptable baseline on public datasets; production wins are often measured in 0.5-2% relative CTR lift per quarter.
  • A/B sample size: 1-2 weeks, ≥1M users per arm, for a 0.5% MDE on CTR at alpha=0.05, power=0.8.
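The A/B line can be sanity-checked with the standard two-proportion power approximation. The answer is very sensitive to the baseline CTR and to whether the 0.5% MDE is relative or absolute; both inputs below are assumptions for illustration:

```python
def samples_per_arm(p_base, mde_rel, alpha_z=1.96, power_z=0.8416):
    """Per-arm sample size under the two-proportion normal approximation.
    alpha_z = z_{alpha/2} for alpha=0.05; power_z = z_beta for power=0.8."""
    delta = p_base * mde_rel                  # relative MDE -> absolute gap
    variance = 2 * p_base * (1 - p_base)      # pooled, assuming p1 ~= p2
    return (alpha_z + power_z) ** 2 * variance / delta ** 2

# assumed 5% baseline CTR, 0.5% *relative* lift to detect
n = samples_per_arm(p_base=0.05, mde_rel=0.005)
```

At these assumed values n lands on the order of 10^7 per arm; since n scales as 1/MDE², the "≥1M users" rule of thumb corresponds to a larger MDE or higher baseline, so always confirm which MDE definition the interviewer means.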

OpenAI lens

OpenAI-style questions emphasise latency budgets and cost per 1k requests. Be ready to compute: "your ranker adds 30 ms and costs $0.01 per 1k requests — is that worth a 0.3% CTR lift?" Most candidates cannot answer without thinking; pre-bake a framework.
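One way to pre-bake that framework: convert the lift into incremental revenue per 1k requests and compare it with the incremental cost. The baseline CTR and revenue-per-click below are illustrative assumptions:

```python
def incremental_value_per_1k(baseline_ctr, rel_lift, revenue_per_click):
    """Dollar value of a relative CTR lift, per 1k requests."""
    extra_clicks = 1000 * baseline_ctr * rel_lift
    return extra_clicks * revenue_per_click

# assumed: 5% baseline CTR, 0.3% relative lift, $0.10 per click
value = incremental_value_per_1k(baseline_ctr=0.05, rel_lift=0.003,
                                 revenue_per_click=0.10)
cost = 0.01                       # $ per 1k requests, from the question
worth_it = value > cost
```

Under these assumptions the lift is worth ~$0.015 per 1k requests against $0.01 of cost, a thin margin, and the 30 ms carries an implicit cost too, since added latency itself depresses engagement.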

Anthropic lens

Anthropic wants recommenders that do not recommend harm. Expect probing on negative objectives (avoid showing self-harm, misinformation, PII leakage from retrieved docs), on slice-wise fairness audits, and on clear escape hatches for users to contest rankings. Treat safety as a first-class ranker objective, not an afterthought re-ranker filter.
