Source cross-reference
Primary: Khang Pham Ch.2 (core ML / embeddings), Ch.3 (YouTube video recs), Ch.4 (feed ranking), Ch.7 (Airbnb similar listings), Ch.8 (LinkedIn personalized search). Cross-reference with Chip Huyen Ch.6 for offline eval metrics (NDCG, MAP).
The three-stage funnel
Every large-scale recommender at Google, Meta, LinkedIn, Pinterest, and Airbnb uses the same shape: billions of items -> thousands -> hundreds -> tens across three stages. The funnel exists because no single model can be both cheap enough to score a billion items and smart enough to rank ten of them.
```mermaid
flowchart LR
    corpus[("Corpus<br/>10^9 items")] --> cg["Candidate Generation<br/>many sources in parallel<br/>two-tower, collab filtering, popularity"]
    cg --> merge["Merge & dedup<br/>~1,000 items"]
    merge --> rank["Ranking<br/>heavy model<br/>wide&deep / DLRM / transformer"]
    rank --> rerank["Re-rank<br/>diversity, business rules<br/>LLM-assisted"]
    rerank --> ui["UI · top 10-50"]
```
Per-stage latency budget for a 200 ms end-to-end feed: retrieval ≤ 40 ms, ranking ≤ 80 ms, re-rank ≤ 20 ms, network+render ≤ 60 ms. Every successful design you have seen in an interview fits this pattern.
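The budget above is worth checking mechanically in a design discussion; a trivial sketch using the numbers from this section (the helper name is illustrative):

```python
# Per-stage latency budget (ms) for a 200 ms end-to-end feed request.
BUDGET_MS = {
    "retrieval": 40,       # ANN lookup over the item index
    "ranking": 80,         # heavy model over ~1,000 candidates
    "rerank": 20,          # diversity / business rules
    "network_render": 60,  # network + client render
}

def remaining_budget(end_to_end_ms: int, spent: dict) -> int:
    """Budget left after the listed stages; negative means an SLO breach."""
    return end_to_end_ms - sum(spent.values())

print(remaining_budget(200, BUDGET_MS))  # 0: the 200 ms budget is fully allocated
```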
Candidate generation (retrieval)
Goal: from 10^9 items, return O(10^3) that are worth ranking. Three families:
- Two-tower / dual encoder: one tower encodes the user (history, demographics), one encodes the item; train with sampled-softmax / in-batch negatives so that <user, item> ≈ score. Serve by ANN search (ScaNN, HNSW, FAISS) on the item side. This is the YouTube recs backbone (Ch.3).
- Collaborative filtering / matrix factorisation: older but still strong in domains without a cold-start problem. Ch.2 covers the primer.
- Rule-based / heuristic: popular-in-your-country, recency, "friends' recent posts". Cheap safety net; LinkedIn's feed (Ch.4) ensembles 10+ such generators.
Key design choices:
- Negative sampling: random negatives are too easy; hard negatives (same session, almost clicked) dominate modern two-tower. In-batch negatives are free.
- Embedding dim: 64-256 is typical. Larger embeddings buy accuracy at the cost of latency and storage.
- ANN index: HNSW beats flat at 10^6+ items; quantised ScaNN beats HNSW at 10^8+ with 4x memory cut.
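The in-batch-negative training objective can be sketched in pure Python; the batch and embeddings below are toy values, and a real system would use a DL framework:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def in_batch_softmax_loss(user_embs, item_embs):
    """In-batch sampled-softmax for a two-tower retriever.

    Row i of user_embs pairs with row i of item_embs (the clicked item);
    every other item in the batch serves as a free negative.
    """
    losses = []
    for i, u in enumerate(user_embs):
        logits = [dot(u, v) for v in item_embs]  # user scored against whole batch
        m = max(logits)                          # stabilise the softmax
        log_z = m + math.log(sum(math.exp(s - m) for s in logits))
        losses.append(log_z - logits[i])         # -log p(positive | batch)
    return sum(losses) / len(losses)

# Toy batch: each user's embedding is already aligned with its own item.
users = [[1.0, 0.0], [0.0, 1.0]]
items = [[1.0, 0.0], [0.0, 1.0]]
print(round(in_batch_softmax_loss(users, items), 4))  # 0.3133
```

Misaligned user/item pairs score a strictly higher loss, which is the signal the two towers train on.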
Ranking models
Ranking scores each of the ~1,000 candidates with a heavy model that can consume hundreds of features.
- Wide & Deep (Cheng 2016, Google Play): wide linear arm memorises sparse crosses, deep MLP generalises. Still the simplest respectable baseline. Ch.5 shows the ad-CTR version.
- DLRM (Meta, 2019): embedding tables for categoricals -> feature-interaction layer (dot products) -> MLP. Workhorse of Meta ranking; the open-source code is a great reference. Memory is dominated by embedding tables (often >100 GB).
- DeepFM / DCN-V2: explicit cross-feature networks; replace the wide arm.
- Transformer rankers (SASRec, BST): sequence-aware, handle long user history natively; Alibaba and Pinterest have public results.
- Multi-task heads: one tower predicts click, watch-time, share, report; combine with a learned or hand-tuned scalarisation. Ch.3's YouTube design uses this explicitly.
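The hand-tuned scalarisation in the last bullet takes a few lines; the head names match the text, but the weights and predictions are illustrative:

```python
def combine_heads(predictions, weights):
    """Scalarise multi-task head outputs into one ranking score.

    predictions: per-item dict of head name -> predicted probability/value.
    weights: hand-tuned (or offline-learned) importance per head; negative
    weights penalise undesirable outcomes such as 'report'.
    """
    return sum(weights[h] * p for h, p in predictions.items())

# Illustrative weights: value watch-time most, penalise predicted reports hard.
WEIGHTS = {"click": 1.0, "watch_time": 2.0, "share": 0.5, "report": -5.0}

video_a = {"click": 0.30, "watch_time": 0.40, "share": 0.02, "report": 0.001}
video_b = {"click": 0.35, "watch_time": 0.10, "share": 0.01, "report": 0.050}
print(combine_heads(video_a, WEIGHTS) > combine_heads(video_b, WEIGHTS))  # True
```

Note that video_b wins on raw click probability but loses overall: the scalarisation is exactly where watch-time and integrity objectives override CTR.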
Loss functions in one line
Pointwise (log-loss) for CTR; pairwise (BPR / LambdaRank) when you have implicit comparisons; listwise (softmax-CE / LambdaMART) when NDCG is the target. Airbnb's Aerosolve (Ch.7) uses pairwise hinge loss.
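The first two families in code, with log-loss and BPR as representatives (a minimal sketch, not any book's implementation):

```python
import math

def pointwise_logloss(p, clicked):
    """Pointwise: binary cross-entropy on the predicted CTR of one item."""
    return -math.log(p) if clicked else -math.log(1.0 - p)

def pairwise_bpr(score_pos, score_neg):
    """Pairwise BPR: -log sigmoid(s_pos - s_neg).

    Only the margin between the clicked and unclicked item matters, which is
    why pairwise losses fit implicit-comparison data.
    """
    return -math.log(1.0 / (1.0 + math.exp(-(score_pos - score_neg))))

print(round(pointwise_logloss(0.9, clicked=True), 4))   # 0.1054
print(pairwise_bpr(2.0, 1.0) < pairwise_bpr(1.0, 2.0))  # True: correct order is cheaper
```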
Re-ranking & business rules
The last stage applies constraints the ranker cannot see inside its objective:
- Diversity: MMR (maximal marginal relevance), DPP (determinantal point process), per-category caps (e.g. "no more than 3 videos from the same creator in top 10"). Facebook's feed explicitly enforces creator diversity.
- Freshness: time-decay boosts for new items; counteracts popularity bias.
- Business rules: policy filters (e.g. must be available in user's country), ad-slot blending, creator ecosystem rules.
- Calibration: rescale scores so they reflect probability, important when scores blend across systems (ads + organic).
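Of these, MMR is the one interviewers most often ask you to write out. A greedy pure-Python sketch (the item names and similarity function are illustrative):

```python
def mmr(candidates, sim, relevance, lam=0.7, k=10):
    """Maximal marginal relevance re-ranking.

    Greedily picks items, trading relevance against similarity to what is
    already selected: score = lam * rel - (1 - lam) * max_sim_to_selected.
    """
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def mmr_score(c):
            redundancy = max((sim(c, s) for s in selected), default=0.0)
            return lam * relevance[c] - (1 - lam) * redundancy
        best = max(pool, key=mmr_score)
        selected.append(best)
        pool.remove(best)
    return selected

# Toy example: 'a1' and 'a2' are near-duplicates from the same creator.
rel = {"a1": 0.9, "a2": 0.85, "b": 0.6}
same_creator = lambda x, y: 1.0 if x[0] == y[0] else 0.0
print(mmr(["a1", "a2", "b"], same_creator, rel, lam=0.7, k=2))  # ['a1', 'b']
```

With lam=1.0 the function degenerates to pure relevance ordering; lowering lam buys diversity at the cost of raw relevance.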
Case patterns: YouTube, feed, Airbnb
Three canonical shapes you should be able to draw in 2 minutes:
- YouTube (Ch.3): two-tower retrieval from billions of videos with user watch-history + context; ranking is a multi-task model optimising click + watch-time + satisfaction survey; hard negatives are sampled from the same session. 20+ retrieval sources merged.
- Feed ranking — Facebook/LinkedIn (Ch.4, Ch.8): candidates from "friends' posts", "pages you follow", "groups", "sponsored" streams; ranker is a DLRM-style multi-task model predicting click, dwell, like, share, hide. LinkedIn layers a second-pass ranking after re-rank to optimise recruiter value.
- Airbnb search (Ch.7): user location + query geohash + listing features; a two-tower model produces listing embeddings trained on "user viewed A then booked B" pairs; the ranker was gradient-boosted trees for years and has since moved to wide-and-deep. Seasonality and host-cancellation risk are explicit features.
LLM-assisted ranking
Interview-relevant since 2024:
- LLM as a feature: precompute item descriptions via an LLM, embed with a sentence encoder; feed as extra features to the ranker. No inference-path LLM call -> cheap and safe.
- LLM as a re-ranker: give a top-20 shortlist plus user context to an LLM, have it output a permutation or score. 200-500 ms extra latency; only justified for high-value surfaces (e.g. flagship recommendations home).
- LLM-generated candidates: "user is planning a trip to Kyoto, suggest experiences" -> LLM proposes 10 concepts -> resolved to real items via retrieval. Used by Airbnb and Spotify in 2024 launches.
Anti-pattern: put an LLM on every query
An LLM call at 400 ms p50 / 2 s p99 on a 100 ms feed budget is infeasible at scale. Use LLMs only (a) offline to enrich items, (b) for very top slots, (c) behind a caching layer with aggressive TTL. Always have a non-LLM fallback path.
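The fallback path can be enforced mechanically. A sketch, assuming a hypothetical `llm_rerank` callable; the timeout value, item names, and scores are all illustrative:

```python
import time

def rerank_with_fallback(shortlist, llm_rerank, base_scores, timeout_s=0.3):
    """Try the LLM re-ranker; on error, timeout, or a bad output, fall back.

    llm_rerank is a hypothetical callable returning a permutation of the
    shortlist; any failure drops us back to ranking by base_scores, so the
    non-LLM path is always available.
    """
    start = time.monotonic()
    try:
        order = llm_rerank(shortlist)
        if time.monotonic() - start > timeout_s or sorted(order) != sorted(shortlist):
            raise TimeoutError("LLM too slow or returned a bad permutation")
        return order
    except Exception:
        return sorted(shortlist, key=lambda x: base_scores[x], reverse=True)

base = {"x": 0.2, "y": 0.9, "z": 0.5}
flaky_llm = lambda items: (_ for _ in ()).throw(RuntimeError("p99 spike"))
print(rerank_with_fallback(["x", "y", "z"], flaky_llm, base))  # ['y', 'z', 'x']
```

In production the same wrapper sits behind the TTL cache mentioned above, so only cache misses on high-value surfaces ever reach the LLM.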
Numbers to remember
Cheat-sheet
- Retrieval output: 500-2,000 candidates.
- Ranking input: up to 1,000, output: top ~50.
- Typical two-tower embedding: 64-256 dims; item index 10^8 items -> ANN ~10 ms p99 with HNSW (M=32, efSearch=128).
- Feed end-to-end latency budget: 100-200 ms; ranking model ≤ 80 ms on CPU with INT8 quantisation, or ~20 ms on a T4.
- Offline metrics: NDCG@10 ≥ 0.4 is acceptable baseline on public datasets; production wins are often measured in 0.5-2% relative CTR lift per quarter.
- A/B sample size: 1-2 weeks, ≥1M users per arm, for a 0.5% MDE on CTR at alpha=0.05, power=0.8.
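NDCG@10 from the cheat-sheet is also worth being able to compute by hand; a minimal sketch over graded relevance labels in ranked order:

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k: DCG = sum rel_i / log2(i + 1), normalised by the ideal order."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# A perfect ordering scores 1.0; pushing the best item down costs NDCG.
print(ndcg_at_k([3, 2, 1, 0]))            # 1.0
print(round(ndcg_at_k([0, 2, 1, 3]), 3))  # 0.641
```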
OpenAI lens
OpenAI-style questions emphasise latency budgets and cost per 1k requests. Be ready to compute on the spot: "your ranker adds 30 ms and costs $0.01 per 1k requests — is that worth a 0.3% CTR lift?" Most candidates freeze when asked this cold; pre-bake a framework.
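One way to pre-bake that framework is a back-of-envelope ROI calculation; the traffic volume, baseline CTR, and value per click below are stated assumptions, not facts from the source:

```python
def ranker_roi(ctr_lift_rel, value_per_click_usd, baseline_ctr,
               extra_cost_per_1k_usd, requests_per_day):
    """Back-of-envelope ranker ROI: daily revenue gain vs. daily serving cost."""
    extra_clicks = requests_per_day * baseline_ctr * ctr_lift_rel
    gain = extra_clicks * value_per_click_usd               # USD/day from the lift
    cost = requests_per_day / 1000 * extra_cost_per_1k_usd  # USD/day to serve
    return gain, cost

# The interview numbers: +0.3% relative CTR, $0.01 per 1k requests.
# Assumed: 100M requests/day, 5% baseline CTR, $0.10 value per click.
gain, cost = ranker_roi(0.003, 0.10, 0.05, 0.01, 100_000_000)
print(gain > cost)  # True: roughly $1,500/day gain vs $1,000/day serving cost
```

The sketch prices only the serving cost; the extra 30 ms of latency also has an engagement cost you would need to estimate separately before calling the trade worthwhile.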
Anthropic lens
Anthropic wants recommenders that do not recommend harm. Expect probing on negative objectives (avoid showing self-harm, misinformation, PII leakage from retrieved docs), on slice-wise fairness audits, and on clear escape hatches for users to contest rankings. Treat safety as a first-class ranker objective, not an afterthought re-ranker filter.