Source cross-reference
Primary: Khang Pham Ch.2 (core ML / embeddings), Ch.3 (YouTube video recs), Ch.4 (feed ranking), Ch.7 (Airbnb similar listings), Ch.8 (LinkedIn personalized search). Cross-reference with Chip Huyen Ch.6 for offline eval metrics (NDCG, MAP).
The three-stage funnel
Every large-scale recommender at Google, Meta, LinkedIn, Pinterest, and Airbnb uses the same shape: billions of items -> thousands -> hundreds -> tens across three stages. The funnel exists because no single model can be both cheap enough to score a billion items and smart enough to rank ten of them.
```mermaid
flowchart LR
    corpus[("Corpus<br/>10^9 items")] --> cg["Candidate Generation<br/>many sources in parallel<br/>two-tower, collab filtering, popularity"]
    cg --> merge["Merge & dedup<br/>~1,000 items"]
    merge --> rank["Ranking<br/>heavy model<br/>wide&deep / DLRM / transformer"]
    rank --> rerank["Re-rank<br/>diversity, business rules<br/>LLM-assisted"]
    rerank --> ui["UI · top 10-50"]
```
Per-stage latency budget for a 200 ms end-to-end feed: retrieval ≤ 40 ms, ranking ≤ 80 ms, re-rank ≤ 20 ms, network+render ≤ 60 ms. Every successful design you have seen in an interview fits this pattern.
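The budget above is worth checking mechanically in a design discussion; a trivial sketch using the numbers from this section (the helper name is illustrative):

```python
# Per-stage latency budget (ms) for a 200 ms end-to-end feed request.
BUDGET_MS = {
    "retrieval": 40,       # ANN lookup over the item index
    "ranking": 80,         # heavy model over ~1,000 candidates
    "rerank": 20,          # diversity / business rules
    "network_render": 60,  # network + client render
}

def remaining_budget(end_to_end_ms: int, spent: dict) -> int:
    """Budget left after the listed stages; negative means an SLO breach."""
    return end_to_end_ms - sum(spent.values())

print(remaining_budget(200, BUDGET_MS))  # 0: the 200 ms budget is fully allocated
```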
Candidate generation (retrieval)
Goal: from 10^9 items, return O(10^3) that are worth ranking. Three families:
- Two-tower / dual encoder: one tower encodes the user (history, demographics), one encodes the item; train with sampled-softmax / in-batch negatives so that <user, item> ≈ score. Serve by ANN search (ScaNN, HNSW, FAISS) on the item side. This is the YouTube recs backbone (Ch.3).
- Collaborative filtering / matrix factorisation: older but still strong in domains without a cold-start problem. Ch.2 covers the primer.
- Rule-based / heuristic: popular-in-your-country, recency, "friends' recent posts". Cheap safety net; LinkedIn's feed (Ch.4) ensembles 10+ such generators.
Key design choices:
- Negative sampling: random negatives are too easy; hard negatives (same session, almost clicked) dominate modern two-tower. In-batch negatives are free.
- Embedding dim: 64-256 is typical. Larger embeddings buy accuracy at the cost of latency and storage.
- ANN index: HNSW beats flat at 10^6+ items; quantised ScaNN beats HNSW at 10^8+ with 4x memory cut.
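The in-batch-negative training objective can be sketched in pure Python; the batch and embeddings below are toy values, and a real system would use a DL framework:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def in_batch_softmax_loss(user_embs, item_embs):
    """In-batch sampled-softmax for a two-tower retriever.

    Row i of user_embs pairs with row i of item_embs (the clicked item);
    every other item in the batch serves as a free negative.
    """
    losses = []
    for i, u in enumerate(user_embs):
        logits = [dot(u, v) for v in item_embs]  # user scored against whole batch
        m = max(logits)                          # stabilise the softmax
        log_z = m + math.log(sum(math.exp(s - m) for s in logits))
        losses.append(log_z - logits[i])         # -log p(positive | batch)
    return sum(losses) / len(losses)

# Toy batch: each user's embedding is already aligned with its own item.
users = [[1.0, 0.0], [0.0, 1.0]]
items = [[1.0, 0.0], [0.0, 1.0]]
print(round(in_batch_softmax_loss(users, items), 4))  # 0.3133
```

Misaligned user/item pairs score a strictly higher loss, which is the signal the two towers train on.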
Ranking models
Ranking scores each of the ~1,000 candidates with a heavy model that can consume hundreds of features.
- Wide & Deep (Cheng 2016, Google Play): wide linear arm memorises sparse crosses, deep MLP generalises. Still the simplest respectable baseline. Ch.5 shows the ad-CTR version.
- DLRM (Meta, 2019): embedding tables for categoricals -> feature-interaction layer (dot products) -> MLP. Workhorse of Meta ranking; the open-source code is a great reference. Memory is dominated by embedding tables (often >100 GB).
- DeepFM / DCN-V2: explicit cross-feature networks; replace the wide arm.
- Transformer rankers (SASRec, BST): sequence-aware, handle long user history natively; Alibaba and Pinterest have public results.
- Multi-task heads: one tower predicts click, watch-time, share, report; combine with a learned or hand-tuned scalarisation. Ch.3's YouTube design uses this explicitly.
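The hand-tuned scalarisation in the last bullet takes a few lines; the head names match the text, but the weights and predictions are illustrative:

```python
def combine_heads(predictions, weights):
    """Scalarise multi-task head outputs into one ranking score.

    predictions: per-item dict of head name -> predicted probability/value.
    weights: hand-tuned (or offline-learned) importance per head; negative
    weights penalise undesirable outcomes such as 'report'.
    """
    return sum(weights[h] * p for h, p in predictions.items())

# Illustrative weights: value watch-time most, penalise predicted reports hard.
WEIGHTS = {"click": 1.0, "watch_time": 2.0, "share": 0.5, "report": -5.0}

video_a = {"click": 0.30, "watch_time": 0.40, "share": 0.02, "report": 0.001}
video_b = {"click": 0.35, "watch_time": 0.10, "share": 0.01, "report": 0.050}
print(combine_heads(video_a, WEIGHTS) > combine_heads(video_b, WEIGHTS))  # True
```

Note that video_b wins on raw click probability but loses overall: the scalarisation is exactly where watch-time and integrity objectives override CTR.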
Loss functions in one line
Pointwise (log-loss) for CTR; pairwise (BPR / LambdaRank) when you have implicit comparisons; listwise (softmax-CE / LambdaMART) when NDCG is the target. Airbnb's Aerosolve (Ch.7) uses pairwise hinge loss.
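The first two families in code, with log-loss and BPR as representatives (a minimal sketch, not any book's implementation):

```python
import math

def pointwise_logloss(p, clicked):
    """Pointwise: binary cross-entropy on the predicted CTR of one item."""
    return -math.log(p) if clicked else -math.log(1.0 - p)

def pairwise_bpr(score_pos, score_neg):
    """Pairwise BPR: -log sigmoid(s_pos - s_neg).

    Only the margin between the clicked and unclicked item matters, which is
    why pairwise losses fit implicit-comparison data.
    """
    return -math.log(1.0 / (1.0 + math.exp(-(score_pos - score_neg))))

print(round(pointwise_logloss(0.9, clicked=True), 4))   # 0.1054
print(pairwise_bpr(2.0, 1.0) < pairwise_bpr(1.0, 2.0))  # True: correct order is cheaper
```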
Re-ranking & business rules
The last stage applies constraints the ranker cannot see inside its objective:
- Diversity: MMR (maximal marginal relevance), DPP (determinantal point process), per-category caps (e.g. "no more than 3 videos from the same creator in top 10"). Facebook's feed explicitly enforces creator diversity.
- Freshness: time-decay boosts for new items; counteracts popularity bias.
- Business rules: policy filters (e.g. must be available in user's country), ad-slot blending, creator ecosystem rules.
- Calibration: rescale scores so they reflect probability, important when scores blend across systems (ads + organic).
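Of these, MMR is the one interviewers most often ask you to write out. A greedy pure-Python sketch (the item names and similarity function are illustrative):

```python
def mmr(candidates, sim, relevance, lam=0.7, k=10):
    """Maximal marginal relevance re-ranking.

    Greedily picks items, trading relevance against similarity to what is
    already selected: score = lam * rel - (1 - lam) * max_sim_to_selected.
    """
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def mmr_score(c):
            redundancy = max((sim(c, s) for s in selected), default=0.0)
            return lam * relevance[c] - (1 - lam) * redundancy
        best = max(pool, key=mmr_score)
        selected.append(best)
        pool.remove(best)
    return selected

# Toy example: 'a1' and 'a2' are near-duplicates from the same creator.
rel = {"a1": 0.9, "a2": 0.85, "b": 0.6}
same_creator = lambda x, y: 1.0 if x[0] == y[0] else 0.0
print(mmr(["a1", "a2", "b"], same_creator, rel, lam=0.7, k=2))  # ['a1', 'b']
```

With lam=1.0 the function degenerates to pure relevance ordering; lowering lam buys diversity at the cost of raw relevance.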
Case patterns: YouTube, feed, Airbnb
Three canonical shapes you should be able to draw in 2 minutes:
- YouTube (Ch.3): two-tower retrieval from billions of videos with user watch-history + context; ranking is a multi-task model optimising click + watch-time + satisfaction survey; hard negatives are sampled from the same session. 20+ retrieval sources merged.
- Feed ranking — Facebook/LinkedIn (Ch.4, Ch.8): candidates from "friends' posts", "pages you follow", "groups", "sponsored" streams; ranker is a DLRM-style multi-task model predicting click, dwell, like, share, hide. LinkedIn layers a second-pass ranking after re-rank to optimise recruiter value.
- Airbnb search (Ch.7): user location + query geohash + listing features; a two-tower model produces listing embeddings trained on "user viewed A then booked B" pairs; the ranker was gradient-boosted trees for years and has since moved to wide-and-deep. Seasonality and host-cancellation risk are explicit features.
LLM-assisted ranking
Interview-relevant since 2024:
- LLM as a feature: precompute item descriptions via an LLM, embed with a sentence encoder; feed as extra features to the ranker. No inference-path LLM call -> cheap and safe.
- LLM as a re-ranker: give a top-20 shortlist plus user context to an LLM, have it output a permutation or score. 200-500 ms extra latency; only justified for high-value surfaces (e.g. flagship recommendations home).
- LLM-generated candidates: "user is planning a trip to Kyoto, suggest experiences" -> LLM proposes 10 concepts -> resolved to real items via retrieval. Used by Airbnb and Spotify in 2024 launches.
Anti-pattern: put an LLM on every query
An LLM call at 400 ms p50 / 2 s p99 on a 100 ms feed budget is infeasible at scale. Use LLMs only (a) offline to enrich items, (b) for very top slots, (c) behind a caching layer with aggressive TTL. Always have a non-LLM fallback path.
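The fallback path can be enforced mechanically. A sketch, assuming a hypothetical `llm_rerank` callable; the timeout value, item names, and scores are all illustrative:

```python
import time

def rerank_with_fallback(shortlist, llm_rerank, base_scores, timeout_s=0.3):
    """Try the LLM re-ranker; on error, timeout, or a bad output, fall back.

    llm_rerank is a hypothetical callable returning a permutation of the
    shortlist; any failure drops us back to ranking by base_scores, so the
    non-LLM path is always available.
    """
    start = time.monotonic()
    try:
        order = llm_rerank(shortlist)
        if time.monotonic() - start > timeout_s or sorted(order) != sorted(shortlist):
            raise TimeoutError("LLM too slow or returned a bad permutation")
        return order
    except Exception:
        return sorted(shortlist, key=lambda x: base_scores[x], reverse=True)

base = {"x": 0.2, "y": 0.9, "z": 0.5}
flaky_llm = lambda items: (_ for _ in ()).throw(RuntimeError("p99 spike"))
print(rerank_with_fallback(["x", "y", "z"], flaky_llm, base))  # ['y', 'z', 'x']
```

In production the same wrapper sits behind the TTL cache mentioned above, so only cache misses on high-value surfaces ever reach the LLM.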
Numbers to remember
Cheat-sheet
- Retrieval output: 500-2,000 candidates.
- Ranking input: up to 1,000, output: top ~50.
- Typical two-tower embedding: 64-256 dims; item index 10^8 items -> ANN ~10 ms p99 with HNSW (M=32, efSearch=128).
- Feed end-to-end latency budget: 100-200 ms; ranking model ≤ 80 ms on CPU with INT8 quantisation, or ~20 ms on a T4.
- Offline metrics: NDCG@10 ≥ 0.4 is acceptable baseline on public datasets; production wins are often measured in 0.5-2% relative CTR lift per quarter.
- A/B sample size: 1-2 weeks, ≥1M users per arm, for a 0.5% MDE on CTR at alpha=0.05, power=0.8.
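NDCG@10 from the cheat-sheet is also worth being able to compute by hand; a minimal sketch over graded relevance labels in ranked order:

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k: DCG = sum rel_i / log2(i + 1), normalised by the ideal order."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# A perfect ordering scores 1.0; pushing the best item down costs NDCG.
print(ndcg_at_k([3, 2, 1, 0]))            # 1.0
print(round(ndcg_at_k([0, 2, 1, 3]), 3))  # 0.641
```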
OpenAI lens
OpenAI-style questions emphasise latency budgets and cost per 1k requests. Be ready to compute on the spot: "your ranker adds 30 ms and costs $0.01 per 1k requests — is that worth a 0.3% CTR lift?" Most candidates freeze when asked this cold; pre-bake a framework.
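One way to pre-bake that framework is a back-of-envelope ROI calculation; the traffic volume, baseline CTR, and value per click below are stated assumptions, not facts from the source:

```python
def ranker_roi(ctr_lift_rel, value_per_click_usd, baseline_ctr,
               extra_cost_per_1k_usd, requests_per_day):
    """Back-of-envelope ranker ROI: daily revenue gain vs. daily serving cost."""
    extra_clicks = requests_per_day * baseline_ctr * ctr_lift_rel
    gain = extra_clicks * value_per_click_usd               # USD/day from the lift
    cost = requests_per_day / 1000 * extra_cost_per_1k_usd  # USD/day to serve
    return gain, cost

# The interview numbers: +0.3% relative CTR, $0.01 per 1k requests.
# Assumed: 100M requests/day, 5% baseline CTR, $0.10 value per click.
gain, cost = ranker_roi(0.003, 0.10, 0.05, 0.01, 100_000_000)
print(gain > cost)  # True: roughly $1,500/day gain vs $1,000/day serving cost
```

The sketch prices only the serving cost; the extra 30 ms of latency also has an engagement cost you would need to estimate separately before calling the trade worthwhile.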
Anthropic lens
Anthropic wants recommenders that do not recommend harm. Expect probing on negative objectives (avoid showing self-harm, misinformation, PII leakage from retrieved docs), on slice-wise fairness audits, and on clear escape hatches for users to contest rankings. Treat safety as a first-class ranker objective, not an afterthought re-ranker filter.