What RAG is and when to use it

Retrieval-Augmented Generation (RAG) grounds an LLM's answer in documents it did not see during training. The canonical interview framing: "given a corpus of N million docs, answer questions with citations, under 500ms, without finetuning the model".

Use RAG when you need: (1) up-to-date knowledge (the model's training cutoff is stale); (2) private knowledge (your wiki, Slack, customer tickets); (3) auditable citations; (4) long-tail factual precision where hallucination is unacceptable. Don't use RAG for reasoning-only tasks or when the answer is in the model's parameters anyway (simple arithmetic, general summarization).

Interview numbers: a typical enterprise corpus is 1M–100M chunks of ~500 tokens each. Embeddings at 1024-dim fp16 = 2 KB/chunk, so 100M chunks = 200 GB — fits in a single-box Milvus/Qdrant with RAM overprovisioning. End-to-end budget: 50ms retrieval, 100ms rerank, 300ms generation with streaming first-token.
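The sizing arithmetic can be sanity-checked in a few lines. A sketch that counts raw fp16 vectors only — real deployments also pay for HNSW graph links and metadata on top:

```python
# Back-of-envelope sizing for the corpus numbers quoted above.
# Counts raw fp16 vectors only; index structures add overhead on top.
DIM = 1024            # embedding dimension
FP16_BYTES = 2        # bytes per fp16 value
CHUNKS = 100_000_000  # 100M chunks

bytes_per_chunk = DIM * FP16_BYTES         # 2048 B = 2 KB per chunk
total_gb = CHUNKS * bytes_per_chunk / 1e9  # ~205 GB in decimal GB
```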

Source cross-reference

Gulli's Agentic Design Patterns Ch.14 (Knowledge Retrieval) is the canonical catalog. Read the original RAG paper (Lewis et al. 2020), Contriever/ColBERT for retrieval, and HyDE (Gao et al. 2022) for query-side tricks.

Chunking: the part everyone gets wrong

Chunking is the single biggest quality lever — worse chunking beats better embeddings. Four strategies:

  • Fixed-size (256/512/1024 tokens with 10–20% overlap). Dead simple. Works fine for narrative text; terrible for code and structured docs.
  • Semantic (chunk at sentence/paragraph boundaries; cluster adjacent sentences by embedding distance). Best quality; more compute.
  • Structural (split at markdown headers, code function boundaries, HTML sections). The right answer for technical documentation, code, and wikis.
  • Sliding window with "parent doc" retrieval: store small chunks for retrieval, return the full parent section at generation time. Best of both.

Number rule: optimize chunk size to match the expected answer granularity. FAQ-style → 200 tokens. Dense technical docs → 500–800. Conversations → by turn.
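Fixed-size chunking with overlap is a few lines. A toy sketch that chunks any list — a real pipeline would chunk tokenizer token ids, not arbitrary items:

```python
def chunk_fixed(tokens, size=512, overlap_frac=0.15):
    """Fixed-size chunking with fractional overlap.
    `tokens` is any sequence; real pipelines pass tokenizer ids."""
    step = max(1, int(size * (1 - overlap_frac)))  # stride between chunk starts
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # last chunk reached the end
            break
    return chunks

doc = list(range(1200))          # stand-in for 1200 token ids
chunks = chunk_fixed(doc, size=512)
```

With a 512-token size and 15% overlap, adjacent chunks share 77 tokens — tune `size` to the answer-granularity rule above.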

Anti-pattern

One chunk size for all corpora. A single 512-token chunk cuts PDF tables in half and destroys code. Always inspect chunks by eye on 20 random documents before deploying.

Hybrid retrieval + reranking

Production RAG almost always uses hybrid retrieval:

  1. BM25 / keyword (Elasticsearch, OpenSearch). Catches exact matches — SKUs, error codes, names. Vector search misses these.
  2. Dense (embedding similarity, e.g., cosine over bge-large, text-embedding-3-small). Catches semantic matches.
  3. Reciprocal Rank Fusion (RRF) merges the two ranked lists: score(d) = Σ 1/(k + rank_i(d)) with k=60 typical. Beats either alone by 5–15% on retrieval benchmarks.
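RRF itself is tiny. A minimal sketch of the merge over two ranked lists (the doc ids are made up):

```python
def rrf_merge(ranked_lists, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1/(k + rank),
    rank starting at 1. Returns doc ids sorted by fused score, best first."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25  = ["sku-123", "doc-a", "doc-b"]   # keyword ranking
dense = ["doc-a", "doc-c", "sku-123"]   # vector ranking
fused = rrf_merge([bm25, dense])        # doc-a wins: ranked high in both lists
```

Note how a doc that appears in both lists outranks one that tops a single list — that agreement bonus is where the 5–15% gain comes from.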

Then rerank with a cross-encoder (e.g., Cohere rerank, BGE-reranker, Jina). Cross-encoders score each (query, doc) pair jointly — 100× slower than bi-encoders but a 10–20% NDCG lift. Pattern: hybrid-retrieve 50 candidates → cross-encoder rerank → keep the top 5 for the LLM.

flowchart LR
  Q[User query] --> QE[Query encoder]
  Q --> QR[Query rewriter<br/>HyDE / decomp]
  QR --> BM[BM25 index]
  QE --> VX[Vector index<br/>HNSW]
  BM --> RRF[RRF merge]
  VX --> RRF
  RRF --> XR[Cross-encoder rerank]
  XR --> CTX[Top-5 chunks]
  CTX --> LLM[LLM + citations]
  LLM --> A[Answer + refs]

Query-side tricks: HyDE, decomposition, multi-hop

HyDE (Hypothetical Document Embeddings)

A user query "what did we decide about paid leave?" embeds poorly because it's a question, but docs are statements. HyDE: ask the LLM to draft a fake answer, then embed that. The fake doc embedding matches real doc embeddings much better. Empirically +10–15% retrieval recall at a cost of one LLM call.
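The mechanics are one extra LLM call before embedding. A sketch with stand-in `llm` and `embed` callables — both hypothetical here; swap in your real LLM client and embedding model:

```python
def hyde_embed(query, llm, embed):
    """HyDE: embed a hypothetical answer instead of the raw question.
    `llm` and `embed` are stand-ins for real LLM / embedding-model calls."""
    fake_doc = llm(f"Write a short passage that answers: {query}")
    return embed(fake_doc)  # statement-shaped text matches doc embeddings better

# Toy stubs just to show the flow: draft an answer, then embed the draft.
vec = hyde_embed(
    "what did we decide about paid leave?",
    llm=lambda prompt: "The 2024 policy grants 20 days of paid leave.",
    embed=lambda text: [float(len(text))],  # toy 1-dim "embedding"
)
```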

Query decomposition

Multi-part queries ("compare our Q3 revenue to Anthropic's Q3 revenue") embed poorly. Have the LLM split into sub-queries, retrieve for each, union the context. Standard with tool-using agents; pairs with the planner/executor pattern (see the agentic-patterns page).

Multi-hop

"Who is the CTO of the company that acquired Figma?" requires two retrievals. Solutions: (a) LLM planner executes hop 1, then hop 2; (b) GraphRAG builds an entity graph up front so multi-hop is a graph traversal; (c) fine-tuned retriever like Self-RAG that knows when to retrieve more.

Grounding, citations, and multi-modal

Citation-grounded generation

Prompt the LLM to cite source IDs inline: [doc_id:chunk_id]. Post-process to verify every citation actually exists. For Anthropic-style interviews, explicitly discuss how you would measure grounding: train a separate NLI model to check if each output claim is entailed by the cited passage (this is what FActScore and the Bespoke-Minicheck tools do).
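The existence check is a simple post-process. A sketch assuming the inline `[doc_id:chunk_id]` format above; ids and the answer text are made up:

```python
import re

def verify_citations(answer, known_chunks):
    """Split inline [doc_id:chunk_id] citations into those that exist in
    the retrieved set and those the model hallucinated."""
    cited = re.findall(r"\[([\w-]+:[\w-]+)\]", answer)
    valid = [c for c in cited if c in known_chunks]
    invalid = [c for c in cited if c not in known_chunks]
    return valid, invalid

answer = "Revenue grew 12% [10k-2023:c4] per the filing [10k-2023:c9]."
valid, invalid = verify_citations(answer, known_chunks={"10k-2023:c4"})
```

Anything in `invalid` is a hallucinated citation — drop it, or regenerate the answer. The entailment check (NLI) is the stronger, separate step.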

Long-context vs RAG

Claude 3 Opus has 200k context; Gemini 1.5 Pro has 1M. Doesn't RAG become obsolete? No — 1M-token prompts cost $3+ per query at current API pricing, dwarfing retrieval cost. Also, "needle in a haystack" recall degrades past ~100k tokens. Long context complements RAG: retrieve coarsely to fit ~100k tokens, then let the model sift.

Multi-modal RAG

PDFs contain tables, charts, figures. Two approaches: (1) OCR + layout parsing (LayoutLMv3, Unstructured.io), embed text per table/figure; (2) vision-language embeddings (CLIP, SigLIP, ColPali) — embed page images directly. ColPali shows that indexing page images with a late-interaction VLM beats OCR pipelines on visual-heavy corpora.

OpenAI-specific

OpenAI's Assistants / File Search product builds RAG for you: auto-chunking (800 tokens, 400 overlap), hybrid retrieval, auto-reranking. Good for MVP, but you lose control of chunk strategy and eval — bring your own pipeline for production.

Anthropic-specific

Anthropic's "Contextual Retrieval" technique (public engineering blog) prepends a chunk-specific one-sentence context to each chunk before embedding. Example: "From the 2023 10-K, section on liquidity: [original chunk]". Reported a 49% reduction in retrieval failures with contextual embeddings plus contextual BM25, rising to 67% with reranking added. This is the interview bait: cite it.

Evaluation and interview checklist

You can't improve RAG without offline eval. Minimum viable eval harness:

  • Retrieval metrics: hit@k, recall@k, MRR on a labeled set of 200 (query, gold-chunk) pairs.
  • Grounding metrics: faithfulness (does answer only use retrieved context?), citation precision.
  • End-to-end: LLM-as-judge on answer quality, calibrated against 50 human-rated golds. See the evaluation page.
  • Online: thumbs up/down, regret rate, follow-up-question rate.
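hit@k and MRR over a labeled set are a few lines each. A minimal sketch (the chunk ids are made up):

```python
def hit_at_k(retrieved, gold, k):
    """1 if the gold chunk appears in the top-k retrieved ids, else 0."""
    return int(gold in retrieved[:k])

def mrr(queries):
    """Mean Reciprocal Rank over (retrieved ranking, gold chunk id) pairs;
    a miss contributes 0."""
    total = 0.0
    for retrieved, gold in queries:
        if gold in retrieved:
            total += 1.0 / (retrieved.index(gold) + 1)
    return total / len(queries)

labeled = [
    (["c1", "c7", "c3"], "c7"),  # gold at rank 2 -> RR 0.5
    (["c2", "c4", "c9"], "c2"),  # gold at rank 1 -> RR 1.0
    (["c5", "c6", "c8"], "c0"),  # miss           -> RR 0.0
]
```

Run these on the same 200-pair labeled set before and after every retrieval change; a chunking or reranker tweak that doesn't move recall@k or MRR is vibes.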

Anti-patterns

  • No eval set. You're flying blind; every "improvement" is vibes.
  • Vector-only retrieval. You will miss exact-match queries. Always hybrid.
  • Top-k = 20 stuffed into the prompt. Cross-encoder rerank to top-5 or you dilute the LLM.
  • One global index for multi-tenant data. Permission leaks become compliance incidents.
  • Ignoring freshness. Index at write time or hourly; daily rebuilds mean stale answers.

Whiteboard checklist: corpus size and shape → chunking strategy → embedding model → hybrid index (BM25 + HNSW) → query rewrite (HyDE / decomp) → cross-encoder rerank → citation-grounded generation → eval harness → freshness pipeline → per-tenant permissions.
