What RAG is and when to use it

Retrieval-Augmented Generation (RAG) grounds an LLM's answer in documents it did not see during training. The canonical interview framing: "given a corpus of N million docs, answer questions with citations, under 500ms, without finetuning the model".

Use RAG when you need: (1) up-to-date knowledge (the model's training cutoff is stale); (2) private knowledge (your wiki, Slack, customer tickets); (3) auditable citations; (4) long-tail factual precision where hallucination is unacceptable. Don't use RAG for reasoning-only tasks or when the answer is in the model's parameters anyway (simple arithmetic, general summarization).

Interview numbers: a typical enterprise corpus is 1M–100M chunks of ~500 tokens each. Embeddings at 1024-dim fp16 = 2 KB/chunk, so 100M chunks = 200 GB — fits in a single-box Milvus/Qdrant with RAM overprovisioning. End-to-end budget: 50ms retrieval, 100ms rerank, 300ms generation with streaming first-token.
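The sizing arithmetic can be sanity-checked in a few lines. A sketch that counts raw fp16 vectors only — real deployments also pay for HNSW graph links and metadata on top:

```python
# Back-of-envelope sizing for the corpus numbers quoted above.
# Counts raw fp16 vectors only; index structures add overhead on top.
DIM = 1024            # embedding dimension
FP16_BYTES = 2        # bytes per fp16 value
CHUNKS = 100_000_000  # 100M chunks

bytes_per_chunk = DIM * FP16_BYTES         # 2048 B = 2 KB per chunk
total_gb = CHUNKS * bytes_per_chunk / 1e9  # ~205 GB in decimal GB
```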

Source cross-reference

Gulli's Agentic Design Patterns Ch.14 (Knowledge Retrieval) is the canonical catalog. Read the original RAG paper (Lewis et al. 2020), Contriever/ColBERT for retrieval, and HyDE (Gao et al. 2022) for query-side tricks.

Chunking: the part everyone gets wrong

Chunking is the single biggest quality lever — worse chunking beats better embeddings. Four strategies:

  • Fixed-size (256/512/1024 tokens with 10–20% overlap). Dead simple. Works fine for narrative text; terrible for code and structured docs.
  • Semantic (chunk at sentence/paragraph boundaries; cluster adjacent sentences by embedding distance). Best quality; more compute.
  • Structural (split at markdown headers, code function boundaries, HTML sections). The right answer for technical documentation, code, and wikis.
  • Sliding window with "parent doc" retrieval: store small chunks for retrieval, return the full parent section at generation time. Best of both.

Number rule: optimize chunk size to match the expected answer granularity. FAQ-style → 200 tokens. Dense technical docs → 500–800. Conversations → by turn.
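Fixed-size chunking with overlap is a few lines. A toy sketch that chunks any list — a real pipeline would chunk tokenizer token ids, not arbitrary items:

```python
def chunk_fixed(tokens, size=512, overlap_frac=0.15):
    """Fixed-size chunking with fractional overlap.
    `tokens` is any sequence; real pipelines pass tokenizer ids."""
    step = max(1, int(size * (1 - overlap_frac)))  # stride between chunk starts
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # last chunk reached the end
            break
    return chunks

doc = list(range(1200))          # stand-in for 1200 token ids
chunks = chunk_fixed(doc, size=512)
```

With a 512-token size and 15% overlap, adjacent chunks share 77 tokens — tune `size` to the answer-granularity rule above.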

Anti-pattern

One chunk size for all corpora. A single 512-token chunk cuts PDF tables in half and destroys code. Always inspect chunks by eye on 20 random documents before deploying.

Hybrid retrieval + reranking

Production RAG almost always uses hybrid retrieval:

  1. BM25 / keyword (Elasticsearch, OpenSearch). Catches exact matches — SKUs, error codes, names. Vector search misses these.
  2. Dense (embedding similarity, e.g., cosine over bge-large, text-embedding-3-small). Catches semantic matches.
  3. Reciprocal Rank Fusion (RRF) merges the two ranked lists: score(d) = Σ 1/(k + rank_i(d)) with k=60 typical. Beats either alone by 5–15% on retrieval benchmarks.
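RRF itself is tiny. A minimal sketch of the merge over two ranked lists (the doc ids are made up):

```python
def rrf_merge(ranked_lists, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1/(k + rank),
    rank starting at 1. Returns doc ids sorted by fused score, best first."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25  = ["sku-123", "doc-a", "doc-b"]   # keyword ranking
dense = ["doc-a", "doc-c", "sku-123"]   # vector ranking
fused = rrf_merge([bm25, dense])        # doc-a wins: ranked high in both lists
```

Note how a doc that appears in both lists outranks one that tops a single list — that agreement bonus is where the 5–15% gain comes from.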

Then rerank with a cross-encoder (e.g., Cohere rerank, BGE-reranker, Jina). Cross-encoders score each (query, doc) pair jointly — 100× slower than bi-encoders but a 10–20% NDCG lift. Pattern: hybrid-retrieve 50 candidates → cross-encoder rerank → keep the top 5 for the LLM.

flowchart LR
  Q[User query] --> QE[Query encoder]
  Q --> QR[Query rewriter<br/>HyDE / decomp]
  QR --> BM[BM25 index]
  QE --> VX[Vector index<br/>HNSW]
  BM --> RRF[RRF merge]
  VX --> RRF
  RRF --> XR[Cross-encoder rerank]
  XR --> CTX[Top-5 chunks]
  CTX --> LLM[LLM + citations]
  LLM --> A[Answer + refs]

Query-side tricks: HyDE, decomposition, multi-hop

HyDE (Hypothetical Document Embeddings)

A user query "what did we decide about paid leave?" embeds poorly because it's a question, but docs are statements. HyDE: ask the LLM to draft a fake answer, then embed that. The fake doc embedding matches real doc embeddings much better. Empirically +10–15% retrieval recall at a cost of one LLM call.
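The mechanics are one extra LLM call before embedding. A sketch with stand-in `llm` and `embed` callables — both hypothetical here; swap in your real LLM client and embedding model:

```python
def hyde_embed(query, llm, embed):
    """HyDE: embed a hypothetical answer instead of the raw question.
    `llm` and `embed` are stand-ins for real LLM / embedding-model calls."""
    fake_doc = llm(f"Write a short passage that answers: {query}")
    return embed(fake_doc)  # statement-shaped text matches doc embeddings better

# Toy stubs just to show the flow: draft an answer, then embed the draft.
vec = hyde_embed(
    "what did we decide about paid leave?",
    llm=lambda prompt: "The 2024 policy grants 20 days of paid leave.",
    embed=lambda text: [float(len(text))],  # toy 1-dim "embedding"
)
```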

Query decomposition

Multi-part queries ("compare our Q3 revenue to Anthropic's Q3 revenue") embed poorly. Have the LLM split into sub-queries, retrieve for each, union the context. Standard with tool-using agents; pairs with the planner/executor pattern (see the agentic-patterns page).

Multi-hop

"Who is the CTO of the company that acquired Figma?" requires two retrievals. Solutions: (a) LLM planner executes hop 1, then hop 2; (b) GraphRAG builds an entity graph up front so multi-hop is a graph traversal; (c) fine-tuned retriever like Self-RAG that knows when to retrieve more.

Grounding, citations, and multi-modal

Citation-grounded generation

Prompt the LLM to cite source IDs inline: [doc_id:chunk_id]. Post-process to verify every citation actually exists. For Anthropic-style interviews, explicitly discuss how you would measure grounding: train a separate NLI model to check if each output claim is entailed by the cited passage (this is what FActScore and the Bespoke-Minicheck tools do).
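The existence check is a simple post-process. A sketch assuming the inline `[doc_id:chunk_id]` format above; ids and the answer text are made up:

```python
import re

def verify_citations(answer, known_chunks):
    """Split inline [doc_id:chunk_id] citations into those that exist in
    the retrieved set and those the model hallucinated."""
    cited = re.findall(r"\[([\w-]+:[\w-]+)\]", answer)
    valid = [c for c in cited if c in known_chunks]
    invalid = [c for c in cited if c not in known_chunks]
    return valid, invalid

answer = "Revenue grew 12% [10k-2023:c4] per the filing [10k-2023:c9]."
valid, invalid = verify_citations(answer, known_chunks={"10k-2023:c4"})
```

Anything in `invalid` is a hallucinated citation — drop it, or regenerate the answer. The entailment check (NLI) is the stronger, separate step.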

Long-context vs RAG

Claude 3 Opus has 200k context; Gemini 1.5 Pro has 1M. Doesn't RAG become obsolete? No — 1M-token prompts cost $3+ per query at current API pricing, dwarfing retrieval cost. Also, "needle in a haystack" recall degrades past ~100k tokens. Long context complements RAG: retrieve coarsely to fit ~100k tokens, then let the model sift.

Multi-modal RAG

PDFs contain tables, charts, figures. Two approaches: (1) OCR + layout parsing (LayoutLMv3, Unstructured.io), embed text per table/figure; (2) vision-language embeddings (CLIP, SigLIP, ColPali) — embed page images directly. ColPali shows that indexing page images with a late-interaction VLM beats OCR pipelines on visual-heavy corpora.

OpenAI-specific

OpenAI's Assistants / File Search product builds RAG for you: auto-chunking (800 tokens, 400 overlap), hybrid retrieval, auto-reranking. Good for MVP, but you lose control of chunk strategy and eval — bring your own pipeline for production.

Anthropic-specific

Anthropic's "Contextual Retrieval" technique (public engineering blog) prepends a chunk-specific one-sentence context to each chunk before embedding. Example: "From the 2023 10-K, section on liquidity: [original chunk]". Reported a 49% reduction in retrieval failures with contextual embeddings plus contextual BM25, rising to 67% with reranking added. This is the interview bait: cite it.

Evaluation and interview checklist

You can't improve RAG without offline eval. Minimum viable eval harness:

  • Retrieval metrics: hit@k, recall@k, MRR on a labeled set of 200 (query, gold-chunk) pairs.
  • Grounding metrics: faithfulness (does answer only use retrieved context?), citation precision.
  • End-to-end: LLM-as-judge on answer quality, calibrated against 50 human-rated golds. See the evaluation page.
  • Online: thumbs up/down, regret rate, follow-up-question rate.
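hit@k and MRR over a labeled set are a few lines each. A minimal sketch (the chunk ids are made up):

```python
def hit_at_k(retrieved, gold, k):
    """1 if the gold chunk appears in the top-k retrieved ids, else 0."""
    return int(gold in retrieved[:k])

def mrr(queries):
    """Mean Reciprocal Rank over (retrieved ranking, gold chunk id) pairs;
    a miss contributes 0."""
    total = 0.0
    for retrieved, gold in queries:
        if gold in retrieved:
            total += 1.0 / (retrieved.index(gold) + 1)
    return total / len(queries)

labeled = [
    (["c1", "c7", "c3"], "c7"),  # gold at rank 2 -> RR 0.5
    (["c2", "c4", "c9"], "c2"),  # gold at rank 1 -> RR 1.0
    (["c5", "c6", "c8"], "c0"),  # miss           -> RR 0.0
]
```

Run these on the same 200-pair labeled set before and after every retrieval change; a chunking or reranker tweak that doesn't move recall@k or MRR is vibes.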

Anti-patterns

  • No eval set. You're flying blind; every "improvement" is vibes.
  • Vector-only retrieval. You will miss exact-match queries. Always hybrid.
  • Top-k = 20 stuffed into the prompt. Cross-encoder rerank to top-5 or you dilute the LLM.
  • One global index for multi-tenant data. Permission leaks become compliance incidents.
  • Ignoring freshness. Index at write time or hourly; daily rebuilds mean stale answers.

Whiteboard checklist: corpus size and shape → chunking strategy → embedding model → hybrid index (BM25 + HNSW) → query rewrite (HyDE / decomp) → cross-encoder rerank → citation-grounded generation → eval harness → freshness pipeline → per-tenant permissions.
