Why LLM eval is hard

Classical ML eval is "compute loss on held-out test set". LLMs break this in three ways:

  • Open-ended outputs. There is no single correct string — "Paris is the capital of France" and "France's capital is Paris" are both right. Exact-match and BLEU miss this.
  • Behavioral surface is huge. You care not just about accuracy but helpfulness, harmlessness, instruction-following, tool-use correctness, calibration, and format compliance. Each has its own evaluator.
  • Benchmark contamination. MMLU, HellaSwag, GSM8K have leaked into training data. Public-benchmark scores are no longer trustworthy for frontier models; you need private holdouts.

Interview budget numbers: a single MMLU run is 14k questions ≈ $30 on GPT-4-class pricing. A serious release regression run is 100k-1M samples across ~20 benchmarks, costing thousands to tens of thousands of dollars and 4-24 hours.
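
To sanity-check such figures, a back-of-envelope cost calculator helps. The token counts and per-million-token prices below are illustrative assumptions, not current pricing:

```python
# Back-of-envelope benchmark cost estimator. Token counts and prices
# are illustrative assumptions, not any provider's actual rates.

def run_cost_usd(n_samples: int,
                 in_tokens_per_sample: int = 500,
                 out_tokens_per_sample: int = 100,
                 usd_per_1m_in: float = 2.50,
                 usd_per_1m_out: float = 10.00) -> float:
    """Estimated cost of one benchmark run in USD."""
    cost_in = n_samples * in_tokens_per_sample * usd_per_1m_in / 1e6
    cost_out = n_samples * out_tokens_per_sample * usd_per_1m_out / 1e6
    return cost_in + cost_out

mmlu = run_cost_usd(14_000)       # ~$31.5 under these assumptions
release = run_cost_usd(500_000)   # a 500k-sample regression slice: ~$1,125
```

Scaling the same arithmetic to 1M samples lands in the low thousands of dollars per run, consistent with the "thousands to tens of thousands" figure once you multiply across ~20 benchmarks and multiple checkpoints.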

Source cross-reference

Chip Huyen's Designing ML Systems Ch.6 covers offline eval taxonomy and slice-based eval. Gulli's Agentic Design Patterns Ch.19 covers agent eval specifically (trajectory analysis, end-task success, failure-mode taxonomy). Also worth reading: HELM (Liang et al.), MT-Bench/Arena-Hard (lmsys), and the BIG-bench paper.

Offline benchmarks: MMLU, GPQA, HELM, and their limits

Know these by heart:

Benchmark | What it measures | Size | Ceiling / SOTA
MMLU | 57 subjects, multiple choice | 14k | ~90% (frontier); contaminated
GPQA | PhD-level multiple choice, Google-proof | 448 | ~60% (o1); <40% for most
HumanEval / MBPP | Python code synthesis | 164 / 974 | saturated at ~95%; use LiveCodeBench instead
GSM8K / MATH | Math word problems | 8.5k / 12.5k | >95% on GSM8K; MATH still stretching
HELM | Holistic, multi-metric (accuracy, calibration, robustness, fairness, bias, toxicity, efficiency) | varies | research standard
SWE-bench / SWE-bench Verified | Real GitHub issues resolved | 2.3k / 500 | ~70% (Claude + agents)
Tau-bench, WebArena | Agent tool-use, browsing | varies | active research

When a benchmark score is in a press release, assume contamination unless the paper reports a decontamination methodology. Frontier labs now run private holdouts (held-out test sets never public) to get trustworthy numbers.

LLM-as-judge and its biases

For open-ended outputs, the standard eval is: show two responses to a judge LLM, ask which is better, aggregate pairwise preferences into an Elo score. MT-Bench and Arena-Hard (lmsys.org) do this at scale.

Known biases:

  • Position bias. The judge favors whichever response is shown first. Fix: swap positions and average.
  • Verbosity bias. Longer responses are rated higher even when wrong. Fix: length-normalize or use a length-debiased Elo (lmsys publishes one).
  • Self-preference. A model judge prefers outputs that sound like its own. Fix: use multiple diverse judges (GPT-4 + Claude + Gemini) and average.
  • Format sycophancy. Outputs with bullet points and headers get higher scores. Fix: instruct the judge to ignore format.

Despite biases, LLM-as-judge correlates 0.8+ with human ratings on many tasks — good enough for regression detection, not good enough for novel capability evaluation.
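
The swap-and-average fix for position bias can be sketched in a few lines. Here `judge` stands in for the LLM call and returns the probability that the first response shown is better; the toy biased judge is purely illustrative:

```python
# Position-debiased pairwise judgment: run the judge twice with the
# responses swapped and average. `judge` is any callable returning the
# probability (0..1) that the FIRST response shown is better; in
# production it would wrap an LLM API call.

def debiased_preference(judge, resp_a: str, resp_b: str) -> float:
    """P(A better than B), averaged over both presentation orders."""
    p_a_first = judge(resp_a, resp_b)          # A shown in the first slot
    p_a_second = 1.0 - judge(resp_b, resp_a)   # A shown in the second slot
    return (p_a_first + p_a_second) / 2.0

# Toy judge with a pure position bias: always leans +0.25 toward
# whichever response occupies the first slot, ignoring content.
def biased_judge(first: str, second: str) -> float:
    return 0.5 + 0.25

# Swapping and averaging cancels a pure position bias back to 0.5:
print(debiased_preference(biased_judge, "resp A", "resp B"))  # 0.5
```

The same two-call structure is also where you hook in verbosity fixes (length-normalize before judging) and multi-judge panels (average `debiased_preference` across judge models).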

Anti-pattern

Using GPT-4 as judge to evaluate GPT-4 outputs. Self-preference + shared failure modes. Use a different family of model as judge, or use humans for the critical slice.

Online eval, A/B testing, guardrails

Offline is a proxy; online is truth. But LLMs make online eval harder: responses are different every time, users can't be blinded, and quality is subjective.

Guardrailed A/B

  1. Define north-star metric (e.g., task completion rate, user satisfaction thumbs rate).
  2. Define guardrail metrics that must not regress (latency p95, cost per request, refusal rate, unsafe-output rate, format-compliance rate).
  3. Shadow or canary 1%→10%→50% of traffic.
  4. If north-star up AND all guardrails flat → ship.
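
The four steps above reduce to a gate function. A minimal sketch, assuming metrics arrive as dicts and using illustrative guardrail names and tolerances (a real gate would also apply significance tests rather than raw deltas):

```python
# Guardrailed A/B ship gate. Metric names and tolerances are
# illustrative; in practice they come from the experiment dashboard.

GUARDRAILS = {                 # metric -> max tolerated relative regression
    "latency_p95_ms": 0.05,
    "cost_per_request": 0.05,
    "refusal_rate": 0.02,
    "unsafe_output_rate": 0.0,  # zero tolerance
}

def ship_decision(north_star_delta: float,
                  control: dict, candidate: dict) -> bool:
    """Ship only if the north-star improved AND no guardrail regressed
    beyond its tolerance."""
    if north_star_delta <= 0:
        return False
    for metric, tol in GUARDRAILS.items():
        if candidate[metric] > control[metric] * (1 + tol):
            return False
    return True
```

Note the asymmetry: the north-star must move up, while guardrails only need to stay flat (within tolerance), which matches the "up AND flat" rule above.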

Pairwise Elo (Arena)

Inside a company, build a battle arena: route 1% of traffic to both the candidate and baseline, show both answers to a panel of internal raters or an LLM judge, aggregate into Elo. This is how lmsys Chatbot Arena scores frontier models; internal versions at OpenAI/Anthropic follow the same pattern.
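
The aggregation step is the standard Elo update (Chatbot Arena itself now fits a Bradley–Terry model offline, but the online Elo form is the usual internal sketch):

```python
# Standard Elo update used to aggregate pairwise battle outcomes.
# K=32 is a common choice for the update step size.

def elo_update(r_winner: float, r_loser: float, k: float = 32.0):
    """Return updated (winner, loser) ratings after one battle."""
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)       # surprise-weighted update
    return r_winner + delta, r_loser - delta

# Two equal-rated models: the winner gains 16 points, the loser drops 16.
print(elo_update(1000.0, 1000.0))  # (1016.0, 984.0)
```

Upsets move ratings more: a 1000-rated model beating a 1200-rated one gains far more than 16 points, which is what lets the arena converge with relatively few battles per pair.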

Behavioral sampling

Log a stratified random sample of production traffic for human or automated review. Stratify by tenant, prompt category, refusal, and length. 1% of production is usually enough to catch regressions within hours.
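
A minimal sketch of the stratified sampler, assuming each log row is a dict with `tenant`, `category`, `refused`, and `response_len` fields (the schema and the 500-token length bucketing are assumptions for illustration):

```python
import random
from collections import defaultdict

# Stratified 1% sample of logged production traffic, stratified by
# tenant, prompt category, refusal, and response-length bucket.
# Field names are an assumed log schema, purely for illustration.

def stratified_sample(logs: list, rate: float = 0.01, seed: int = 0) -> list:
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in logs:
        key = (row["tenant"], row["category"], row["refused"],
               row["response_len"] // 500)    # 500-token length buckets
        strata[key].append(row)
    sample = []
    for rows in strata.values():
        n = max(1, round(len(rows) * rate))   # keep >= 1 row per stratum
        sample.extend(rng.sample(rows, n))
    return sample
```

The `max(1, ...)` floor is the point of stratifying: rare but important strata (e.g. refusals for a small tenant) still surface in the review queue instead of being drowned out by the head of the distribution.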

Regression suites, red teaming, agent eval

Regression suites

Every LLM team maintains a golden set: ~1k-10k prompts hand-curated to cover capability buckets (reasoning, math, code, refusals, long-context recall, tool use, multilingual, safety). Run on every model checkpoint. Diff the outputs. Any flip from correct to incorrect on a high-traffic bucket is a ship-block.
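
The "diff the outputs" step can be sketched as a flip detector. Record shapes and bucket names here are illustrative; grading (exact match, judge, etc.) is assumed to happen upstream:

```python
# Regression-gate sketch: compare graded golden-set results between two
# checkpoints and surface correct->incorrect flips in high-traffic
# buckets. Each result maps prompt_id -> (bucket, passed). The bucket
# names below are illustrative.

HIGH_TRAFFIC = {"reasoning", "code", "refusals"}

def ship_blocking_flips(baseline: dict, candidate: dict) -> list:
    """Prompt IDs that flipped correct->incorrect in a high-traffic bucket."""
    flips = []
    for pid, (bucket, was_ok) in baseline.items():
        _, now_ok = candidate[pid]
        if was_ok and not now_ok and bucket in HIGH_TRAFFIC:
            flips.append(pid)
    return flips
```

An empty return list is the gate condition; flips in low-traffic buckets still get logged for the failure taxonomy, they just don't block the ship by themselves.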

Red teaming

Adversarial eval: humans (or another LLM) try to elicit policy violations. Outputs feed back into the safety RLHF pipeline. Anthropic's Usage Policies and the OpenAI Model Spec both grew from red-team findings. For interviews, mention Constitutional AI (Bai et al.) — Anthropic's method of using a "constitution" (set of written principles) to generate AI-generated feedback for training.

Agent eval

Agents fail in ways single-turn eval misses: infinite loops, choosing the wrong tool, calling a tool with the wrong arguments, hallucinated citations. You need trajectory-level eval:

  • End-task success: did the final state match gold?
  • Step-level rubric: were intermediate tool calls reasonable?
  • Cost/latency profile: tokens and wall-time per task.
  • Failure taxonomy: categorize every failure into a bucket for targeted improvement.
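
The four checks above can be sketched as a per-trajectory scorer plus a taxonomy counter. The trajectory shape, step limit, and failure labels are illustrative assumptions:

```python
from collections import Counter

# Trajectory-level scoring sketch. A trajectory is a dict with a list of
# tool-call "steps" and a "final_state"; field names and failure labels
# are illustrative assumptions, not a standard schema.

def score_trajectory(traj: dict, gold_state: dict,
                     max_steps: int = 20) -> dict:
    steps = traj["steps"]
    failure = None
    if len(steps) >= max_steps:                    # runaway / infinite loop
        failure = "loop_or_runaway"
    elif any(s.get("tool_error") for s in steps):  # bad tool or bad args
        failure = "bad_tool_call"
    elif traj["final_state"] != gold_state:        # end-task check vs gold
        failure = "wrong_final_state"
    return {
        "success": failure is None,
        "failure": failure,
        "n_steps": len(steps),
        "tokens": sum(s.get("tokens", 0) for s in steps),  # cost profile
    }

def failure_taxonomy(results: list) -> Counter:
    """Bucket counts across a run: the input to targeted improvement."""
    return Counter(r["failure"] for r in results if r["failure"])
```

The taxonomy counter is what makes improvements compound: fix the largest bucket, re-run, and watch the distribution shift rather than chasing individual transcripts.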
flowchart LR
  M[New model checkpoint] --> R[Regression suite ~10k prompts]
  M --> B[Public benchmarks MMLU/GPQA/SWE-bench]
  M --> P[Private holdouts]
  R --> J[LLM-as-judge]
  R --> H[Human raters 500 samples]
  J --> D[Dashboard + diff]
  H --> D
  D --> G{All guardrails green?}
  G -->|yes| C[Canary 1%]
  G -->|no| F[Fix / retrain]
  C --> O[Online A/B pairwise Elo]

OpenAI vs Anthropic eval culture, checklist

OpenAI-specific

OpenAI publishes "Evals" — an open-source framework (github.com/openai/evals) with hundreds of community-contributed eval templates. Their internal process emphasizes model-spec compliance: the Model Spec is a public document that every release is scored against. Ship decisions favor aggregate capability gains even at the cost of some behavioral regression, which post-training then mitigates.

Anthropic-specific

Anthropic's public safety cards (released alongside each Claude version) include detailed eval tables: capability (MMLU, GPQA, agentic), safety (BBQ, refusal-rate, jailbreak robustness), and honesty (TruthfulQA-style probes). They emphasize Constitutional AI and red-teaming with internal and external partners before release. Interview hook: describe how you would reproduce a mini safety card for your model change.

Anti-patterns

  • Ship on MMLU +1.5%. Contamination + narrow capability. Always pair with behavioral evals.
  • Single judge for Elo. Self-preference. Use a panel.
  • No guardrail metrics in A/B. You'll regress latency, cost, or safety silently.
  • Evaluating in isolation from product. Eval prompts should be drawn from real traffic distribution, not synthetic.
  • No failure taxonomy. Without buckets, improvements don't compound.

Whiteboard checklist: define north-star + guardrails → offline suite (public + private + golden) → LLM-as-judge with bias fixes → human raters on stratified sample → regression gate → canary with guardrailed A/B → pairwise Elo arena → agent trajectory eval → red-team loop → public eval/safety card.
