Why LLM eval is hard

Classical ML eval is "compute loss on held-out test set". LLMs break this in three ways:

  • Open-ended outputs. There is no single correct string — "Paris is the capital of France" and "France's capital is Paris" are both right. Exact-match and BLEU miss this.
  • Behavioral surface is huge. You care not just about accuracy but helpfulness, harmlessness, instruction-following, tool-use correctness, calibration, and format compliance. Each has its own evaluator.
  • Benchmark contamination. MMLU, HellaSwag, GSM8K have leaked into training data. Public-benchmark scores are no longer trustworthy for frontier models; you need private holdouts.

Interview budget numbers: a single MMLU run is 14k questions ≈ $30 on GPT-4-class pricing. A serious release regression run is 100k-1M samples across ~20 benchmarks, costing thousands to tens of thousands of dollars and 4-24 hours.
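
To sanity-check such figures, a back-of-envelope cost calculator helps. The token counts and per-million-token prices below are illustrative assumptions, not current pricing:

```python
# Back-of-envelope benchmark cost estimator. Token counts and prices
# are illustrative assumptions, not any provider's actual rates.

def run_cost_usd(n_samples: int,
                 in_tokens_per_sample: int = 500,
                 out_tokens_per_sample: int = 100,
                 usd_per_1m_in: float = 2.50,
                 usd_per_1m_out: float = 10.00) -> float:
    """Estimated cost of one benchmark run in USD."""
    cost_in = n_samples * in_tokens_per_sample * usd_per_1m_in / 1e6
    cost_out = n_samples * out_tokens_per_sample * usd_per_1m_out / 1e6
    return cost_in + cost_out

mmlu = run_cost_usd(14_000)       # ~$31.5 under these assumptions
release = run_cost_usd(500_000)   # a 500k-sample regression slice: ~$1,125
```

Scaling the same arithmetic to 1M samples lands in the low thousands of dollars per run, consistent with the "thousands to tens of thousands" figure once you multiply across ~20 benchmarks and multiple checkpoints.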

Source cross-reference

Chip Huyen's Designing ML Systems Ch.6 covers offline eval taxonomy and slice-based eval. Gulli's Agentic Design Patterns Ch.19 covers agent eval specifically (trajectory analysis, end-task success, failure-mode taxonomy). Also worth reading: HELM (Liang et al.), MT-Bench/Arena-Hard (lmsys), and the BIG-bench paper.

Offline benchmarks: MMLU, GPQA, HELM, and their limits

Know these by heart:

Benchmark | What it measures | Size | Ceiling / SOTA
MMLU | 57 subjects, multiple choice | 14k | ~90% (frontier); contaminated
GPQA | PhD-level multiple choice, Google-proof | 448 | ~60% (o1); <40% for most
HumanEval / MBPP | Python code synthesis | 164 / 974 | saturated at ~95%; use LiveCodeBench instead
GSM8K / MATH | Math word problems | 8.5k / 12.5k | >95% on GSM8K; MATH still stretching
HELM | Holistic, multi-metric (accuracy, calibration, robustness, fairness, bias, toxicity, efficiency) | varies | research standard
SWE-bench / SWE-bench Verified | Real GitHub issues resolved | 2.3k / 500 | ~70% (Claude + agents)
Tau-bench, WebArena | Agent tool-use, browsing | varies | active research

When a benchmark score is in a press release, assume contamination unless the paper reports a decontamination methodology. Frontier labs now run private holdouts (held-out test sets never public) to get trustworthy numbers.

LLM-as-judge and its biases

For open-ended outputs, the standard eval is: show two responses to a judge LLM, ask which is better, aggregate pairwise preferences into an Elo score. MT-Bench and Arena-Hard (lmsys.org) do this at scale.

Known biases:

  • Position bias. The judge favors whichever response is shown first. Fix: swap positions and average.
  • Verbosity bias. Longer responses are rated higher even when wrong. Fix: length-normalize or use a length-debiased Elo (lmsys publishes one).
  • Self-preference. A model judge prefers outputs that sound like its own. Fix: use multiple diverse judges (GPT-4 + Claude + Gemini) and average.
  • Format sycophancy. Outputs with bullet points and headers get higher scores. Fix: instruct the judge to ignore format.

Despite biases, LLM-as-judge correlates 0.8+ with human ratings on many tasks — good enough for regression detection, not good enough for novel capability evaluation.
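
The swap-and-average fix for position bias can be sketched in a few lines. Here `judge` stands in for the LLM call and returns the probability that the first response shown is better; the toy biased judge is purely illustrative:

```python
# Position-debiased pairwise judgment: run the judge twice with the
# responses swapped and average. `judge` is any callable returning the
# probability (0..1) that the FIRST response shown is better; in
# production it would wrap an LLM API call.

def debiased_preference(judge, resp_a: str, resp_b: str) -> float:
    """P(A better than B), averaged over both presentation orders."""
    p_a_first = judge(resp_a, resp_b)          # A shown in the first slot
    p_a_second = 1.0 - judge(resp_b, resp_a)   # A shown in the second slot
    return (p_a_first + p_a_second) / 2.0

# Toy judge with a pure position bias: always leans +0.25 toward
# whichever response occupies the first slot, ignoring content.
def biased_judge(first: str, second: str) -> float:
    return 0.5 + 0.25

# Swapping and averaging cancels a pure position bias back to 0.5:
print(debiased_preference(biased_judge, "resp A", "resp B"))  # 0.5
```

The same two-call structure is also where you hook in verbosity fixes (length-normalize before judging) and multi-judge panels (average `debiased_preference` across judge models).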

Anti-pattern

Using GPT-4 as judge to evaluate GPT-4 outputs. Self-preference + shared failure modes. Use a different family of model as judge, or use humans for the critical slice.

Online eval, A/B testing, guardrails

Offline is a proxy; online is truth. But LLMs make online eval harder: responses are different every time, users can't be blinded, and quality is subjective.

Guardrailed A/B

  1. Define north-star metric (e.g., task completion rate, user satisfaction thumbs rate).
  2. Define guardrail metrics that must not regress (latency p95, cost per request, refusal rate, unsafe-output rate, format-compliance rate).
  3. Shadow or canary 1%→10%→50% of traffic.
  4. If north-star up AND all guardrails flat → ship.
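
The four steps above reduce to a gate function. A minimal sketch, assuming metrics arrive as dicts and using illustrative guardrail names and tolerances (a real gate would also apply significance tests rather than raw deltas):

```python
# Guardrailed A/B ship gate. Metric names and tolerances are
# illustrative; in practice they come from the experiment dashboard.

GUARDRAILS = {                 # metric -> max tolerated relative regression
    "latency_p95_ms": 0.05,
    "cost_per_request": 0.05,
    "refusal_rate": 0.02,
    "unsafe_output_rate": 0.0,  # zero tolerance
}

def ship_decision(north_star_delta: float,
                  control: dict, candidate: dict) -> bool:
    """Ship only if the north-star improved AND no guardrail regressed
    beyond its tolerance."""
    if north_star_delta <= 0:
        return False
    for metric, tol in GUARDRAILS.items():
        if candidate[metric] > control[metric] * (1 + tol):
            return False
    return True
```

Note the asymmetry: the north-star must move up, while guardrails only need to stay flat (within tolerance), which matches the "up AND flat" rule above.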

Pairwise Elo (Arena)

Inside a company, build a battle arena: route 1% of traffic to both the candidate and baseline, show both answers to a panel of internal raters or an LLM judge, aggregate into Elo. This is how lmsys Chatbot Arena scores frontier models; internal versions at OpenAI/Anthropic follow the same pattern.
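
The aggregation step is the standard Elo update (Chatbot Arena itself now fits a Bradley–Terry model offline, but the online Elo form is the usual internal sketch):

```python
# Standard Elo update used to aggregate pairwise battle outcomes.
# K=32 is a common choice for the update step size.

def elo_update(r_winner: float, r_loser: float, k: float = 32.0):
    """Return updated (winner, loser) ratings after one battle."""
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)       # surprise-weighted update
    return r_winner + delta, r_loser - delta

# Two equal-rated models: the winner gains 16 points, the loser drops 16.
print(elo_update(1000.0, 1000.0))  # (1016.0, 984.0)
```

Upsets move ratings more: a 1000-rated model beating a 1200-rated one gains far more than 16 points, which is what lets the arena converge with relatively few battles per pair.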

Behavioral sampling

Log a stratified random sample of production traffic for human or automated review. Stratify by tenant, prompt category, refusal, and length. 1% of production is usually enough to catch regressions within hours.
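
A minimal sketch of the stratified sampler, assuming each log row is a dict with `tenant`, `category`, `refused`, and `response_len` fields (the schema and the 500-token length bucketing are assumptions for illustration):

```python
import random
from collections import defaultdict

# Stratified 1% sample of logged production traffic, stratified by
# tenant, prompt category, refusal, and response-length bucket.
# Field names are an assumed log schema, purely for illustration.

def stratified_sample(logs: list, rate: float = 0.01, seed: int = 0) -> list:
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in logs:
        key = (row["tenant"], row["category"], row["refused"],
               row["response_len"] // 500)    # 500-token length buckets
        strata[key].append(row)
    sample = []
    for rows in strata.values():
        n = max(1, round(len(rows) * rate))   # keep >= 1 row per stratum
        sample.extend(rng.sample(rows, n))
    return sample
```

The `max(1, ...)` floor is the point of stratifying: rare but important strata (e.g. refusals for a small tenant) still surface in the review queue instead of being drowned out by the head of the distribution.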

Regression suites, red teaming, agent eval

Regression suites

Every LLM team maintains a golden set: ~1k-10k prompts hand-curated to cover capability buckets (reasoning, math, code, refusals, long-context recall, tool use, multilingual, safety). Run on every model checkpoint. Diff the outputs. Any flip from correct to incorrect on a high-traffic bucket is a ship-block.
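
The "diff the outputs" step can be sketched as a flip detector. Record shapes and bucket names here are illustrative; grading (exact match, judge, etc.) is assumed to happen upstream:

```python
# Regression-gate sketch: compare graded golden-set results between two
# checkpoints and surface correct->incorrect flips in high-traffic
# buckets. Each result maps prompt_id -> (bucket, passed). The bucket
# names below are illustrative.

HIGH_TRAFFIC = {"reasoning", "code", "refusals"}

def ship_blocking_flips(baseline: dict, candidate: dict) -> list:
    """Prompt IDs that flipped correct->incorrect in a high-traffic bucket."""
    flips = []
    for pid, (bucket, was_ok) in baseline.items():
        _, now_ok = candidate[pid]
        if was_ok and not now_ok and bucket in HIGH_TRAFFIC:
            flips.append(pid)
    return flips
```

An empty return list is the gate condition; flips in low-traffic buckets still get logged for the failure taxonomy, they just don't block the ship by themselves.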

Red teaming

Adversarial eval: humans (or another LLM) try to elicit policy violations. Outputs feed back into the safety RLHF pipeline. Anthropic's Usage Policies and the OpenAI Model Spec both grew from red-team findings. For interviews, mention Constitutional AI (Bai et al.) — Anthropic's method of using a "constitution" (set of written principles) to generate AI-generated feedback for training.

Agent eval

Agents fail in ways single-turn eval misses: infinite loops, choosing the wrong tool, calling a tool with the wrong arguments, hallucinated citations. You need trajectory-level eval:

  • End-task success: did the final state match gold?
  • Step-level rubric: were intermediate tool calls reasonable?
  • Cost/latency profile: tokens and wall-time per task.
  • Failure taxonomy: categorize every failure into a bucket for targeted improvement.
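
The four checks above can be sketched as a per-trajectory scorer plus a taxonomy counter. The trajectory shape, step limit, and failure labels are illustrative assumptions:

```python
from collections import Counter

# Trajectory-level scoring sketch. A trajectory is a dict with a list of
# tool-call "steps" and a "final_state"; field names and failure labels
# are illustrative assumptions, not a standard schema.

def score_trajectory(traj: dict, gold_state: dict,
                     max_steps: int = 20) -> dict:
    steps = traj["steps"]
    failure = None
    if len(steps) >= max_steps:                    # runaway / infinite loop
        failure = "loop_or_runaway"
    elif any(s.get("tool_error") for s in steps):  # bad tool or bad args
        failure = "bad_tool_call"
    elif traj["final_state"] != gold_state:        # end-task check vs gold
        failure = "wrong_final_state"
    return {
        "success": failure is None,
        "failure": failure,
        "n_steps": len(steps),
        "tokens": sum(s.get("tokens", 0) for s in steps),  # cost profile
    }

def failure_taxonomy(results: list) -> Counter:
    """Bucket counts across a run: the input to targeted improvement."""
    return Counter(r["failure"] for r in results if r["failure"])
```

The taxonomy counter is what makes improvements compound: fix the largest bucket, re-run, and watch the distribution shift rather than chasing individual transcripts.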
flowchart LR
  M[New model checkpoint] --> R[Regression suite ~10k prompts]
  M --> B[Public benchmarks MMLU/GPQA/SWE-bench]
  M --> P[Private holdouts]
  R --> J[LLM-as-judge]
  R --> H[Human raters 500 samples]
  J --> D[Dashboard + diff]
  H --> D
  D --> G{All guardrails green?}
  G -->|yes| C[Canary 1%]
  G -->|no| F[Fix / retrain]
  C --> O[Online A/B pairwise Elo]

OpenAI vs Anthropic eval culture, checklist

OpenAI-specific

OpenAI publishes "Evals" — an open-source framework (github.com/openai/evals) with hundreds of community-contributed eval templates. Their internal process emphasizes model-spec compliance: the Model Spec is a public document that every release is scored against. Ship decisions favor aggregate capability gains even at the cost of some behavioral regression, which post-training then mitigates.

Anthropic-specific

Anthropic's public safety cards (released alongside each Claude version) include detailed eval tables: capability (MMLU, GPQA, agentic), safety (BBQ, refusal-rate, jailbreak robustness), and honesty (TruthfulQA-style probes). They emphasize Constitutional AI and red-teaming with internal and external partners before release. Interview hook: describe how you would reproduce a mini safety card for your model change.

Anti-patterns

  • Ship on MMLU +1.5%. Contamination + narrow capability. Always pair with behavioral evals.
  • Single judge for Elo. Self-preference. Use a panel.
  • No guardrail metrics in A/B. You'll regress latency, cost, or safety silently.
  • Evaluating in isolation from product. Eval prompts should be drawn from real traffic distribution, not synthetic.
  • No failure taxonomy. Without buckets, improvements don't compound.

Whiteboard checklist: define north-star + guardrails → offline suite (public + private + golden) → LLM-as-judge with bias fixes → human raters on stratified sample → regression gate → canary with guardrailed A/B → pairwise Elo arena → agent trajectory eval → red-team loop → public eval/safety card.
