OpenAI ★★ Frequent Medium Evals Benchmark Regression

O30 · Design an Evals Platform

Verified source

OpenAI Evals is open source (github.com/openai/evals). Asked at onsite interviews. Credibility: A.

Architecture

flowchart LR
  Authors --> REG[Eval Registry]
  REG --> RUN[Run Orchestrator]
  RUN --> INF[Target model endpoint]
  RUN --> GR[Grader - rule / LLM]
  GR --> DB[(Results DB)]
  DB --> UI[Compare UI]
  DB --> ALERT[Regression alert]
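
The flow in the diagram (registry, run orchestrator, target model, grader, results) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the design of any real platform: `EvalSpec`, `ExactMatchGrader`, `load_jsonl`, and `run_eval` are hypothetical names, the `input` and `ideal` sample fields are assumed, the model endpoint is stood in for by a plain callable, and the YAML descriptor is replaced by a dataclass to keep the example dependency-free.

```python
import json
from dataclasses import dataclass
from typing import Callable, Protocol


class Grader(Protocol):
    """Pluggable grader interface (hypothetical): returns a score in [0, 1]."""

    def grade(self, expected: str, actual: str) -> float: ...


class ExactMatchGrader:
    """Rule-based grader: 1.0 on exact match after trimming whitespace, else 0.0."""

    def grade(self, expected: str, actual: str) -> float:
        return 1.0 if expected.strip() == actual.strip() else 0.0


@dataclass
class EvalSpec:
    """Stand-in for a registry entry: a name, a dataset, and a grader."""

    name: str
    samples: list[dict]  # rows parsed from a JSONL dataset
    grader: Grader


def load_jsonl(text: str) -> list[dict]:
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def run_eval(spec: EvalSpec, model: Callable[[str], str]) -> dict:
    """Call the model on every sample, grade each output, and summarise."""
    if not spec.samples:
        raise ValueError("eval has no samples")
    scores = [
        spec.grader.grade(sample["ideal"], model(sample["input"]))
        for sample in spec.samples
    ]
    # This dict is what would be written to the results DB.
    return {"eval": spec.name, "n": len(scores), "accuracy": sum(scores) / len(scores)}
```

A run would then look like `run_eval(EvalSpec("demo", load_jsonl(text), ExactMatchGrader()), call_model)`, where `call_model` is whatever function wraps the target endpoint.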

Key decisions

**Eval as code + data**: YAML descriptor + JSONL dataset + pluggable grader class.
**Deterministic replay**: pin seed, model version, and prompt so that a rerun is intended to give a byte-identical result; store each run as an artefact.
**LLM-as-judge**: check the judge against a calibration set; fall back to rule-based grading if the judge drifts.
**Regression gate in CI**: block the release if any eval drops by more than a threshold versus the baseline.
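
The regression gate can be illustrated with a small comparison function. This is a sketch under assumptions: `find_regressions` and `gate` are hypothetical names, scores are assumed to be higher-is-better floats keyed by eval name, the 0.02 default threshold is arbitrary, and treating an eval that is missing from the candidate run as a failure is a design choice of this example.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    threshold: float = 0.02,
) -> dict[str, float]:
    """Return {eval_name: drop} for every eval that fell by more than `threshold`."""
    regressions: dict[str, float] = {}
    for name, base_score in baseline.items():
        if name not in candidate:
            # Assumption: an eval that did not run counts as a full regression.
            regressions[name] = base_score
            continue
        drop = base_score - candidate[name]
        if drop > threshold:
            regressions[name] = drop
    return regressions


def gate(
    baseline: dict[str, float],
    candidate: dict[str, float],
    threshold: float = 0.02,
) -> int:
    """Return a process exit code: 0 lets the release through, 1 blocks it."""
    failed = find_regressions(baseline, candidate, threshold)
    for name, drop in sorted(failed.items()):
        print(f"REGRESSION {name}: dropped {drop:.3f} (threshold {threshold:.3f})")
    return 1 if failed else 0
```

In CI, the script would end with something like `sys.exit(gate(baseline, candidate))`, so that a non-zero exit code fails the pipeline step.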

Follow-ups

Non-determinism? Sample N times and report mean ± standard error.
Cost control? Run a smoke-test subset before the full run.
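
Both follow-ups can be sketched with the standard library. The function names `mean_and_stderr` and `smoke_subset` are hypothetical; the standard error is computed here as the sample standard deviation divided by sqrt(N), and the smoke subset is assumed to be a seeded random sample so that it stays the same across runs.

```python
import math
import random
import statistics


def mean_and_stderr(scores: list[float]) -> tuple[float, float]:
    """Return (mean, standard error) over N repeated runs of the same eval."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least 2 runs to estimate the standard error")
    mean = statistics.fmean(scores)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    stderr = statistics.stdev(scores) / math.sqrt(n)
    return mean, stderr


def smoke_subset(samples: list, k: int, seed: int = 0) -> list:
    """Pick a fixed random subset of at most k samples for a cheap pre-run."""
    rng = random.Random(seed)  # seeded so the subset is reproducible
    return rng.sample(samples, min(k, len(samples)))
```

For example, `mean_and_stderr([0.8, 0.9, 1.0, 0.9])` gives a mean of 0.9 and a standard error of roughly 0.041, which would be reported as 0.90 ± 0.04.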

Related study-guide topics