OpenAI ★★ Frequent Medium Evals Benchmark Regression

O30 · Design an Evals Platform

Verified source

OpenAI Evals is open source (github.com/openai/evals). Asked at onsite interviews. Credibility: A.

Architecture

flowchart LR
  Authors --> REG[Eval Registry]
  REG --> RUN[Run Orchestrator]
  RUN --> INF[Target model endpoint]
  RUN --> GR[Grader - rule / LLM]
  GR --> DB[(Results DB)]
  DB --> UI[Compare UI]
  DB --> ALERT[Regression alert]
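
The flow in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not the OpenAI Evals implementation: `EvalSpec`, `ExactMatchGrader`, `REGISTRY`, `run_eval`, the stub model, and the `input` / `ideal` field names are all assumptions made for this sketch, and the dataset is held inline as JSONL text where a real system would store a path or URI.

```python
import json
from dataclasses import dataclass
from typing import Callable, Protocol


class Grader(Protocol):
    """Pluggable grader interface: returns a score in [0, 1]."""

    def grade(self, expected: str, actual: str) -> float: ...


class ExactMatchGrader:
    """Rule-based grader: 1.0 on an exact match after trimming whitespace."""

    def grade(self, expected: str, actual: str) -> float:
        return 1.0 if expected.strip() == actual.strip() else 0.0


@dataclass
class EvalSpec:
    name: str
    dataset_jsonl: str  # hypothetical: inline JSONL, one sample per line
    grader: Grader


# Eval registry: maps an eval name to its spec.
REGISTRY: dict = {}


def register(spec: EvalSpec) -> None:
    REGISTRY[spec.name] = spec


def run_eval(spec: EvalSpec, model: Callable[[str], str]) -> dict:
    """Run orchestrator: call the target model per sample, grade, aggregate."""
    scores = []
    for line in spec.dataset_jsonl.splitlines():
        if not line.strip():
            continue  # skip blank lines in the dataset
        sample = json.loads(line)
        output = model(sample["input"])
        scores.append(spec.grader.grade(sample["ideal"], output))
    if not scores:
        raise ValueError(f"eval {spec.name!r} has no samples")
    return {"eval": spec.name, "n": len(scores), "accuracy": sum(scores) / len(scores)}


register(EvalSpec(
    name="capital-cities",
    dataset_jsonl=(
        '{"input": "Capital of France?", "ideal": "Paris"}\n'
        '{"input": "Capital of Japan?", "ideal": "Tokyo"}'
    ),
    grader=ExactMatchGrader(),
))


def stub_model(prompt: str) -> str:
    """Stand-in for the target model endpoint; always answers 'Paris'."""
    return "Paris"


result = run_eval(REGISTRY["capital-cities"], stub_model)
```

In a fuller design the returned dictionary would be written to the results DB, which the compare UI and the regression alert then read.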

Key decisions

  • **Eval as code + data**: each eval is a YAML descriptor, a JSONL dataset, and a pluggable grader class.
  • **Deterministic replay**: the same seed, model, and prompt must reproduce a byte-identical result; the result is stored as an artefact.
  • **LLM-as-judge** with a calibration set; fall back to rule-based grading if the judge drifts.
  • **Regression gate in CI**: block the release if any eval drops by more than a threshold relative to the baseline.
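
The last two decisions both come down to a threshold check, which the following sketch illustrates. Everything named here is hypothetical: `judge_agreement`, `choose_grader`, `find_regressions`, the 0.9 minimum agreement, and the 0.02 drop threshold are placeholder choices for illustration, not values from any particular platform.

```python
def judge_agreement(judge_labels: list, human_labels: list) -> float:
    """Fraction of calibration samples where the LLM judge matches the human label."""
    if not human_labels or len(judge_labels) != len(human_labels):
        raise ValueError("calibration sets must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def choose_grader(agreement: float, min_agreement: float = 0.9) -> str:
    """Fall back to rule-based grading when judge agreement is too low."""
    return "llm_judge" if agreement >= min_agreement else "rule_based"


def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    threshold: float = 0.02,  # assumed: maximum tolerated absolute drop
) -> list[str]:
    """Return one message per eval that dropped by more than `threshold`."""
    failures = []
    for name, base in sorted(baseline.items()):
        if name not in candidate:
            failures.append(f"{name}: missing from candidate run")
            continue
        drop = base - candidate[name]
        if drop > threshold:
            failures.append(f"{name}: {base:.3f} -> {candidate[name]:.3f} (drop {drop:.3f})")
    return failures


# Example: the judge disagrees with humans on 1 of 4 calibration samples.
agreement = judge_agreement([1, 1, 0, 1], [1, 0, 0, 1])
grader_choice = choose_grader(agreement)

# Example: "math" dips within tolerance, "code" regresses past the threshold.
failures = find_regressions(
    baseline={"math": 0.80, "code": 0.70},
    candidate={"math": 0.79, "code": 0.60},
)
```

A CI job could exit non-zero whenever `failures` is non-empty, which is what blocks the release. A fixed absolute threshold is the simplest choice; on noisy evals it may be worth comparing the drop against the run-to-run standard error instead.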

Follow-ups

  • Non-determinism? Sample N times and report the mean ± standard error.
  • Cost control? Run a smoke-test subset before the full run.
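
Both answers are easy to make concrete. The sketch below uses only the Python standard library; the helper names `mean_and_stderr` and `smoke_subset`, the sample scores, and the subset size are assumptions chosen for illustration.

```python
import math
import random
import statistics


def mean_and_stderr(scores: list) -> tuple:
    """Mean and standard error of the mean over N repeated runs."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two runs to estimate the standard error")
    mean = statistics.fmean(scores)
    # Standard error = sample standard deviation / sqrt(N).
    stderr = statistics.stdev(scores) / math.sqrt(n)
    return mean, stderr


def smoke_subset(samples: list, k: int, seed: int = 0) -> list:
    """Pick a fixed, seeded subset so the smoke test is itself reproducible."""
    rng = random.Random(seed)
    return rng.sample(samples, min(k, len(samples)))


# Four runs of the same eval with sampling enabled (made-up scores).
mean, stderr = mean_and_stderr([0.7, 0.8, 0.9, 0.8])
print(f"accuracy = {mean:.3f} ± {stderr:.3f}")

# A 10-sample smoke test drawn from a 1000-sample dataset.
subset = smoke_subset(list(range(1000)), k=10, seed=42)
```

Seeding the subset keeps smoke-test results comparable between runs, at the cost of always exercising the same samples.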

Related study-guide topics