Search and filter across 85+ benchmarks from 2018–2026, spanning both the instruction-following and conversational-engagement families. For each benchmark we record the dialog count, average turns, whether its data was human- or LLM-curated, and the evaluation protocol (rule-based, human-judge, LLM-judge). Columns sort on click.
| Benchmark | Family · Domain | Year | Dialogs | Avg Turns | Rule | Human | LLM | Criteria |
|---|---|---|---|---|---|---|---|---|
Dialog counts are paper-reported; some listed sizes are approximate. Turns are averages where available. The "Rule / Human / LLM" columns indicate the evaluation protocol(s) used in the benchmark's own paper; many benchmarks combine several. For complete methodology details, see the paper's benchmark-data-overview appendix.
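For readers who want to mirror the table programmatically, the sketch below shows one way the standardized schema and the search/sort behavior could be represented in TypeScript. The field names, helper functions, and filter logic are illustrative assumptions, not the site's actual code.

```typescript
// A minimal sketch of the record shape behind the table; field names and the
// helpers below are illustrative assumptions, not the site's actual code.
interface BenchmarkRow {
  name: string;
  family: "instruction-following" | "conversational-engagement";
  domain: string;
  year: number;            // 2018–2026
  dialogs: number;         // paper-reported dialog count (approximate for some)
  avgTurns: number | null; // average turns, when the paper reports them
  rule: boolean;           // uses at least one deterministic rule-based metric
  human: boolean;          // uses human judges
  llm: boolean;            // uses LLM judges
  criteria: string;        // short description of the evaluation criteria
}

// Free-text search plus an optional family filter, roughly how the search box behaves.
function filterRows(
  rows: BenchmarkRow[],
  query: string,
  family?: BenchmarkRow["family"],
): BenchmarkRow[] {
  const q = query.trim().toLowerCase();
  return rows.filter(
    (r) =>
      (family === undefined || r.family === family) &&
      (q === "" ||
        r.name.toLowerCase().includes(q) ||
        r.domain.toLowerCase().includes(q)),
  );
}

// Click-to-sort helper for a numeric column such as Year, Dialogs, or Avg Turns.
function sortBy(
  rows: BenchmarkRow[],
  key: "year" | "dialogs" | "avgTurns",
  descending = true,
): BenchmarkRow[] {
  return [...rows].sort((a, b) => {
    const av = a[key] ?? Number.NEGATIVE_INFINITY; // rows missing avgTurns sink when sorting descending
    const bv = b[key] ?? Number.NEGATIVE_INFINITY;
    return descending ? bv - av : av - bv;
  });
}
```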
Prior surveys often listed benchmarks without a comparable schema. We standardized every benchmark to the same columns so that like-for-like comparison across methodological choices is possible.
A ✓ under Rule means the benchmark uses at least one deterministic rule-based metric. Multiple ticks indicate combined evaluation — for example, HealthBench uses both human and LLM judging with physician-authored rubrics.
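As a small illustration of how combined protocols surface in the table, the snippet below derives the ✓ marks from per-protocol flags; the function name and row shape are assumptions made for the example, not the page's rendering code.

```typescript
// Illustrative only: deriving the ✓ marks from per-protocol flags on a row.
function protocolTicks(row: { rule: boolean; human: boolean; llm: boolean }) {
  const tick = (used: boolean) => (used ? "✓" : "");
  return { rule: tick(row.rule), human: tick(row.human), llm: tick(row.llm) };
}

// A benchmark combining human and LLM judging (as HealthBench does) renders as:
// { rule: "", human: "✓", llm: "✓" }
protocolTicks({ rule: false, human: true, llm: true });
```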
Benchmarks without an explicit multi-turn design are excluded, even if they test dialogue skills. Agentic and multi-modal benchmarks are likewise out of scope; see the paper's Introduction for the full boundary-setting criteria.
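The scoping rule can be read as a simple predicate over three annotations; the flags below are hypothetical labels used only to make the inclusion criteria concrete.

```typescript
// Hypothetical annotations used only to make the scoping rule concrete;
// the paper's actual boundary-setting criteria are given in its Introduction.
interface ScopeAnnotations {
  multiTurn: boolean;  // explicit multi-turn design
  agentic: boolean;    // tool-use / agent benchmark
  multiModal: boolean; // images, audio, or other non-text modalities
}

// Included only if explicitly multi-turn and neither agentic nor multi-modal.
const inScope = (b: ScopeAnnotations): boolean =>
  b.multiTurn && !b.agentic && !b.multiModal;
```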