Benchmark Explorer · Part III

A unified audit of multi-turn LLM benchmarks.

Search and filter across 85+ benchmarks published between 2018 and 2026, spanning both instruction-following and conversational-engagement families. For each benchmark we record the dialogue count, average turns, whether data was human- or LLM-curated, and the evaluation protocol (rule-based, human-judge, LLM-judge). Columns sort on click.

Benchmark | Family · Domain | Year | Dialogs | Avg Turns | Rule | Human | LLM | Criteria

Dialog counts are paper-reported; some listed sizes are approximate. Turns are averages where available. The "Rule / Human / LLM" columns indicate the evaluation protocol(s) used in the benchmark's own paper — many benchmarks combine several. For complete methodology details, see the paper's benchmark-data-overview appendix.
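A row in this table can be sketched as a small record; the field names below are illustrative, not the site's actual data model:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BenchmarkRow:
    """One row of the explorer table (field names are illustrative)."""
    name: str
    family: str                  # e.g. "instruction-following" or "conversational-engagement"
    domain: str
    year: int
    dialogs: int                 # paper-reported; may be approximate
    avg_turns: Optional[float]   # None when the paper reports no average
    rule: bool                   # deterministic rule-based metric used
    human: bool                  # human-judge evaluation used
    llm: bool                    # LLM-judge evaluation used

# "Columns sort on click" corresponds to sorting the row list by a field:
rows = [
    BenchmarkRow("ExampleA", "instruction-following", "general", 2021, 500, 3.2, True, False, True),
    BenchmarkRow("ExampleB", "conversational-engagement", "health", 2024, 1200, 5.0, False, True, True),
]
by_year_desc = sorted(rows, key=lambda r: r.year, reverse=True)
```

The two rows here are made-up placeholders, included only so the sort has something to act on.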

Why this table exists

Prior surveys often listed benchmarks without a shared schema, making direct comparison difficult. We standardized every benchmark on the same columns so that like-for-like comparison of methodological choices is possible.

How to read the ✓/✗ columns

A ✓ under Rule means the benchmark uses at least one deterministic rule-based metric. Multiple ticks indicate combined evaluation — for example, HealthBench uses both human and LLM judging with physician-authored rubrics.
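Under this reading, finding benchmarks with combined evaluation reduces to counting ticks. A minimal sketch (the HealthBench ticks for Human and LLM follow the text above; the other entries and values are invented for illustration):

```python
# Each entry maps the Rule / Human / LLM columns to booleans.
table = {
    "HealthBench":     {"rule": False, "human": True,  "llm": True},
    "ExampleRuleOnly": {"rule": True,  "human": False, "llm": False},
}

def combined_evaluation(ticks: dict) -> bool:
    """True when a benchmark combines two or more evaluation protocols."""
    return sum(ticks.values()) >= 2

combined = [name for name, ticks in table.items() if combined_evaluation(ticks)]
```

This mirrors how a reader would scan a row: two or more ✓ marks means the paper mixed protocols rather than relying on one.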

What's missing

Benchmarks without explicit multi-turn design are excluded, even if they test dialogue. Agentic or multi-modal benchmarks are out of scope — see the paper's Introduction for the full boundary-setting criteria.