When the user's intent is explicit, the job of the model is execution precision over many turns: satisfy constraints, track evolving requirements, refuse to drift under sycophantic pressure, ask a clarifying question when the request is ambiguous. This page covers the three principal domains — General IF, Math, and Coding.
General IF spans cross-domain benchmarks that exercise instruction precision across mixed-topic conversations. This is where the multi-turn literature started, and where most of the methodological debates about judges, rubrics, and constraint tracking still play out.
Multi-turn IF descends from MT-Bench — 80 two-turn dialogues with GPT-4 as judge, introduced by Zheng et al. (2023). Its influence on the field is hard to overstate: nearly every subsequent IF benchmark expands MT-Bench on one of four axes. Length (MT-Bench++, MT-Bench-101) stretches the horizon past two turns. Ability taxonomy (MT-Eval, Bai et al.'s perceptivity / adaptability / interactivity split) decomposes what "good IF" means. Multilinguality (M2Lingual, Multi-IF) stresses cross-lingual constraint handling. Fairness (FairMT-Bench, 10k conversations) probes whether biased responses emerge over turns even when single-turn prompts look clean.
Later benchmarks sharpened "good IF" in subtler directions: clarification policy — when should the model ask rather than guess? — has gone from aspirational (AQA-Bench) to operational (Clarify-When-Necessary, Huang 2025). Structural conversation flow (StructFlowBench) and instruction hierarchy (IHEval) test whether models respect layered rules. Consistency under pressure (FIRM, with its PWC and CARG metrics) measures how easily user pushback flips a correct answer. TURNWISEEVAL pairs each multi-turn dialogue with a semantically matched single-turn variant to cleanly isolate the multi-turn effect.
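The "consistency under pressure" idea can be made concrete with a simple flip-rate proxy. This is a hedged sketch, not FIRM's actual PWC or CARG definition: of the dialogues that start with a correct answer, how many end up wrong after user pushback?

```python
# Hedged sketch: a flip-rate proxy for sycophancy under challenge.
# This is NOT FIRM's PWC/CARG metric, just the underlying question:
# of the answers that start correct, how many flip after pushback?

def flip_rate(dialogues):
    """dialogues: list of per-turn correctness flags, e.g. [True, False].
    Turn 0 is the initial answer; later turns follow user challenges."""
    started_correct = [d for d in dialogues if d and d[0]]
    if not started_correct:
        return 0.0
    flipped = sum(1 for d in started_correct if not all(d))
    return flipped / len(started_correct)


# Two dialogues start correct; one of them caves under challenge.
print(flip_rate([[True, True], [True, False], [False, True]]))  # 0.5
```

A robust model keeps this number near zero even as the challenge turns multiply; the third dialogue above is excluded because a wrong initial answer cannot "flip" in the sycophantic sense.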
The running theme across these benchmarks is that adding turns doesn't solve failures; it reveals new ones. Context decay over distance (Bai et al., 2024; Sirdeshmukh et al., 2025; Laban et al., 2025) degrades correctness even when every individual instruction is trivially simple. Sycophancy gets worse as dialogue lengthens. And basic clarification — the single most pedagogically valuable multi-turn behavior — remains absent in most models.
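The paired single- vs multi-turn protocol that isolates these decay effects reduces to a simple gap metric: score each task in both presentations and report the mean difference. All names below are illustrative stand-ins, not any benchmark's actual API.

```python
# Illustrative sketch of the paired single- vs multi-turn protocol:
# each task is scored in both presentations and the per-task gap is
# aggregated. Function names here are hypothetical stand-ins.

def multi_turn_gap(tasks, run_single, run_multi, grade):
    """Mean score drop when the same task is delivered over turns.

    run_single(task) -> model answer to the flattened, single-turn prompt
    run_multi(task)  -> model answer after the turn-by-turn dialogue
    grade(task, ans) -> 1.0 if correct else 0.0
    """
    gaps = []
    for task in tasks:
        s = grade(task, run_single(task))
        m = grade(task, run_multi(task))
        gaps.append(s - m)          # positive gap = multi-turn hurts
    return sum(gaps) / len(gaps)


# Toy usage: a model whose answers don't change across presentations
# shows zero gap, by construction.
tasks = [{"answer": "42"}, {"answer": "7"}]
grade = lambda t, a: 1.0 if a == t["answer"] else 0.0
gap = multi_turn_gap(tasks, lambda t: "42", lambda t: "42", grade)
print(gap)  # 0.0
```

Because the single-turn variant is semantically matched, any positive gap can be attributed to the multi-turn delivery itself rather than to task difficulty.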
80 two-turn dialogues, GPT-4-as-judge. The foundation the rest of the field builds on.
Four-ability taxonomy: follow-up, refinement, expansion, recollection.
10k dialogues; explicit and implicit bias amplification across turns.
155 dialogues × 4.15 turns (avg.); structural-flow adherence under multi-turn constraints.
700 dialogues × 9 turns; First Sway Round, PWC, and CARG metrics for how sycophantic the model is under challenge.
Paired single- vs. multi-turn variants to isolate the multi-turn effect cleanly.
See the full audit on the Benchmarks page.
Math is the first domain where multi-turn decisively outperforms single-turn: iterative correction, tool-assisted arithmetic, and teaching-oriented scaffolding each require conversational turns, not longer prompts.
Multi-turn math benchmarks split cleanly into two evaluation families. The solving family — MathChat-Bench, MINT, SBSC — measures whether iterative feedback makes the model arrive at the correct final answer. MINT (≤5 turns) formalizes interactive feedback; MathChat organizes four follow-up sub-tasks (QA, error correction, error analysis, problem generation); SBSC pushes into Olympiad territory with AMC / AIME / MathOdyssey-level step-by-step code execution.
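The solving family's core loop (propose, receive feedback, retry within a turn budget) can be sketched in a few lines. This is a schematic of the MINT-style protocol under a 5-turn cap; `model` and `check` are hypothetical stand-ins for a real LLM call and grader.

```python
# Sketch of a MINT-style interactive solving loop, capped at 5 turns.
# `model` and `check` are stand-ins for a real LLM call and grader;
# the protocol (propose -> feedback -> retry) is the point, not the names.

def solve_interactively(problem, model, check, max_turns=5):
    history = [("user", problem)]
    for turn in range(max_turns):
        answer = model(history)
        ok, feedback = check(answer)        # e.g. execute code, compare result
        if ok:
            return answer, turn + 1         # solved within the budget
        history.append(("assistant", answer))
        history.append(("user", feedback))  # language/execution feedback
    return None, max_turns                  # budget exhausted


# Toy model that corrects itself once it sees the feedback string.
model = lambda h: "4" if any("not 5" in m for _, m in h) else "5"
check = lambda a: (a == "4", "wrong, it is not 5")
answer, turns = solve_interactively("2 + 2 = ?", model, check)
print(answer, turns)  # 4 2
```

The benchmark question is then simply how often the loop terminates with a correct answer, and after how many turns, compared against a single-shot baseline.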
The tutoring family measures something different and arguably harder: pedagogical quality. Can the model scaffold, probe misconceptions, withhold the answer when appropriate? MathDial introduced this frame with 2,861 tutoring dialogues between teachers and student-LLM pairs, along with Success@k / Telling@k metrics. MRBench and Beyond Final Answers added rubric-driven evaluation; Intent Matters layered on pedagogical intent annotations. A striking empirical finding: fine-tuning a small model on MathDial outperforms prompting a much larger model on the same tutoring tasks.
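Success@k and Telling@k can be read as complementary event rates over the first k tutor turns. The sketch below is one plausible reading of those metrics, not MathDial's official implementation, and the dialogue schema is invented for illustration.

```python
# Hedged sketch of Success@k / Telling@k-style tutoring metrics.
# One plausible reading, not MathDial's official code:
#   Success@k -- fraction of dialogues where the student solves the
#                problem within k tutor turns;
#   Telling@k -- fraction where the tutor reveals the answer within k.

def at_k(dialogues, event, k):
    """event(dialogue, k) -> bool; returns the fraction of dialogues
    in which the event occurs within the first k turns."""
    return sum(event(d, k) for d in dialogues) / len(dialogues)

solved_at = lambda d, k: any(t["student_solved"] for t in d[:k])
told_at = lambda d, k: any(t["tutor_told_answer"] for t in d[:k])

dialogues = [
    [{"student_solved": False, "tutor_told_answer": False},
     {"student_solved": True,  "tutor_told_answer": False}],
    [{"student_solved": True,  "tutor_told_answer": True}],
]
print(at_k(dialogues, solved_at, 2))  # 1.0 (both solved within 2 turns)
print(at_k(dialogues, told_at, 2))    # 0.5 (one tutor gave it away)
```

The tension between the two numbers is the whole point: a tutor that simply tells the answer maximizes Success@k while failing pedagogically, which is why the metrics are reported as a pair.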
On the methods side, M-DPO / M-KTO (Xiong et al., 2024) show that multi-turn preference optimization pays off — MATH performance rises 77.5 → 83.9 — and MathChat-Agent shows that the right protocol matters as much as model size. Multi-turn math is also where the earliest work distinguishing "solving" from "teaching" took shape; both frames now appear across the Conversational Engagement domains too.
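The core of a DPO-style objective applied to whole trajectories fits in a few lines: sum per-turn log-probs, take the margin against a frozen reference model, and apply the standard logistic loss. This is a minimal sketch of that general recipe, not necessarily M-DPO's exact formulation.

```python
import math

# Sketch of a DPO-style loss over whole multi-turn trajectories:
# standard DPO applied to trajectory-level log-probs. Not necessarily
# the exact M-DPO formulation from the paper.

def trajectory_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Each argument: list of per-turn assistant log-probs for the
    preferred (w) / dispreferred (l) trajectory under the policy
    and under the frozen reference model."""
    margin = (sum(logp_w) - sum(ref_logp_w)) - (sum(logp_l) - sum(ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # -log sigmoid


# A policy that upweights the preferred trajectory relative to the
# reference gets a positive margin and hence a small loss.
loss = trajectory_dpo_loss([-1.0, -1.0], [-3.0, -3.0],
                           [-2.0, -2.0], [-2.0, -2.0])
print(round(loss, 3))  # 0.513
```

Summing over turns is what makes the objective "multi-turn": credit flows to entire conversational trajectories rather than to isolated responses.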
"The most important finding from multi-turn math evaluation is not that bigger models do better — it's that tutoring is a different capability from solving, and that SFT on tutoring dialogues beats prompting on tutoring tasks." — adapted from Macina et al., MathDial
2,861 tutoring dialogues; defines the teacher-student multi-turn frame.
Interactive reasoning with execution and language feedback; ≤5 turns.
Multi-turn preference optimization; +6-10% on MATH.
Rubric-driven tutor evaluation over pedagogical dimensions.
AMC / AIME / MathOdyssey step-by-step code-trace performance.
Evaluates tutoring quality beyond correctness of the final answer.
Coding is inherently conversational — specify, generate, execute, debug. Multi-turn coding benchmarks pair generation with execution feedback and test whether the model can refine, clarify, and switch modalities (text ↔ code) appropriately.
The interactive-coding lineage begins with CodeGen / MTPB (2022) and InterCode (2023), which introduced a sandbox loop with Try-again / ReAct / Plan-and-Solve protocols. The critical observation: pairing generation with execution feedback closes most of the gap between small and large models on many coding tasks — the model simply needs a signal to correct against.
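The Try-again protocol from that lineage is essentially a generate-execute-retry loop where the traceback is the correction signal. Below is an illustrative sketch under that reading; `exec` stands in for a real sandbox and all names are invented for the example.

```python
import traceback

# Sketch of an InterCode-style "Try again" protocol: generate code, run
# it, feed the traceback back as the correction signal, repeat. `exec`
# stands in for a real sandbox and is NOT safe for untrusted code; the
# loop shape, not the executor, is the point.

def try_again_loop(generate, run_tests, max_turns=10):
    feedback = None
    for turn in range(1, max_turns + 1):
        code = generate(feedback)              # LLM call in a real harness
        namespace = {}
        try:
            exec(code, namespace)              # execute the candidate program
            run_tests(namespace)               # raises on failure
            return turn                        # solved: turns consumed
        except Exception:
            feedback = traceback.format_exc()  # signal to correct against
    return None                                # budget exhausted


# Toy generator: buggy first attempt, fixed once feedback arrives.
gen = lambda fb: ("def add(a, b): return a + b" if fb
                  else "def add(a, b): return a - b")

def run_tests(ns):
    assert ns["add"](2, 3) == 5

print(try_again_loop(gen, run_tests))  # 2
```

The closed-gap observation follows directly from this structure: even a weak generator converges once the failing assertion or traceback tells it what to fix.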
Methods split along three axes. Data-driven refinement — OpenCodeInterpreter's 68k-pair Code-Feedback dataset shows that training on iterative refinement dialogues transfers to real use. Modality steering — CodeSteer learns when to reason in text versus when to emit code and execute. Socratic debugging — TreeInstruct structures questions, not fixes, as the bug-finding signal; ClarifyGPT pushes the same idea into specification.
Domain-specific branches cover SQL (MMSQL's 6,493 dialogues with EM / EX / Dual Assessment, DySQLBench's dynamic multi-turn text-to-SQL), EHR agents (EHRAgent), and security (MT-Sec adds security-aware scoring to correctness). The newest wave — CONVCODEWORLD, CodeFlowBench, MultiCodeIF — pushes iterative-refinement horizons longer and formalizes fine-grained IF-with-feedback. Which iterative protocol converges fastest, and on what class of problem, remains empirically open.
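The EM / EX distinction that text-to-SQL suites report is worth seeing concretely: exact match compares query strings, execution accuracy compares result sets, and a prediction can pass one without the other. A minimal sketch using an in-memory SQLite database (real harnesses such as MMSQL's Dual Assessment are more involved):

```python
import sqlite3

# Sketch of the two standard text-to-SQL scores: exact match (EM) on
# the normalized query string, and execution accuracy (EX) on result
# sets. Real harnesses normalize more aggressively; this shows the
# core distinction only.

def exact_match(pred, gold):
    norm = lambda q: " ".join(q.lower().split())
    return norm(pred) == norm(gold)

def execution_match(pred, gold, conn):
    try:
        return conn.execute(pred).fetchall() == conn.execute(gold).fetchall()
    except sqlite3.Error:
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "bob")])

pred = "SELECT name FROM users WHERE id = 1"
gold = "SELECT name FROM users WHERE id < 2"
print(exact_match(pred, gold))            # False: different strings
print(execution_match(pred, gold, conn))  # True: same result set
```

The gap between the two scores is precisely why conversational SQL benchmarks report both: follow-up reformulations often produce syntactically different but semantically equivalent queries.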
Try-again / ReAct / Plan-and-Solve; 10-turn coding interaction sandbox.
68k Code-Feedback pairs; closes the gap with proprietary interpreters.
Learned text-vs-code modality switching.
Conversational text-to-SQL with reformulation and follow-ups.
Tree-structured Socratic debugging — questions as the primary signal.
Correctness + security scoring over multi-turn code generation.
See the full coding benchmark audit on the Benchmarks page.
Conversational Engagement — where intent is open and the model plays a consultative role — covers healthcare, education, role-play, and jailbreak.
Explore Conversational Engagement →