When the user's intent is explicit, the job of the model is execution precision over many turns: satisfy constraints, track evolving requirements, refuse to drift under sycophantic pressure, ask a clarifying question when the request is ambiguous. This page covers the three principal domains — General IF, Math, and Coding.
General IF spans cross-domain benchmarks that exercise instruction precision across mixed-topic conversations. This is where the multi-turn literature started, and where most of the methodological debates about judges, rubrics, and constraint tracking still play out.
Multi-turn IF descends from MT-Bench — 80 two-turn dialogues with GPT-4 as judge, introduced by Zheng et al. (2023). Its influence on the field is hard to overstate: nearly every subsequent IF benchmark expands MT-Bench on one of four axes. Length (MT-Bench++, MT-Bench-101) stretches the horizon past two turns. Ability taxonomy (MT-Eval, Bai et al.'s perceptivity / adaptability / interactivity split) decomposes what "good IF" means. Multilinguality (M2Lingual, Multi-IF) stresses cross-lingual constraint handling. Fairness (FairMT-Bench, 10k conversations) probes whether biased responses emerge over turns even when single-turn prompts look clean.
Later benchmarks sharpened "good IF" in subtler directions: clarification policy — when should the model ask rather than guess? — has gone from aspirational (AQA-Bench) to operational (Clarify-When-Necessary, Huang 2025). Structural conversation flow (StructFlowBench) and instruction hierarchy (IHEval) test whether models respect layered rules. Consistency under pressure (FIRM, with its PWC and CARG metrics) measures how easily user pushback flips a correct answer. TURNWISEEVAL pairs each multi-turn dialogue with a semantically matched single-turn variant to cleanly isolate the multi-turn effect.
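The "consistency under pressure" idea can be made concrete with a simple flip-rate proxy. This is a hedged sketch, not FIRM's actual PWC or CARG definition: of the dialogues that start with a correct answer, how many end up wrong after user pushback?

```python
# Hedged sketch: a flip-rate proxy for sycophancy under challenge.
# This is NOT FIRM's PWC/CARG metric, just the underlying question:
# of the answers that start correct, how many flip after pushback?

def flip_rate(dialogues):
    """dialogues: list of per-turn correctness flags, e.g. [True, False].
    Turn 0 is the initial answer; later turns follow user challenges."""
    started_correct = [d for d in dialogues if d and d[0]]
    if not started_correct:
        return 0.0
    flipped = sum(1 for d in started_correct if not all(d))
    return flipped / len(started_correct)


# Two dialogues start correct; one of them caves under challenge.
print(flip_rate([[True, True], [True, False], [False, True]]))  # 0.5
```

A robust model keeps this number near zero even as the challenge turns multiply; the third dialogue above is excluded because a wrong initial answer cannot "flip" in the sycophantic sense.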
The running theme across these benchmarks is that adding turns doesn't solve failures; it reveals new ones. Context decay over distance (Bai et al., 2024; Sirdeshmukh et al., 2025; Laban et al., 2025) degrades correctness even when every individual instruction is trivially simple. Sycophancy gets worse as dialogue lengthens. And basic clarification — the single most pedagogically valuable multi-turn behavior — remains absent in most models.
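The paired single- vs multi-turn protocol that isolates these decay effects reduces to a simple gap metric: score each task in both presentations and report the mean difference. All names below are illustrative stand-ins, not any benchmark's actual API.

```python
# Illustrative sketch of the paired single- vs multi-turn protocol:
# each task is scored in both presentations and the per-task gap is
# aggregated. Function names here are hypothetical stand-ins.

def multi_turn_gap(tasks, run_single, run_multi, grade):
    """Mean score drop when the same task is delivered over turns.

    run_single(task) -> model answer to the flattened, single-turn prompt
    run_multi(task)  -> model answer after the turn-by-turn dialogue
    grade(task, ans) -> 1.0 if correct else 0.0
    """
    gaps = []
    for task in tasks:
        s = grade(task, run_single(task))
        m = grade(task, run_multi(task))
        gaps.append(s - m)          # positive gap = multi-turn hurts
    return sum(gaps) / len(gaps)


# Toy usage: a model whose answers don't change across presentations
# shows zero gap, by construction.
tasks = [{"answer": "42"}, {"answer": "7"}]
grade = lambda t, a: 1.0 if a == t["answer"] else 0.0
gap = multi_turn_gap(tasks, lambda t: "42", lambda t: "42", grade)
print(gap)  # 0.0
```

Because the single-turn variant is semantically matched, any positive gap can be attributed to the multi-turn delivery itself rather than to task difficulty.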
80 two-turn dialogues, GPT-4-as-judge. The foundation the rest of the field builds on.
Four-ability taxonomy: follow-up, refinement, expansion, recollection.
10k dialogues; explicit and implicit bias amplification across turns.
155 dialogues × 4.15 turns (avg.); structural-flow adherence under multi-turn constraints.
700 dialogues × 9 turns; First Sway Round, PWC, and CARG metrics for how sycophantic the model is under challenge.
Paired single- vs. multi-turn variants to isolate the multi-turn effect cleanly.
See the full audit on the Benchmarks page.
Math is the first domain where multi-turn decisively outperforms single-turn: iterative correction, tool-assisted arithmetic, and teaching-oriented scaffolding each require conversational turns, not longer prompts.
Multi-turn math benchmarks split cleanly into two evaluation families. The solving family — MathChat-Bench, MINT, SBSC — measures whether iterative feedback makes the model arrive at the correct final answer. MINT (≤5 turns) formalizes interactive feedback; MathChat organizes four follow-up sub-tasks (QA, error correction, error analysis, problem generation); SBSC pushes into Olympiad territory with AMC / AIME / MathOdyssey-level step-by-step code execution.
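The solving family's core loop (propose, receive feedback, retry within a turn budget) can be sketched in a few lines. This is a schematic of the MINT-style protocol under a 5-turn cap; `model` and `check` are hypothetical stand-ins for a real LLM call and grader.

```python
# Sketch of a MINT-style interactive solving loop, capped at 5 turns.
# `model` and `check` are stand-ins for a real LLM call and grader;
# the protocol (propose -> feedback -> retry) is the point, not the names.

def solve_interactively(problem, model, check, max_turns=5):
    history = [("user", problem)]
    for turn in range(max_turns):
        answer = model(history)
        ok, feedback = check(answer)        # e.g. execute code, compare result
        if ok:
            return answer, turn + 1         # solved within the budget
        history.append(("assistant", answer))
        history.append(("user", feedback))  # language/execution feedback
    return None, max_turns                  # budget exhausted


# Toy model that corrects itself once it sees the feedback string.
model = lambda h: "4" if any("not 5" in m for _, m in h) else "5"
check = lambda a: (a == "4", "wrong, it is not 5")
answer, turns = solve_interactively("2 + 2 = ?", model, check)
print(answer, turns)  # 4 2
```

The benchmark question is then simply how often the loop terminates with a correct answer, and after how many turns, compared against a single-shot baseline.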
The tutoring family measures something different and arguably harder: pedagogical quality. Can the model scaffold, probe misconceptions, withhold the answer when appropriate? MathDial introduced this frame with 2,861 tutoring dialogues between teachers and student-LLM pairs, along with Success@k / Telling@k metrics. MRBench and Beyond Final Answers added rubric-driven evaluation; Intent Matters layered on pedagogical intent annotations. A striking empirical finding: fine-tuning a small model on MathDial outperforms prompting a much larger model on the same tutoring tasks.
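Success@k and Telling@k can be read as complementary event rates over the first k tutor turns. The sketch below is one plausible reading of those metrics, not MathDial's official implementation, and the dialogue schema is invented for illustration.

```python
# Hedged sketch of Success@k / Telling@k-style tutoring metrics.
# One plausible reading, not MathDial's official code:
#   Success@k -- fraction of dialogues where the student solves the
#                problem within k tutor turns;
#   Telling@k -- fraction where the tutor reveals the answer within k.

def at_k(dialogues, event, k):
    """event(dialogue, k) -> bool; returns the fraction of dialogues
    in which the event occurs within the first k turns."""
    return sum(event(d, k) for d in dialogues) / len(dialogues)

solved_at = lambda d, k: any(t["student_solved"] for t in d[:k])
told_at = lambda d, k: any(t["tutor_told_answer"] for t in d[:k])

dialogues = [
    [{"student_solved": False, "tutor_told_answer": False},
     {"student_solved": True,  "tutor_told_answer": False}],
    [{"student_solved": True,  "tutor_told_answer": True}],
]
print(at_k(dialogues, solved_at, 2))  # 1.0 (both solved within 2 turns)
print(at_k(dialogues, told_at, 2))    # 0.5 (one tutor gave it away)
```

The tension between the two numbers is the whole point: a tutor that simply tells the answer maximizes Success@k while failing pedagogically, which is why the metrics are reported as a pair.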
On the methods side, M-DPO / M-KTO (Xiong et al., 2024) show that multi-turn preference optimization pays off — MATH performance rises 77.5 → 83.9 — and MathChat-Agent shows that the right protocol matters as much as model size. Multi-turn math is also where the earliest work distinguishing "solving" from "teaching" took shape; both frames now appear across the Conversational Engagement domains too.
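The core of a DPO-style objective applied to whole trajectories fits in a few lines: sum per-turn log-probs, take the margin against a frozen reference model, and apply the standard logistic loss. This is a minimal sketch of that general recipe, not necessarily M-DPO's exact formulation.

```python
import math

# Sketch of a DPO-style loss over whole multi-turn trajectories:
# standard DPO applied to trajectory-level log-probs. Not necessarily
# the exact M-DPO formulation from the paper.

def trajectory_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Each argument: list of per-turn assistant log-probs for the
    preferred (w) / dispreferred (l) trajectory under the policy
    and under the frozen reference model."""
    margin = (sum(logp_w) - sum(ref_logp_w)) - (sum(logp_l) - sum(ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # -log sigmoid


# A policy that upweights the preferred trajectory relative to the
# reference gets a positive margin and hence a small loss.
loss = trajectory_dpo_loss([-1.0, -1.0], [-3.0, -3.0],
                           [-2.0, -2.0], [-2.0, -2.0])
print(round(loss, 3))  # 0.513
```

Summing over turns is what makes the objective "multi-turn": credit flows to entire conversational trajectories rather than to isolated responses.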
"The most important finding from multi-turn math evaluation is not that bigger models do better — it's that tutoring is a different capability from solving, and that SFT on tutoring dialogues beats prompting on tutoring tasks." — adapted from Macina et al., MathDial
2,861 tutoring dialogues; defines the teacher-student multi-turn frame.
Interactive reasoning with execution and language feedback; ≤5 turns.
Multi-turn preference optimization; +6-10% on MATH.
Rubric-driven tutor evaluation over pedagogical dimensions.
AMC / AIME / MathOdyssey step-by-step code-trace performance.
Evaluates tutoring quality beyond correctness of the final answer.
Coding is inherently conversational — specify, generate, execute, debug. Multi-turn coding benchmarks pair generation with execution feedback and test whether the model can refine, clarify, and switch modalities (text ↔ code) appropriately.
The interactive-coding lineage begins with CodeGen / MTPB (2022) and InterCode (2023), which introduced a sandbox loop with Try-again / ReAct / Plan-and-Solve protocols. The critical observation: pairing generation with execution feedback closes most of the gap between small and large models on many coding tasks — the model simply needs a signal to correct against.
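The Try-again protocol from that lineage is essentially a generate-execute-retry loop where the traceback is the correction signal. Below is an illustrative sketch under that reading; `exec` stands in for a real sandbox and all names are invented for the example.

```python
import traceback

# Sketch of an InterCode-style "Try again" protocol: generate code, run
# it, feed the traceback back as the correction signal, repeat. `exec`
# stands in for a real sandbox and is NOT safe for untrusted code; the
# loop shape, not the executor, is the point.

def try_again_loop(generate, run_tests, max_turns=10):
    feedback = None
    for turn in range(1, max_turns + 1):
        code = generate(feedback)              # LLM call in a real harness
        namespace = {}
        try:
            exec(code, namespace)              # execute the candidate program
            run_tests(namespace)               # raises on failure
            return turn                        # solved: turns consumed
        except Exception:
            feedback = traceback.format_exc()  # signal to correct against
    return None                                # budget exhausted


# Toy generator: buggy first attempt, fixed once feedback arrives.
gen = lambda fb: ("def add(a, b): return a + b" if fb
                  else "def add(a, b): return a - b")

def run_tests(ns):
    assert ns["add"](2, 3) == 5

print(try_again_loop(gen, run_tests))  # 2
```

The closed-gap observation follows directly from this structure: even a weak generator converges once the failing assertion or traceback tells it what to fix.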
Methods split along three axes. Data-driven refinement — OpenCodeInterpreter's 68k-pair Code-Feedback dataset shows that training on iterative refinement dialogues transfers to real use. Modality steering — CodeSteer learns when to reason in text versus when to emit code and execute. Socratic debugging — TreeInstruct structures questions, not fixes, as the bug-finding signal; ClarifyGPT pushes the same idea into specification.
Domain-specific branches cover SQL (MMSQL's 6,493 dialogues with EM / EX / Dual Assessment, DySQLBench's dynamic multi-turn text-to-SQL), EHR agents (EHRAgent), and security (MT-Sec adds security-aware scoring to correctness). The newest wave — CONVCODEWORLD, CodeFlowBench, MultiCodeIF — pushes iterative-refinement horizons longer and formalizes fine-grained IF-with-feedback. Which iterative protocol converges fastest, and on what class of problem, remains empirically open.
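The EM / EX distinction that text-to-SQL suites report is worth seeing concretely: exact match compares query strings, execution accuracy compares result sets, and a prediction can pass one without the other. A minimal sketch using an in-memory SQLite database (real harnesses such as MMSQL's Dual Assessment are more involved):

```python
import sqlite3

# Sketch of the two standard text-to-SQL scores: exact match (EM) on
# the normalized query string, and execution accuracy (EX) on result
# sets. Real harnesses normalize more aggressively; this shows the
# core distinction only.

def exact_match(pred, gold):
    norm = lambda q: " ".join(q.lower().split())
    return norm(pred) == norm(gold)

def execution_match(pred, gold, conn):
    try:
        return conn.execute(pred).fetchall() == conn.execute(gold).fetchall()
    except sqlite3.Error:
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "bob")])

pred = "SELECT name FROM users WHERE id = 1"
gold = "SELECT name FROM users WHERE id < 2"
print(exact_match(pred, gold))            # False: different strings
print(execution_match(pred, gold, conn))  # True: same result set
```

The gap between the two scores is precisely why conversational SQL benchmarks report both: follow-up reformulations often produce syntactically different but semantically equivalent queries.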
Try-again / ReAct / Plan-and-Solve; 10-turn coding interaction sandbox.
68k Code-Feedback pairs; closes the gap with proprietary interpreters.
Learned text-vs-code modality switching.
Conversational text-to-SQL with reformulation and follow-ups.
Tree-structured Socratic debugging — questions as the primary signal.
Correctness + security scoring over multi-turn code generation.
See the full coding benchmark audit on the Benchmarks page.
Conversational Engagement — where intent is open and the model plays a consultative role — covers healthcare, education, role-play, and jailbreak.
Explore Conversational Engagement →