Task Family · Part II.B

Multi-Turn Conversational Engagement.

When intent is open, partial, or evolving, the model stops being an instruction-executor and becomes a consultant: doctor, tutor, character, collaborator. Evaluation measures trajectory quality — empathy, pedagogy, proactivity, safety, realism over long horizons — not just final-turn correctness.

CE — Part 1

Overview & general benchmarks

The CE family has its own "trunk" of general benchmarks that pre-date or generalize across domains. Together they trace a methodological arc: from labour-intensive dimensional human evaluation to rubric-scaffolded LLM judging.

The CE lineage starts with ABC-Eval (2023) — 400 human-authored dialogues, each scored on 16 behavioral dimensions (consistency, emotion, understanding, engagingness, grammar, informativeness, quality, proactivity, relevance, among others). ABC-Eval is the gold standard that shows what careful multi-turn evaluation costs: unreproducibly expensive in its original form. BotChat and DialogBench push toward LLM-judge paradigms (UniEval, PairEval, GTEval; DialogBench's bilingual human-likeness comparison) to scale evaluation to hundreds or thousands of dialogues.

The most important 2025 contribution in this space is MultiChallenge, which introduces instance-level rubrics specifically over the hard multi-turn failure modes — retention, recall, revision, and non-sycophancy — and validates judge-human alignment on each. A parallel methodological thread is SimulatorArena, which asks whether user simulators (increasingly common in evaluation pipelines) are reliable stand-ins for real humans, and proposes behavioral-realism checks.
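Instance-level rubric judging of the MultiChallenge kind can be sketched as a thin harness: each dialogue carries its own pass/fail criteria, a judge (stubbed here; in practice an LLM call returning a yes/no verdict) checks each criterion, and the score is the weighted fraction passed. All names and the judge interface below are illustrative assumptions, not MultiChallenge's actual API.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RubricItem:
    criterion: str      # e.g. "recalls the allergy stated in turn 2"
    weight: float = 1.0

def judge_dialogue(dialogue: str,
                   rubric: List[RubricItem],
                   judge: Callable[[str, str], bool]) -> float:
    """Score a dialogue as the weighted fraction of rubric items the judge passes."""
    total = sum(item.weight for item in rubric)
    passed = sum(item.weight for item in rubric if judge(dialogue, item.criterion))
    return passed / total if total else 0.0

# Stub judge: substring match stands in for an LLM yes/no verdict.
def stub_judge(dialogue: str, criterion: str) -> bool:
    return criterion.lower() in dialogue.lower()

rubric = [RubricItem("mentions penicillin allergy"),
          RubricItem("revises dosage", weight=2.0)]
score = judge_dialogue("Patient mentions penicillin allergy; dose unchanged.",
                       rubric, stub_judge)
print(round(score, 2))  # 1 of 3 weight units passed -> 0.33
```

The per-instance rubric is what makes judge-human alignment checkable: each criterion is a concrete yes/no question a human rater can also answer.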

16
ABC-Eval dimensions
Human-judged: consistency, emotion, proactivity, relevance, and twelve more.
4
MultiChallenge failure axes
Retention, recall, revision, non-sycophancy.
9.8k
DialogBench dialogs
Bilingual human-likeness comparison at scale.
~80%
LLM-judge agreement
With human raters — but with self-enhancement and verbosity bias.
CE — Healthcare

Medical consultation dialogue

The most methodologically mature CE sub-domain. Medical dialogue must simulate a doctor: proactive clarification ("chain-of-questioning"), history retention across turns, and grounding in verified medical knowledge.

Medical-LLM development follows three arcs. The SFT-based arc — DISC-MedLLM, BianQue (whose chain-of-questioning explicitly models the proactivity a good clinician practises), the bilingual BiMediX, CPsyCoun and PsycoLLM for mental health, and SMILE, which scales single-turn QA pairs into multi-turn counseling dialogues. The full-pipeline arc — Zhongjing adds pretraining + SFT on CMeKG + RLHF, and HuatuoGPT I & II unify stylistic transfer across dialogue and QA. The DPO-extended arc — Aquila-Med and Qilin-Med — adds preference learning over physician-style responses.

The methodological peak of this domain is Google's AMIE: a doctor agent trained via self-play over chain-of-reasoning dialogues, whose consultation quality matches or exceeds clinicians on several benchmarks. Evaluation has shifted from static QA (MedGPTEval, Liao-2023) to simulation-based (AIE/SAPS, MMD-Eval, MediQ), robustness (MedFuzz), and rubric-driven (HealthBench — 5,000 dialogues and 48k+ physician-authored rubrics).

The 2026 wave sharpens the rubrics further. MedMT-Bench probes long-horizon coherence (22-turn average). CPGBench grades guideline adherence. MedDialBench formalizes differential diagnosis over sustained dialogue. MedDialogRubrics brings large-scale rubric-based evaluation. MEDPI adds accreditation-grade scenarios. MindEval isolates long-horizon mental-health dialogue with dual-annotator (patient + expert) quality scoring. MINT stresses premature-commitment failure modes.

The newest generation of medical benchmarks does not ask "did the model give the right answer?" — it asks "did the model know when not to give an answer yet?"
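The "know when not to answer yet" criterion — the active-inquiry behavior MediQ and MINT probe — reduces to a per-turn decision: ask another clarifying question or commit to an answer. A minimal sketch, with the confidence estimator and question generator stubbed out (a real system would derive both from the model itself):

```python
def consult(confidence_fn, ask_fn, answer_fn, threshold=0.8, max_turns=10):
    """Ask clarifying questions until confidence clears the threshold, then answer.
    Committing while confidence is still low is the premature-commitment failure mode."""
    facts = []
    for turn in range(max_turns):
        if confidence_fn(facts) >= threshold:
            return answer_fn(facts), turn          # committed after `turn` questions
        facts.append(ask_fn(facts))                # gather one more piece of history
    return answer_fn(facts), max_turns             # forced commit at the horizon

# Stub environment: each question reveals one fact; confidence grows with facts.
answer, turns_asked = consult(
    confidence_fn=lambda facts: 0.3 + 0.2 * len(facts),
    ask_fn=lambda facts: f"fact_{len(facts)}",
    answer_fn=lambda facts: f"diagnosis from {len(facts)} facts",
)
print(turns_asked)  # asks until 0.3 + 0.2*k >= 0.8, i.e. after 3 questions
```

Benchmarks like MediQ score exactly this trade-off: answering too early loses accuracy, asking forever loses usability.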

AMIE

Healthcare · 2024

Self-play chain-of-reasoning doctor agent; matches or exceeds clinicians on consultation benchmarks.

HuatuoGPT I/II · Zhongjing

Healthcare

Full pipeline: pretrain + SFT + RLHF; Chinese medical knowledge graph grounding.

MediQ

Healthcare · 2024

Active-inquiry benchmark: tests whether the model commits prematurely.

HealthBench

Healthcare · 2025

5,000 dialogues × 48k+ rubrics, cross-language, physician-authored and annotated.

MedMT-Bench

Healthcare · 2026

22-turn long-horizon coherence, memory, and reasoning over sustained medical dialogue.

MindEval

Mental Health · 2026

Standalone long-horizon mental-health evaluation with patient + expert dual annotation.

CE — Education

Educational dialogue & tutoring

Three pillars: Intelligent Tutoring Systems, Automated Feedback & Grading, and Scenario Simulation. Together they form the most practically deployed CE sub-field — LearnLM and Claude for Education are already in classrooms.

Intelligent tutoring decomposes into Socratic / strategy-guided agents (SocraticLM, TreeInstruct, StratL), adaptive agents (PACE, JeepyTA, CourseAssist), and math-tutoring benchmarks (MathDial, SocraticMATH, MathTutorBench, Levonian-Algebra-Align, Scarlatos-DPO-Tutor). A recurring finding: fine-tuning small models on tutoring dialogues outperforms prompting large models on the same tutoring task.

Automated feedback & grading is where deployed-system research concentrates. Stepwise verification (Daheim 2024), preference-optimized feedback (Scarlatos 2024), iterative revision loops, and grading studies with mixed-but-encouraging evidence: LLM grading reaches ~95% teacher-match on Spanish short-answer items, ~70% within-10% on full-exam scoring, while essay scoring remains weak.
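The teacher-match figures above are simple agreement statistics. A sketch of how "within-10%" full-exam agreement might be computed — the scores below are fabricated for illustration, not taken from the cited studies:

```python
def within_pct_agreement(llm_scores, teacher_scores, max_score, tolerance=0.10):
    """Fraction of items where the LLM score falls within ±tolerance·max_score
    of the teacher's score."""
    band = tolerance * max_score
    hits = sum(1 for l, t in zip(llm_scores, teacher_scores) if abs(l - t) <= band)
    return hits / len(llm_scores)

# Illustrative exam scores out of 100 (made-up values).
llm     = [82, 64, 91, 40, 75]
teacher = [78, 70, 95, 55, 74]
print(within_pct_agreement(llm, teacher, max_score=100))  # 0.8: four of five within ±10
```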

Scenario simulation spans generative students with personalities (SimClass, multi-party classrooms), at-risk cohort simulators, lesson-plan agents, and textbook-to-dialogue transformation (Book2Dial). The 2025-26 wave brings KMP-Bench (key misconception probing), MRBench, TutorBench, SAFETUTORS (safety in tutoring), EduDial (34k tutoring conversations), TeachLM (100k-hour corpus), and ConvoLearn (live learner behavior at scale).

MathDial

Tutor · 2023

Teacher-student dialogues; defines the tutoring multi-turn frame.

SocraticLM

Tutor · 2024

Socratic questioning at scale (35k dialogues, ~5 turns).

EduDial

Tutor · 2026

34,250 tutoring conversations — one of the largest corpora in the domain.

SAFETUTORS

Safety · 2026

Safety probes specifically in tutoring deployments.

TeachLM

Production · 2026

100k-hour corpus of real tutor-student interactions.

ConvoLearn

Production · 2026

Live learner behavior at scale; 2,134 long-form sessions.

CE — Role-Play

Role-play & persona-grounded dialogue

Role-play is the one CE sub-family we organize by technical pathway rather than application domain — because persona techniques transfer across every other domain too (medical role-play, student role-play, jailbreak persona attacks).

The role-play lineage traces the whole LLM-alignment playbook in miniature. ICL-era persona prompting — PersonaLLM, CharacterChat, role-play prompting (which improves not just consistency but also reasoning). SFT-era persona data — PIPPA (1M messages from forums), UltraChat, PRODIGy (movie-profile grounded). Character-specific SFT — ChatHaruhi, CharacterGLM, RoleCraft-GLM, Ditto (self-alignment), and CharacterLLM (Wikipedia-to-dialogue pipelines).

On top of full SFT sit personalization modules: PersonaPKT (persona embeddings), PPlug (user encoders), Neeko (dynamic LoRA), MIDI-Tuning. Then RL for consistency: Shea-Yu Offline RL, COMEDY (compressive memory), Abdulhai-Consistent-Persona.

Evaluation has developed in parallel as a layered ladder: LaMP → CharacterEval → RoleEval → TimeChara (temporal consistency) → InCharacter (psychology-grounded) → RoleInteract (social dynamics) → CROSS (profile synthesis) → SocialBench, RoleLLM, RAIDEN, RMTBench, CharacterBench, RoleMRC, PersonaConvBench, PERSONAMEM. The persistent open problem is long-conversation degradation — persona fidelity decays as context grows.
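Long-conversation degradation can be quantified directly: score persona fidelity per turn with a judge, then compare early and late windows. A sketch under the assumption that a per-turn fidelity score in [0, 1] is available (stubbed below with synthetic drifting scores):

```python
def fidelity_decay(per_turn_scores, window=5):
    """Decay = mean fidelity over the first `window` turns minus the last `window`.
    Positive values mean the persona weakens as context grows."""
    head = per_turn_scores[:window]
    tail = per_turn_scores[-window:]
    return sum(head) / len(head) - sum(tail) / len(tail)

# Stubbed judge output for a 20-turn conversation: fidelity drifting down from 0.95.
scores = [max(0.0, 0.95 - 0.02 * t) for t in range(20)]
print(round(fidelity_decay(scores), 2))  # 0.3: first-5 mean 0.91 vs last-5 mean 0.61
```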

PIPPA

Data · 2023

25,940 conversations averaging 40 turns; foundational persona SFT corpus.

CharacterLLM

Method · 2023

Wikipedia-to-dialogue pipeline; character-grounded SFT at scale.

Neeko

Method · 2024

Dynamic LoRA per persona for multi-character dialogue.

InCharacter

Eval · 2024

Psychology-grounded persona evaluation.

RMTBench

Eval · 2025

Role memory over ~20-turn long-horizon conversations.

PERSONAMEM

Eval · 2025

Personalization memory: does the model remember what you told it earlier?

CE — Jailbreak

Adversarial multi-turn dialogue

Multi-turn dialogue enables attacks that single-turn defences do not anticipate. Two dominant strategies — implicit shift and decomposition — plus a rapidly growing set of human-authored benchmarks that expose auto-eval blind spots.

Single-turn jailbreak attacks — GCG, AutoDAN, Carlini 2024 — evolved into conversational refinement (PAIR, Johnny/PAP with 92%+ ASR on frontier models). From there, two dominant multi-turn strategies emerged.

Implicit shift (Crescendo): begin with a benign seed, gradually pivot toward the target over multiple turns. ActorAttack extends Crescendo by self-discovering clue chains grounded in Latour's actor-network theory — it finds intermediate entities that plausibly bridge benign to harmful. Decomposition: split a harmful request into innocuous sub-requests, then re-aggregate at the last turn (Zhou 2024, MR.JA, RED QUEEN with scenario disguise, Liu-ImposterAI).

Dataset evolution traces AdvBench/HarmBench seeds → cipher / decomposed variants (Gibbs 2024's 6,536-dialog cipher jailbreak) → native multi-turn corpora (MHJ — 537 expert-authored dialogs with a tactic taxonomy, SafeDialBench — 4,053 dialogs across 22 scenarios, MTJ-Bench, FITD, RACE, SEMA, Tempest, Siren). A consistent finding: expert human attackers reach 70%+ ASR on frontier models, even when automated benchmarks report single-digit ASR on the same models.

Defences remain scarce. CoSafe explores CoT-based harm detection, NeMo Guardrails implements guardrails that tend to overblock, and X-Boundary targets the safety-usability boundary explicitly. A darker thread of recent work — Persona Jailbreaking, Echo Chamber, Mastermind — weaponizes the model's own history and persona scaffolding to bypass its later safety checks.

Automated multi-turn jailbreak evaluation systematically under-reports real attack success. Human red-teamers consistently achieve 10-20× higher ASR on the same models.
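The auto-eval blind spot is just two attack-success-rate estimates compared on the same model. A minimal sketch — the counts below are fabricated for illustration, not figures from MHJ or any cited benchmark:

```python
def asr(successes: int, attempts: int) -> float:
    """Attack success rate: fraction of red-team attempts eliciting the target harm."""
    return successes / attempts

# Illustrative counts only (made-up values).
auto_asr  = asr(5, 100)    # automated multi-turn attack pipeline
human_asr = asr(70, 100)   # expert human red-teamers on the same model
print(f"gap: {human_asr / auto_asr:.1f}x")  # gap: 14.0x
```

Reporting the gap alongside the raw automated ASR is what MHJ-style human baselines make possible.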

Crescendo

Attack · 2024

Implicit-shift multi-turn: benign priming → gradual pivot to target.

ActorAttack

Attack · 2024

Self-discovered clue chains via actor-network-theoretic search.

MHJ

Bench · 2024

537 expert-authored human multi-turn jailbreaks; tactic taxonomy; exposes auto-eval gap.

SafeDialBench

Bench · 2025

4,053 dialogues across 22 scenarios with fine-grained safety taxonomy.

CoSafe

Defense · 2024

CoT-based harm detection as a defensive pattern.

X-Boundary

Defense · 2025

Explicit targeting of the safety-usability trade-off boundary.



How the field improves these systems.

The methods review covers the three improvement pillars — model-centric, external integration, and agent-based — spanning every domain on this page.

Go to Methods →