When intent is open, partial, or evolving, the model stops being an instruction-executor and becomes a consultant: doctor, tutor, character, collaborator. Evaluation measures trajectory quality — empathy, pedagogy, proactivity, safety, realism over long horizons — not just final-turn correctness.
The CE family has its own "trunk" of general benchmarks that pre-date the domain-specific splits or generalize across them. Together they trace a methodological arc: from labour-intensive dimensional human evaluation to rubric-scaffolded LLM judging.
The CE lineage starts with ABC-Eval (2023): 400 human-authored dialogues, each scored on 16 behavioral dimensions (including consistency, emotion, understanding, engagingness, grammar, informativeness, quality, proactivity, and relevance). ABC-Eval is the gold standard for what careful multi-turn evaluation costs; in its original form it is too expensive to reproduce. BotChat and DialogBench push toward LLM-judge paradigms (UniEval, PairEval, GTEval, bilingual human-likeness) to scale evaluation to hundreds or thousands of dialogues.
The most important 2025 contribution in this space is MultiChallenge, which introduces instance-level rubrics specifically over the hard multi-turn failure modes — retention, recall, revision, and non-sycophancy — and validates judge-human alignment on each. A parallel methodological thread is SimulatorArena, which asks whether user simulators (increasingly common in evaluation pipelines) are reliable stand-ins for real humans, and proposes behavioral-realism checks.
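The instance-level rubric pattern can be made concrete as a scoring loop: each dialogue carries its own checklist of criteria, and a judge returns a verdict per item. A minimal sketch in the MultiChallenge / HealthBench spirit; `RubricItem`, `rubric_score`, and the toy judge are illustrative, not any benchmark's actual API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RubricItem:
    # One instance-level criterion, e.g. "Recalls the allergy stated in turn 2."
    criterion: str
    weight: float = 1.0

def rubric_score(dialogue: str, rubric: list[RubricItem],
                 judge: Callable[[str, str], bool]) -> float:
    """Weighted fraction of rubric items the judge marks as satisfied.
    In practice `judge` wraps an LLM call that returns a yes/no verdict
    for one (dialogue, criterion) pair."""
    total = sum(item.weight for item in rubric)
    earned = sum(item.weight for item in rubric
                 if judge(dialogue, item.criterion))
    return earned / total if total else 0.0

# Toy stand-in for an LLM judge: a criterion counts as satisfied iff its
# first word appears in the transcript. A real judge reasons over the text.
def toy_judge(dialogue: str, criterion: str) -> bool:
    return criterion.split()[0].lower() in dialogue.lower()
```

Judge-human alignment is then validated per criterion, not per dialogue, which is what makes the rubric paradigm auditable.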
The most methodologically mature CE sub-domain. Medical dialogue requires the model to simulate a doctor: proactive clarification ("chain-of-questioning"), history retention across turns, and grounding in verified medical knowledge.
Medical-LLM development follows three arcs. The SFT arc: DISC-MedLLM; BianQue, whose chain-of-questioning explicitly models the proactivity a good clinician practises; the bilingual BiMediX; CPsyCoun and PsycoLLM for mental health; and SMILE, which scales single-turn QA pairs into multi-turn counseling sessions. The full-pipeline arc: Zhongjing adds pretraining, SFT on CMeKG, and RLHF; HuaTuoGPT I and II unify stylistic transfer across dialogue and QA. The DPO-extended arc: Aquila-Med and Qilin-Med add preference learning over physician-style responses.
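The preference learning in the DPO-extended arc reduces, per (physician-preferred, rejected) response pair, to a loss of the following shape. A generic DPO sketch under my own variable names, not Aquila-Med's or Qilin-Med's actual training code.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair: reward the policy for widening
    its log-prob margin over the frozen reference model in favor of the
    preferred (here, physician-style) response."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form
    return math.log1p(math.exp(-margin))
```

When the policy has not moved from the reference, the margin is zero and the loss sits at log 2; raising the preferred response's log-prob lowers it.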
The methodological peak of this domain is Google's AMIE: a doctor agent trained via self-play over chain-of-reasoning dialogues, whose consultation quality matches or exceeds that of clinicians on several benchmarks. Evaluation has shifted from static QA (MedGPTEval, Liao-2023) to simulation-based evaluation (AIE/SAPS, MMD-Eval, MediQ), robustness testing (MedFuzz), and rubric-driven grading (HealthBench: 5,000 dialogues and 48k+ physician-authored rubrics).
The 2026 wave sharpens the rubrics further. MedMT-Bench probes long-horizon coherence (22-turn average). CPGBench grades guideline adherence. MedDialBench formalizes differential diagnosis over sustained dialogue. MedDialogRubrics brings large-scale rubric-based evaluation. MEDPI adds accreditation-grade scenarios. MindEval isolates long-horizon mental-health dialogue with dual-annotator (patient + expert) quality scoring. MINT stresses premature-commitment failure modes.
The newest generation of medical benchmarks does not ask "did the model give the right answer?" — it asks "did the model know when not to give an answer yet?"
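That distinction, answering versus knowing not to answer yet, reduces to an abstention policy over the model's current differential. A minimal sketch in the spirit of MediQ / MINT, not either benchmark's scoring code; the threshold and question budget are illustrative values.

```python
def should_commit(posterior: dict[str, float],
                  threshold: float = 0.8,
                  questions_asked: int = 0,
                  max_questions: int = 5) -> bool:
    """Commit to a diagnosis only once a single hypothesis dominates the
    differential, or the clarifying-question budget is spent; otherwise
    the agent should keep asking. Premature-commitment failures are
    exactly the cases where this returns True too early."""
    return (max(posterior.values()) >= threshold
            or questions_asked >= max_questions)
```

Benchmarks like MINT then score not the final diagnosis alone but the trajectory of commit/ask decisions that led to it.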
Self-play chain-of-reasoning doctor agent; matches or exceeds clinicians on consultation benchmarks.
Full pipeline: pretrain + SFT + RLHF; Chinese medical knowledge graph grounding.
Active-inquiry benchmark: tests whether the model commits prematurely.
5,000 dialogues × 48k+ rubrics, cross-language, physician-authored and annotated.
22-turn long-horizon coherence, memory, and reasoning over sustained medical dialogue.
Standalone long-horizon mental-health evaluation with patient + expert dual annotation.
Three pillars: Intelligent Tutoring Systems, Automated Feedback & Grading, and Scenario Simulation. Together they form the most practically deployed CE sub-field — LearnLM and Claude for Education are already in classrooms.
Intelligent tutoring decomposes into Socratic / strategy-guided agents (SocraticLM, TreeInstruct, StratL), adaptive agents (PACE, JeepyTA, CourseAssist), and math-tutoring benchmarks (MathDial, SocraticMATH, MathTutorBench, Levonian-Algebra-Align, Scarlatos-DPO-Tutor). A recurring finding: fine-tuning small models on tutoring dialogues outperforms prompting large models on the same task.
Automated feedback & grading is where deployed-system research concentrates. Stepwise verification (Daheim 2024), preference-optimized feedback (Scarlatos 2024), iterative revision loops, and grading studies with mixed but encouraging evidence: LLM grading matches teachers on ~95% of Spanish short-answer items and lands within 10% of the teacher's total on ~70% of full exams, while essay scoring remains weak.
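The two agreement figures above correspond to two simple metrics over paired (model, teacher) scores. A sketch to fix the definitions; the function names are mine, not the cited studies' code.

```python
def exact_match_rate(model: list[float], teacher: list[float]) -> float:
    """Fraction of items where model and teacher scores agree exactly
    (the shape of the ~95% short-answer figure)."""
    return sum(m == t for m, t in zip(model, teacher)) / len(model)

def within_pct_rate(model: list[float], teacher: list[float],
                    pct: float = 0.10) -> float:
    """Fraction of exams where the model's total lands within `pct` of
    the teacher's total (the shape of the ~70% within-10% figure)."""
    return sum(abs(m - t) <= pct * t
               for m, t in zip(model, teacher)) / len(model)
```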
Scenario simulation spans generative students with personalities (SimClass, multi-party classrooms), at-risk cohort simulators, lesson-plan agents, and textbook-to-dialogue transformation (Book2Dial). The 2025-26 wave brings KMP-Bench (key misconception probing), MRBench, TutorBench, SAFETUTORS (safety in tutoring), EduDial (34k tutoring conversations), TeachLM (100k-hour corpus), and ConvoLearn (live learner behavior at scale).
Teacher-student dialogues; defines the tutoring multi-turn frame.
Socratic questioning at scale (35k dialogues, ~5 turns).
34,250 tutoring conversations — one of the largest corpora in the domain.
Safety probes specifically in tutoring deployments.
100k-hour corpus of real tutor-student interactions.
Live learner behavior at scale; 2,134 long-form sessions.
Role-play is the one CE sub-family we organize by technical pathway rather than application domain — because persona techniques transfer across every other domain too (medical role-play, student role-play, jailbreak persona attacks).
The role-play lineage traces the whole LLM-alignment playbook in miniature. ICL-era persona prompting — PersonaLLM, CharacterChat, role-play prompting (which improves not just consistency but also reasoning). SFT-era persona data — PIPPA (1M messages from forums), UltraChat, PRODIGy (movie-profile grounded). Character-specific SFT — ChatHaruhi, CharacterGLM, RoleCraft-GLM, Ditto (self-alignment), and CharacterLLM (Wikipedia-to-dialogue pipelines).
On top of full SFT sit personalization modules: PersonaPKT (persona embeddings), PPlug (user encoders), Neeko (dynamic LoRA), MIDI-Tuning. Then RL for consistency: Shea-Yu Offline RL, COMEDY (compressive memory), Abdulhai-Consistent-Persona.
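The dynamic-LoRA idea (as in Neeko) keeps one low-rank adapter pair per persona and swaps only that pair at inference, leaving the base weights frozen. A numpy sketch of the shape of the mechanism, not Neeko's implementation; sizes and persona names are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 2          # hidden size, LoRA rank (illustrative)

W = rng.normal(size=(d, d))   # frozen base projection, shared by all personas

# One low-rank (A, B) pair per persona. B starts at zero, so each adapter
# initially contributes nothing (standard LoRA initialization).
adapters = {
    "detective": (rng.normal(size=(r, d)), np.zeros((d, r))),
    "tutor":     (rng.normal(size=(r, d)), np.zeros((d, r))),
}

def forward(x: np.ndarray, persona: str, alpha: float = 1.0) -> np.ndarray:
    """y = x W^T + alpha * (x A^T) B^T: switching persona swaps only the
    tiny (A, B) pair, never the frozen base weight."""
    A, B = adapters[persona]
    return x @ W.T + alpha * (x @ A.T) @ B.T
```

The appeal for multi-character dialogue is that per-persona state is r*(2d) parameters per layer rather than a full model copy.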
Evaluation has developed in parallel as a layered ladder: LaMP → CharacterEval → RoleEval → TimeChara (temporal consistency) → InCharacter (psychology-grounded) → RoleInteract (social dynamics) → CROSS (profile synthesis) → SocialBench, RoleLLM, RAIDEN, RMTBench, CharacterBench, RoleMRC, PersonaConvBench, PERSONAMEM. The persistent open problem is long-conversation degradation — persona fidelity decays as context grows.
25,940 conversations averaging 40 turns; foundational persona SFT corpus.
Wikipedia-to-dialogue pipeline; character-grounded SFT at scale.
Dynamic LoRA per persona for multi-character dialogue.
Psychology-grounded persona evaluation.
Role memory over ~20-turn long-horizon conversations.
Personalization memory: does the model remember what you told it earlier?
Multi-turn dialogue enables attacks that single-turn defences do not anticipate. Two dominant strategies — implicit shift and decomposition — plus a rapidly growing set of human-authored benchmarks that expose auto-eval blind spots.
Single-turn jailbreak attacks — GCG, AutoDAN, Carlini 2024 — evolved into conversational refinement (PAIR, Johnny/PAP with 92%+ ASR on frontier models). From there, two dominant multi-turn strategies emerged.
Implicit shift (Crescendo): begin with a benign seed, gradually pivot toward the target over multiple turns. ActorAttack extends Crescendo by self-discovering clue chains grounded in Latour's actor-network theory: it finds intermediate entities that plausibly bridge the benign opening to the harmful target. Decomposition: split a harmful request into innocuous sub-requests, then re-aggregate at the last turn (zhou2024speak, MR.JA, RED QUEEN with scenario disguise, Liu-ImposterAI).
Dataset evolution traces AdvBench/HarmBench seeds → cipher / decomposed variants (Gibbs 2024's 6,536-dialogue cipher jailbreak) → native multi-turn corpora (MHJ: 537 expert-authored dialogues with a tactic taxonomy; SafeDialBench: 4,053 dialogues across 22 scenarios; MTJ-Bench, FITD, RACE, SEMA, Tempest, Siren). A consistent finding: expert human attackers reach 70%+ ASR on frontier models, even when automated benchmarks report single-digit ASR on the same models.
Defenses remain scarce. CoSafe explores CoT-based harm detection, NeMoGuardrails implements guardrails that overblock, and X-Boundary targets the safety-usability boundary explicitly. A darker thread of recent work — Persona Jailbreaking, Echo Chamber, Mastermind — weaponizes the model's own history and persona scaffolding to bypass its later safety checks.
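The defensive pattern CoSafe explores, judging the whole conversation rather than the latest turn, can be sketched as a guard around generation. The interface, the verdict format, and the refusal string are placeholders of my own, not CoSafe's actual code.

```python
from typing import Callable

def guarded_reply(history: list[str], user_turn: str,
                  judge: Callable[[str], str],
                  generate: Callable[[str], str]) -> str:
    """Run a chain-of-thought harm judgment over the WHOLE transcript
    before answering, so a gradual pivot (Crescendo-style) is evaluated
    in context rather than turn-by-turn. `judge` and `generate` would be
    LLM calls in practice."""
    transcript = "\n".join(history + [user_turn])
    verdict = judge(transcript)          # reasoning call, not a keyword filter
    if verdict.strip().lower().startswith("harmful"):
        return "I can't help with that."
    return generate(transcript)
```

The design point is where the judgment runs: a per-turn filter sees only benign fragments, while a transcript-level judge sees the trajectory that the implicit-shift and decomposition attacks rely on.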
Implicit-shift multi-turn: benign priming → gradual pivot to target.
Self-discovered clue chains via actor-network-theoretic search.
537 expert-authored multi-turn jailbreaks; tactic taxonomy; exposes the auto-eval gap.
4,053 dialogues across 22 scenarios with fine-grained safety taxonomy.
CoT-based harm detection as a defensive pattern.
Explicit targeting of the safety-usability trade-off boundary.
The methods review covers the three improvement pillars — model-centric, external integration, and agent-based — spanning every domain on this page.
Go to Methods →