Real conversations are not single prompts. This survey maps how large language models handle extended, task-oriented dialogue — from math tutoring to medical consultation, from code refinement to adversarial jailbreaks — and distills the benchmarks, methods, and open challenges that define the frontier.
Single-turn benchmarks optimize for a task that rarely exists in production. Real users clarify, backtrack, drift, and change their minds — and models that dominate one-shot leaderboards routinely collapse when those dynamics compound.
Over the last three years, multi-turn LLM research has fragmented into a constellation of sub-efforts: benchmark papers proliferating faster than any reader can absorb, method papers often measured against incomparable tasks, and domain-specific deployments (medical, educational, safety) evolving on their own protocols. The field lacks a shared frame.
This survey proposes one. We argue that real-world multi-turn interaction is best organized by task family — not by capability, not by method — because reasoning, memory, adaptation, and safety don't operate in isolation. They co-activate around a concrete conversational goal: tutor this student through a derivation, diagnose this patient without committing prematurely, help this developer debug this module. The task is the unit of analysis.
From that frame, we derive a two-family split — instruction following (explicit intent, execution precision) and conversational engagement (open intent, consultative dialogue) — and survey the benchmarks, methods, and open problems that cohere within and across them. The pages below correspond to each part of the paper and provide progressively more structure: dedicated task-family pages, a searchable benchmark explorer, and cross-cutting views of improvements and challenges.
We position the paper against capability-oriented surveys (Zhang et al., 2025), method-oriented surveys (Zhang et al., 2023), and domain surveys in healthcare — keeping what is valuable in each while re-organizing the field around the task families that best reflect real deployment.
The primary organization is task family (IF / CE) crossed with domain, rather than capability buckets. We give each family enough detail for readers to locate their own work and understand what distinguishes it from neighboring families.
Start with Instruction Following →

For every benchmark we record dialogue count, average turns, whether the data is human- or LLM-curated, the evaluation protocol (rule / human-judge / LLM-judge), and the rubric dimensions, so readers can compare benchmarks like-for-like.
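The per-benchmark record described above could be modeled, for illustration, as a small schema. The field names here are hypothetical, not the survey's actual data format:

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkRecord:
    """Illustrative sketch of the like-for-like benchmark record.

    Field names are invented for this example; the survey's explorer
    may organize the same information differently.
    """
    name: str
    dialogue_count: int
    avg_turns: float
    curation: str                              # "human" or "llm"
    eval_protocol: str                         # "rule", "human-judge", or "llm-judge"
    rubric_dimensions: list[str] = field(default_factory=list)

# Example entry; the numbers are made up for illustration.
example = BenchmarkRecord(
    name="ToyBench",
    dialogue_count=500,
    avg_turns=6.2,
    curation="human",
    eval_protocol="llm-judge",
    rubric_dimensions=["instruction adherence", "context retention"],
)
```

Recording curation source and evaluation protocol alongside the raw counts is what makes rule-scored and LLM-judged benchmarks comparable at a glance.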
Explore benchmarks →

Model-centric (ICL, SFT, multi-turn RL, new architectures), external integration (memory, RAG, KGs), and agent-based approaches, each reviewed with the multi-turn angle that prior surveys miss.
Read the methods review →

Context understanding, complex reasoning across turns, adaptation and learning, evaluation, and ethical & safety issues, with concrete examples of failure and the research threads working on each.
See open challenges →

Instruction following tasks are judged by precision of execution: the user's intent is explicit. Conversational engagement tasks are judged by consultative competence: the user's intent is open, evolving, or only partially specified. Every domain below is surveyed on its own page.
Explicit intent. Judged on precision of adherence and execution across turns. Failure modes: context decay, sycophancy, poor clarification policy, forgetting distributed constraints.
Open intent, consultative role. Requires proactive information seeking, synthesis across topics, use of external knowledge and tools across sustained dialogue.
The improvement landscape clusters into three complementary strategies. Most real deployments combine all three — an SFT'd domain model with retrieval over a medical KG and an agent-level clarification policy.
Directly refine the LLM to handle sequential dialogue dynamics — in-context learning, SFT on realistic multi-turn data, multi-turn RL (DMPO, M-DPO, ArCHer), and memory-aware architectures (Recurrent Memory Transformer, HMT, RWKV).
Leverage resources outside the model: episodic memory (LongMemEval, MemTree, HyperMem), retrieval augmentation (MTRAG, CORAL, RAD-Bench, RAGate), knowledge graphs (SURGE, D-SMART, GAP).
Treat the LLM as part of a larger, proactive system: ReAct / Reflexion / Toolformer single agents; role-based multi-agent (CAMEL, ChatDev, MetaGPT); debate and dynamic composition (AutoAgents, AgentVerse).
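As a toy illustration of the external-integration strategy above (not a sketch of any of the named systems), a minimal episodic memory store might score past utterances by token overlap with the current query. Real systems use dense embeddings; token overlap is a deliberate simplification:

```python
def retrieve(memory: list[str], query: str, k: int = 2) -> list[str]:
    """Return the k stored utterances sharing the most tokens with the query.

    Naive by design: this stands in for the dense-embedding retrieval
    used by actual memory-augmented and RAG systems.
    """
    q_tokens = set(query.lower().split())
    scored = sorted(
        memory,
        key=lambda m: len(q_tokens & set(m.lower().split())),
        reverse=True,
    )
    return scored[:k]

# Hypothetical dialogue memory accumulated over earlier turns
history = [
    "User prefers metric units",
    "User is debugging a Rust borrow-checker error",
    "User asked for vegetarian recipes",
]
print(retrieve(history, "convert the recipe to metric units", k=1))
# -> ['User prefers metric units']
```

The point of the sketch is the division of labor: the model stays fixed while the external store decides which past turns re-enter the context window.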
The final third of the survey maps the open problems — not as a grab-bag list but as a structured diagnosis of where in the dialogue lifecycle things still break, and which current research threads are working on each failure.
Degradation over distance; anaphora / ellipsis; clarification policy; the "lost in conversation" problem.
Error propagation; topic switching; proactive information seeking; multilingual / code-switching contexts.
Dynamic preference adaptation; persistent knowledge update; robustness to adversarial trajectories and drifting misinformation.
Scalable data curation; turn-level rubrics vs. trajectory-level metrics; LLM-judge biases (verbosity, self-enhancement).
Bias amplification over turns; privacy leakage via iterative probes; human–AI overtrust, emotional dependence, companion-risk.
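The evaluation tension noted above, turn-level rubrics versus trajectory-level metrics, can be made concrete with a hypothetical scoring example. Both metrics here are invented for illustration; surveyed benchmarks define their own:

```python
def turn_level(scores: list[float]) -> float:
    """Mean of per-turn judge scores: every turn weighted equally."""
    return sum(scores) / len(scores)

def trajectory_level(scores: list[float]) -> float:
    """One illustrative trajectory metric: the worst turn dominates,
    reflecting that a single bad turn can derail a whole dialogue."""
    return min(scores)

# A dialogue that looks fine turn-by-turn but collapses once, mid-conversation
scores = [1.0, 0.9, 0.1, 0.9, 1.0]
print(turn_level(scores))        # mean is approximately 0.78
print(trajectory_level(scores))  # 0.1
```

The same trajectory scores 0.78 under turn averaging and 0.1 under the worst-turn view, which is why the two rubric levels can rank systems differently.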
Each challenge gets its full story: where it breaks, what the community is trying, and which benchmarks test it.
Open the roadmap →

If this survey or its companion website has helped your research, please cite us. The BibTeX record below is also available on the Cite page with one-click copy.
@article{li2025beyond,
  title   = {Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models},
  author  = {Li, Yubo and Shen, Xiaobin and Yao, Xinyu and Ding, Xueying and
             Miao, Yidi and Krishnan, Ramayya and Padman, Rema},
  journal = {arXiv preprint arXiv:2504.04717},
  year    = {2025}
}