Real conversations are not single prompts. This survey maps how large language models handle extended, task-oriented dialogue — from math tutoring to medical consultation, from code refinement to adversarial jailbreaks — and distills the benchmarks, methods, and open challenges that define the frontier.
Single-turn benchmarks optimize for a task that rarely exists in production. Real users clarify, backtrack, drift, and change their minds — and models that dominate one-shot leaderboards routinely collapse when those dynamics compound.
Over the last three years, multi-turn LLM research has fragmented into a constellation of sub-efforts: benchmark papers proliferating faster than any reader can absorb, method papers often measured against incomparable tasks, and domain-specific deployments (medical, educational, safety) evolving on their own protocols. The field lacks a shared frame.
This survey proposes one. We argue that real-world multi-turn interaction is best organized by task family — not by capability, not by method — because reasoning, memory, adaptation, and safety don't operate in isolation. They co-activate around a concrete conversational goal: tutor this student through a derivation, diagnose this patient without committing prematurely, help this developer debug this module. The task is the unit of analysis.
From that frame, we derive a two-family split — instruction following (explicit intent, execution precision) and conversational engagement (open intent, consultative dialogue) — and survey the benchmarks, methods, and open problems that cohere within and across them. The pages below correspond to each part of the paper and provide progressively more structure: dedicated task-family pages, a searchable benchmark explorer, and cross-cutting views of improvements and challenges.
We position the paper against capability-oriented surveys (Zhang et al., 2025), method-oriented surveys (Zhang et al., 2023), and domain surveys in healthcare — keeping what is valuable in each while re-organizing the field around the task families that best reflect real deployment.
The primary organization is task family (IF / CE) crossed with domain, rather than capability buckets. We give each family enough detail for readers to locate their own work and understand what distinguishes it from neighboring families.
Start with Instruction Following →

For every benchmark we record dialogue count, average turns, whether the data is human- or LLM-curated, the evaluation protocol (rule / human-judge / LLM-judge), and the rubric dimensions, so readers can compare benchmarks like-for-like.
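The per-benchmark record described above could be modeled, for illustration, as a small schema. The field names here are hypothetical, not the survey's actual data format:

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkRecord:
    """Illustrative sketch of the like-for-like benchmark record.

    Field names are invented for this example; the survey's explorer
    may organize the same information differently.
    """
    name: str
    dialogue_count: int
    avg_turns: float
    curation: str                              # "human" or "llm"
    eval_protocol: str                         # "rule", "human-judge", or "llm-judge"
    rubric_dimensions: list[str] = field(default_factory=list)

# Example entry; the numbers are made up for illustration.
example = BenchmarkRecord(
    name="ToyBench",
    dialogue_count=500,
    avg_turns=6.2,
    curation="human",
    eval_protocol="llm-judge",
    rubric_dimensions=["instruction adherence", "context retention"],
)
```

Recording curation source and evaluation protocol alongside the raw counts is what makes rule-scored and LLM-judged benchmarks comparable at a glance.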
Explore benchmarks →

Model-centric (ICL, SFT, multi-turn RL, new architectures), external integration (memory, RAG, KGs), and agent-based approaches, each reviewed with the multi-turn angle that prior surveys miss.
Read the methods review →

Context understanding, complex reasoning across turns, adaptation and learning, evaluation, and ethical & safety issues, with concrete examples of failure and the research threads working on each.
See open challenges →

Instruction following tasks are judged by precision of execution: the user's intent is explicit. Conversational engagement tasks are judged by consultative competence: the user's intent is open, evolving, or only partially specified. Every domain below is surveyed on its own page.
Explicit intent. Judged on precision of adherence and execution across turns. Failure modes: context decay, sycophancy, poor clarification policy, forgetting distributed constraints.
Open intent, consultative role. Requires proactive information seeking, synthesis across topics, use of external knowledge and tools across sustained dialogue.
The improvement landscape clusters into three complementary strategies. Most real deployments combine all three — an SFT'd domain model with retrieval over a medical KG and an agent-level clarification policy.
Directly refine the LLM to handle sequential dialogue dynamics — in-context learning, SFT on realistic multi-turn data, multi-turn RL (DMPO, M-DPO, ArCHer), and memory-aware architectures (Recurrent Memory Transformer, HMT, RWKV).
Leverage resources outside the model: episodic memory (LongMemEval, MemTree, HyperMem), retrieval augmentation (MTRAG, CORAL, RAD-Bench, RAGate), knowledge graphs (SURGE, D-SMART, GAP).
Treat the LLM as part of a larger, proactive system: ReAct / Reflexion / Toolformer single agents; role-based multi-agent (CAMEL, ChatDev, MetaGPT); debate and dynamic composition (AutoAgents, AgentVerse).
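As a toy illustration of the external-integration strategy above (not a sketch of any of the named systems), a minimal episodic memory store might score past utterances by token overlap with the current query. Real systems use dense embeddings; token overlap is a deliberate simplification:

```python
def retrieve(memory: list[str], query: str, k: int = 2) -> list[str]:
    """Return the k stored utterances sharing the most tokens with the query.

    Naive by design: this stands in for the dense-embedding retrieval
    used by actual memory-augmented and RAG systems.
    """
    q_tokens = set(query.lower().split())
    scored = sorted(
        memory,
        key=lambda m: len(q_tokens & set(m.lower().split())),
        reverse=True,
    )
    return scored[:k]

# Hypothetical dialogue memory accumulated over earlier turns
history = [
    "User prefers metric units",
    "User is debugging a Rust borrow-checker error",
    "User asked for vegetarian recipes",
]
print(retrieve(history, "convert the recipe to metric units", k=1))
# -> ['User prefers metric units']
```

The point of the sketch is the division of labor: the model stays fixed while the external store decides which past turns re-enter the context window.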
The final third of the survey maps the open problems — not as a grab-bag list but as a structured diagnosis of where in the dialogue lifecycle things still break, and which current research threads are working on each failure.
Degradation over distance; anaphora / ellipsis; clarification policy; the "lost in conversation" problem.
Error propagation; topic switching; proactive information seeking; multilingual / code-switching contexts.
Dynamic preference adaptation; persistent knowledge update; robustness to adversarial trajectories and drifting misinformation.
Scalable data curation; turn-level rubrics vs. trajectory-level metrics; LLM-judge biases (verbosity, self-enhancement).
Bias amplification over turns; privacy leakage via iterative probes; human–AI overtrust, emotional dependence, companion-risk.
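The evaluation tension noted above, turn-level rubrics versus trajectory-level metrics, can be made concrete with a hypothetical scoring example. Both metrics here are invented for illustration; surveyed benchmarks define their own:

```python
def turn_level(scores: list[float]) -> float:
    """Mean of per-turn judge scores: every turn weighted equally."""
    return sum(scores) / len(scores)

def trajectory_level(scores: list[float]) -> float:
    """One illustrative trajectory metric: the worst turn dominates,
    reflecting that a single bad turn can derail a whole dialogue."""
    return min(scores)

# A dialogue that looks fine turn-by-turn but collapses once, mid-conversation
scores = [1.0, 0.9, 0.1, 0.9, 1.0]
print(turn_level(scores))        # mean is approximately 0.78
print(trajectory_level(scores))  # 0.1
```

The same trajectory scores 0.78 under turn averaging and 0.1 under the worst-turn view, which is why the two rubric levels can rank systems differently.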
Each challenge gets its full story: where it breaks, what the community is trying, and which benchmarks test it.
Open the roadmap →

If this survey or its companion website has helped your research, please cite us. The BibTeX record below is also available on the Cite page with one-click copy.
@article{li2025beyond,
  title   = {Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models},
  author  = {Li, Yubo and Shen, Xiaobin and Yao, Xinyu and Ding, Xueying and
             Miao, Yidi and Krishnan, Ramayya and Padman, Rema},
  journal = {arXiv preprint arXiv:2504.04717},
  year    = {2025}
}