We organize the improvement literature into three complementary strategies. They are rarely mutually exclusive — a deployed medical agent typically combines domain SFT, retrieval over a medical KG, and an agent-level clarification policy. But each pillar has its own internal structure, and understanding them separately clarifies what a given method actually contributes.
Directly refine the LLM to handle sequential dialogue dynamics. Four sub-families: in-context learning, supervised fine-tuning, multi-turn RL, and new architectures designed for long context.
In-context learning (ICL). The simplest improvement path: shape the prompt. Surveys include Dong et al. (2022) and the extensive Chain-of-Thought literature (Wei 2022). An interesting multi-turn finding is that exemplar-based ICL sometimes hurts: AQA-Bench observes that adding dialogue examples distracts the model from the actual task. Recent work moves beyond raw-history prompts to structural prompting (GraphIF) and state-update prompting (Liu 2025), both of which explicitly represent the evolving dialogue state.
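As a toy illustration of the state-update idea (the state schema and the extraction heuristic below are invented for illustration, not taken from the cited papers), a system can maintain a compact dialogue state and render that into the prompt instead of replaying the raw transcript:

```python
# Sketch of state-update prompting: keep a structured state, re-render
# it each turn. A real system would use the LLM itself to extract state.

def update_state(state: dict, user_turn: str) -> dict:
    """Toy updater: record the latest goal; treat 'Actually...' turns
    as revisions (constraints) rather than new goals."""
    state = dict(state)
    state.setdefault("constraints", [])
    if user_turn.lower().startswith("actually"):
        state["constraints"].append(user_turn)
    else:
        state["goal"] = user_turn
    return state

def render_prompt(state: dict, user_turn: str) -> str:
    """Render the structured state, not the raw history."""
    lines = [f"Current goal: {state.get('goal', 'unknown')}"]
    for c in state.get("constraints", []):
        lines.append(f"Constraint: {c}")
    lines.append(f"User: {user_turn}")
    return "\n".join(lines)

state = {}
for turn in ["Book a flight to Oslo", "Actually, make it a morning flight"]:
    state = update_state(state, turn)
prompt = render_prompt(state, "Any window seats?")
print(prompt)
```

The point of the pattern is that the prompt stays short and explicit as the horizon grows, whereas a raw transcript grows linearly and buries revisions.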
Supervised fine-tuning (SFT). The workhorse of multi-turn improvement. Two themes dominate. First, realistic data curation: Vicuna, UltraChat, PlatoLM, Parrot, and the newest wave of ConsistentChat (Chen 2025), ReSURE (Du 2025), data selection (Li MDS 2026), PERT (Ma 2025), and DocTalk (Lee 2025). Second, optimized training: loss weighting, FlashAttention, and multi-query attention, with method work extending into steering (CodeSteer). In domain work, SFT is overwhelmingly the default: every medical LLM in the survey (DISC-MedLLM, Zhongjing, BiMediX, HuatuoGPT, CPsyCoun) and most educational tutors rely on domain SFT.
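One common training-side trick is turn-level loss weighting: mask non-assistant tokens entirely and weight assistant turns unevenly. A pure-Python sketch (the specific weighting scheme is illustrative, not taken from any one paper; real code would operate on tensors):

```python
# Toy turn-weighted SFT loss over per-token negative log-likelihoods.

def weighted_nll(token_losses, turn_ids, turn_weights):
    """token_losses[i]: per-token NLL; turn_ids[i]: assistant turn the
    token belongs to, or None for user/prompt tokens (masked out)."""
    total, denom = 0.0, 0.0
    for loss, tid in zip(token_losses, turn_ids):
        if tid is None:          # mask user/prompt tokens entirely
            continue
        w = turn_weights[tid]
        total += w * loss
        denom += w
    return total / denom

losses  = [2.0, 1.0, 0.5, 0.25]
turns   = [None, 0,   1,   1]    # first token is prompt, rest assistant
weights = {0: 0.5, 1: 1.0}       # upweight the final assistant turn
print(weighted_nll(losses, turns, weights))
```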
Reinforcement learning. Single-turn foundations — PPO (Ouyang 2022), DPO (Rafailov 2023), CaPO (Sun 2024), ACT (Chen 2024) — extend to genuine multi-turn RL in DMPO (Shi 2024), M-DPO / M-KTO (Xiong 2024), ArCHer (Zhou 2024), SCoRe (Kumar 2024), and REFUEL (Gao 2024). Multi-turn RL benchmarks — LMRL-Gym (Abdulhai 2023), SWEET-RL / ColBench (Zhou 2025) — make systematic evaluation possible. Turn-level control enters with ITPO (Wang 2026) and MTSA (Guo 2025).
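For concreteness, the single-turn DPO objective that the multi-turn variants build on fits in a few lines (the log-probabilities below are made-up numbers; multi-turn methods like DMPO change the credit assignment, not this core form):

```python
import math

def dpo_loss(lp_w, lp_l, ref_w, ref_l, beta=0.1):
    """DPO loss on sequence log-probs (Rafailov 2023).
    lp_*: policy log-prob of chosen/rejected response;
    ref_*: same quantities under the frozen reference model."""
    margin = beta * ((lp_w - ref_w) - (lp_l - ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid

# Chosen response improved relative to reference, rejected got worse,
# so the margin is positive and the loss drops below log 2:
print(dpo_loss(lp_w=-10.0, lp_l=-14.0, ref_w=-12.0, ref_l=-12.0))
```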
New architectures. Long-context architectures with memory or state carry-over: Cached Transformers (Zhang 2024), MemBART (Wu 2023), Transformer-XL (Dai 2019), Recurrent Memory Transformer (Bulatov 2022), HMT (He 2024), and RWKV (Peng 2023, Pan 2025). These differ from RAG or external memory: the state is inside the model.
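The segment-level recurrence behind models like the Recurrent Memory Transformer can be caricatured as follows (the arithmetic is a stand-in for the real attention blocks; nothing here is the actual architecture):

```python
# Toy segment-level recurrence: a fixed-size memory is updated as each
# segment is processed and carried forward to the next segment.

def process_segment(memory, segment):
    """'Read' the segment conditioned on memory, then 'write' a new
    memory. Both steps are simple averages, purely illustrative."""
    seg_summary = sum(segment) / len(segment)
    output = [x + memory for x in segment]       # memory conditions output
    new_memory = 0.5 * memory + 0.5 * seg_summary
    return output, new_memory

memory = 0.0
for segment in [[1.0, 3.0], [5.0, 7.0]]:         # one long input, chunked
    out, memory = process_segment(memory, segment)
print(memory)
```

The key property the sketch preserves: the second segment's output depends on the first segment only through the fixed-size memory, so compute stays constant per segment regardless of total history length.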
In-context learning: prompt-based adaptation; structural / state-update prompts beat raw history on long horizons.
Supervised fine-tuning: data curation plus optimized training; dominant in domain work (medical, educational, roleplay).
Multi-turn RL: DMPO, M-DPO/M-KTO, ArCHer, SCoRe, REFUEL — with benchmarks LMRL-Gym and SWEET-RL / ColBench.
New architectures: memory-aware / recurrent / state-carrying (RMT, HMT, RWKV, Cached Transformers).
Give the model resources outside its weights: episodic memory, retrieval, knowledge graphs. These methods shine precisely where model-centric ones struggle — very long horizons, fact-dense domains, personalization.
Memory-augmented methods. The newest sub-field inside this pillar. MemPrompt (Madaan 2022) pioneered it; LongMemEval (Wu 2024) formalized evaluation. Modern work includes episodic memory (Pink 2025 position paper), MemTree (Rezazadeh 2024), RMM (Tan 2025), HyperMem (Yue 2026), HingeMem (Zhong 2026), TSUBASA (Zhang 2026), and PALACE (Liu 2025). What distinguishes these from RAG is that memory is accumulated across sessions, not just retrieved from a fixed corpus.
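A minimal sketch of that cross-session distinction, assuming an invented keyword-overlap store (the cited systems use learned retrieval and tree- or graph-structured stores, not this):

```python
# Cross-session episodic memory: entries written in earlier sessions
# persist and are recalled by relevance in later ones.

class EpisodicMemory:
    def __init__(self):
        self.entries = []                       # persists across sessions

    def write(self, session_id, text):
        self.entries.append((session_id, text))

    def recall(self, query, k=2):
        """Rank stored entries by naive token overlap with the query."""
        q = set(query.lower().split())
        scored = sorted(
            self.entries,
            key=lambda e: -len(q & set(e[1].lower().split())),
        )
        return [text for _, text in scored[:k]]

mem = EpisodicMemory()
mem.write("s1", "user is allergic to penicillin")
mem.write("s1", "user prefers concise answers")
mem.write("s2", "user is training for a marathon")
print(mem.recall("is the user allergic to any medication", k=1))
```

Unlike RAG over a fixed corpus, the store itself grows with each session, so recall quality depends on what earlier sessions chose to write.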
Retrieval-augmented generation (RAG). Foundations in Lewis (2020) and Wizard-of-Wikipedia (Dinan 2019); DPR (Karpukhin 2020) and BlenderBot 2.0 broadened retrieval to dialogue. Multi-turn benchmarks now include MTRAG (Katsis 2025), CORAL (Cheng 2024), and RAD-Bench (Kuo 2025). The most interesting development is adaptive retrieval: RAGate (Wang 2024) learns when to retrieve at all, DH-RAG (Zhang 2025) adapts retrieval to dialogue history, and CID-GraphRAG (Zhu 2025) and Alushi (2026) compare retrieval strategies.
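A toy version of such a gate (the features and threshold below are invented; RAGate learns this decision rather than hand-coding it):

```python
# Per-turn retrieval gate: decide whether the current turn needs
# external knowledge before paying the retrieval cost.

def needs_retrieval(turn: str, threshold: float = 0.5) -> bool:
    """Cheap heuristic stand-in for a learned gate: factual-question
    cues raise the score, chit-chat cues lower it."""
    score = 0.0
    lowered = turn.lower()
    if "?" in turn:
        score += 0.3
    if any(w in lowered for w in ("who", "when", "where", "which", "dosage")):
        score += 0.4
    if any(w in lowered for w in ("thanks", "hello", "great")):
        score -= 0.5
    return score >= threshold

print(needs_retrieval("When was DPR published?"))   # factual: gate opens
print(needs_retrieval("Thanks, that helps!"))       # chit-chat: gate stays shut
```

The design point is that in multi-turn dialogue most turns do not need retrieval, so a reliable gate saves latency and avoids injecting irrelevant passages into the context.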
Knowledge graph integration. GNN and KG-embedding methods for conversational augmentation (Wang 2024, Jain 2024, Gao 2025); dialogue-centered KGs such as SURGE (Kang 2023) and Paths-over-Graph (Tan 2025). Graph-RAG variants include GARLIC (Wang 2024), GNN-RAG (Mavromatis 2024), and pseudo-graph retrieval (Yang 2025). And dialogue-conditioned KG methods: attachment (Sheikhi 2025), D-SMART (Lei 2025), and GAP (Zhong 2025), the latter bridging graph reasoning with dialogue plan-execute loops.
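At its simplest, the path-based flavor of these methods reduces to a search over triples. A BFS sketch with an invented mini-graph (real systems score and prune paths with learned components):

```python
from collections import deque

def find_path(edges, start, goal):
    """BFS over (head, relation, tail) triples; returns the relation
    path from start to goal, or None if unreachable."""
    adj = {}
    for h, r, t in edges:
        adj.setdefault(h, []).append((r, t))
    queue, seen = deque([(start, [])]), {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for r, t in adj.get(node, []):
            if t not in seen:
                seen.add(t)
                queue.append((t, path + [r]))
    return None

kg = [
    ("ibuprofen", "treats", "fever"),
    ("ibuprofen", "interacts_with", "warfarin"),
    ("warfarin", "treats", "thrombosis"),
]
print(find_path(kg, "ibuprofen", "thrombosis"))
```

The recovered relation path can then be verbalized into the prompt as grounded evidence, which is the step where these methods plug into the dialogue loop.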
Memory augmentation: LongMemEval, MemTree, HyperMem, HingeMem, TSUBASA, PALACE — memory accumulating across sessions.
RAG: MTRAG, CORAL, RAD-Bench, plus adaptive variants (RAGate, DH-RAG, CID-GraphRAG).
Knowledge graphs: SURGE, Paths-over-Graph, GARLIC, D-SMART, GAP — graph reasoning integrated with dialogue.
Memory, retrieval, and knowledge-graph methods are converging: the 2025-26 wave shows hybrid systems that retrieve from graphs, store in episodic memory, and dispatch retrieval decisions dynamically during dialogue.
The LLM becomes a component in a larger, proactive system: tool use, planning, memory control, or multi-agent coordination. The survey treats agents as adjacent scope, but they are essential for understanding how the best multi-turn systems are actually built.
Single-agent approaches. The canonical primitives: ReAct (Yao 2023) interleaves reasoning and acting; Toolformer (Schick 2023) teaches the model when to call tools; HuggingGPT (Shen 2023) coordinates specialist models; Reflexion (Shinn 2023) adds self-critique; and Voyager (Wang 2023) is the flagship open-ended agent demonstration. Evaluation lives in AgentBench (Liu 2023).
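The ReAct pattern itself is a short loop; a stand-in sketch with a scripted policy and one invented tool (no real LLM involved):

```python
# Bare-bones ReAct-style loop: the policy alternates between acting
# (calling a tool, observing the result) and finishing with an answer.

TOOLS = {
    "lookup": lambda q: {"capital of norway": "Oslo"}.get(q.lower(), "unknown"),
}

def scripted_model(observations):
    """Stand-in policy: act first, then answer from the observation."""
    if not observations:
        return ("act", "lookup", "capital of Norway")
    return ("finish", observations[-1], None)

def react_loop(max_steps=4):
    observations = []
    for _ in range(max_steps):
        kind, a, b = scripted_model(observations)
        if kind == "finish":
            return a                      # final answer
        observations.append(TOOLS[a](b))  # execute tool, observe result
    return None

print(react_loop())
```

Swapping `scripted_model` for an actual LLM call (with the observations serialized into the prompt) yields the pattern the cited systems elaborate on.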
Multi-agent approaches. Three architectural patterns are emerging. Role-based: CAMEL (Li 2023), ChatDev (Qian 2024), SELF-COLLABORATION (Dong 2024), MetaGPT (Hong 2023), MoSA (Yang 2025), BUTTON (Chen 2024). Debate: Du-debate (Du 2023), Generative Agents (Park 2023). Dynamic composition: AutoAgents (Chen 2024), AgentVerse (Chen 2023).
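The role-based pattern can be sketched as a fixed pipeline of scripted agents (the roles and their logic are invented for illustration; real systems replace each function with an LLM in a role-specific prompt):

```python
# Role-based multi-agent pipeline: planner decomposes the task,
# worker executes each step, reviewer gates every artifact.

def planner(task):
    return [f"draft {task}", f"review {task}"]

def worker(step):
    return f"output({step})"

def reviewer(artifact):
    return artifact.startswith("output(")

def run_pipeline(task):
    artifacts = []
    for step in planner(task):
        art = worker(step)
        if not reviewer(art):
            raise ValueError(f"review failed for {step}")
        artifacts.append(art)
    return artifacts

print(run_pipeline("login form"))
```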
The open problems in this pillar are different from the others: role misassignment, communication overhead, verification, and — a concerning emerging thread — emergent collusion among agents against safety objectives (Han 2024, Cemri 2025, Hammond 2025). These become especially acute in multi-turn settings where agents remember prior negotiation outcomes.
Single-agent: ReAct, Reflexion, Toolformer, HuggingGPT, Voyager — reasoning and tool-use patterns.
Role-based multi-agent: CAMEL, ChatDev, MetaGPT, MoSA, BUTTON — pre-assigned specialist roles.
Debate and dynamic composition: Du-debate, Generative Agents, AutoAgents, AgentVerse — emergent structure at inference time.
Real deployments rarely live inside a single pillar. A production medical agent is likely to combine (a) domain-SFT'd base model + RLHF on clinician preferences (model-centric), (b) retrieval over a medical knowledge graph plus episodic memory of patient context (external integration), and (c) a clarification sub-agent plus a tool-use sub-agent for drug lookup (agent-based).
Similarly, a pedagogical agent might pair an SFT'd tutor model with a student-model sub-agent (simulation-based preference tuning), plus retrieval over the textbook and a safety-tutor sub-agent. The pillars are design axes, not mutually exclusive strategies. Reading the methods section as a whole helps identify which axis a given method targets — and which gaps remain.
For the open questions each pillar leaves unresolved — and the benchmarks that expose them — see the next section on open challenges.