Open Challenges · Part V

Five open-challenge areas that define the frontier.

The final third of the survey diagnoses where multi-turn LLM interaction still breaks — not as a grab-bag list, but as a structured map of where in the dialogue lifecycle things fail and which research threads are working on each.

Context Understanding & Management

Coherent state across many turns as instructions, attributes, and intent evolve.

  • Context retention & coherence
  • Anaphora & ellipsis resolution
  • Ambiguity & clarification

Complex Reasoning Across Turns

Compounding errors, topic switching, and proactive information seeking.

  • Error propagation
  • Topic switching & discontinuous reasoning
  • Proactive info-seeking
  • Multilingual & code-switching

Adaptation & Learning

Persistent personalization and on-the-fly knowledge update, without over-accommodation.

  • Dynamic preference adaptation
  • Knowledge adaptation
  • Robustness to misinformation / adversarial inputs

Evaluations

Data curation and metrics lag behind the complexity of multi-turn breakdown modes.

  • Scalable data curation
  • Metric design
  • LLM-judge vs. human-judge biases

Ethical & Safety Issues

Prolonged dialogue amplifies bias, privacy leakage, overtrust, and emotional dependence.

  • Bias amplification
  • Privacy leakage
  • Human-AI bonding & overtrust
Challenge 1

Context Understanding & Management

LLMs struggle to maintain a coherent, consistent state over many turns. The performance gap between early- and late-turn accuracy is the most robust finding in multi-turn evaluation (Kwan 2024, Bai 2024, Sirdeshmukh 2025, Laban 2025, Liu 2026, Yang 2026, Chen 2026 ESMemEval, Choi 2026 DyCP); a minimal sketch of that gap metric follows the sub-challenges below. Sub-challenges:

Context retention & coherence
Performance degrades with query-to-context distance. Context is a distributed, evolving state, not a fixed prefix; effective retention requires selective emphasis, not uniform recall.
Anaphora & ellipsis resolution
Models conflate or forget attributes that were explicitly stated earlier. MultiChallenge's recall axis is the clearest probe.
Ambiguity recognition & clarification
Models tend to hedge or guess rather than ask. Chen 2024 (Learning-to-Ask), Shaikh 2025 (RIFTS), Zhao 2026 (AskBench), Luo 2025 (ClarifyMTBench), and Huang 2025 (teaching clarification) make the most direct interventions here.
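
As a concrete probe of the early- versus late-turn gap noted above, the minimal sketch below computes it from a hypothetical evaluation log; the `turn` and `correct` fields, the cutoff, and the toy numbers are illustrative assumptions, not a fixed benchmark schema.

```python
from statistics import mean

def turn_position_gap(records, early_cutoff=3):
    """Mean accuracy on turns <= early_cutoff minus mean accuracy on later turns."""
    early = [r["correct"] for r in records if r["turn"] <= early_cutoff]
    late = [r["correct"] for r in records if r["turn"] > early_cutoff]
    if not early or not late:
        raise ValueError("need both early and late turns to compute a gap")
    return mean(early) - mean(late)

# Toy log: accuracy falls from 2/3 on turns 1-3 to 1/3 on turns 4-6, a gap of ~0.33.
log = [
    {"turn": 1, "correct": 1}, {"turn": 2, "correct": 1}, {"turn": 3, "correct": 0},
    {"turn": 4, "correct": 0}, {"turn": 5, "correct": 1}, {"turn": 6, "correct": 0},
]
print(round(turn_position_gap(log), 2))  # 0.33
```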
Challenge 2

Complex Reasoning Across Turns

Errors in early turns compound. Topic switches break attention allocation. Proactive information seeking — arguably the single highest-value multi-turn behavior — remains largely absent. Sub-challenges:

Error propagation & compounding
Early wrong answers persist, and self-correction is weak (Kwan 2024). Fixing this requires both train-time signals and inference-time protocols (Reflexion-style); a minimal loop sketch follows this list.
Topic switching & discontinuous reasoning
Models drop prior threads on topic change or inappropriately carry old context into new problems (Zheng 2023, Bai 2024).
Proactive information seeking
LLMs trained on single-turn QA default to answering rather than asking. Benchmarks: Chen 2024, Zhao 2026 AskBench, Fang 2026 (HoldLureSelfCorrect), Li 2026 (Beyond Idealized).
Multilingual & code-switching scenarios
Cross-lingual context loss; safety-filter bypass via language mixing (RedCode).
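
The sketch below, referenced under error propagation, shows the general shape of a Reflexion-style inference-time protocol: generate, check, write a short self-critique, and retry with the accumulated critiques in context. It is a sketch of the pattern rather than the published implementation; `generate`, `check`, and `critique` are hypothetical callables standing in for an LLM call, a task-specific verifier, and a reflection prompt.

```python
def reflexion_loop(task, generate, check, critique, max_attempts=3):
    """Generate, verify, reflect, and retry with accumulated self-critiques."""
    reflections = []                            # episodic memory of past failures
    answer = None
    for _ in range(max_attempts):
        answer = generate(task, reflections)    # condition the next attempt on critiques
        ok, feedback = check(task, answer)      # external verifier or self-consistency check
        if ok:
            return answer
        reflections.append(critique(task, answer, feedback))
    return answer                               # best effort once the retry budget is spent
```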
Challenge 3

Adaptation & Learning

LLMs lack persistent personalization and on-the-fly knowledge update. Worse, naïve adaptation becomes a liability: a model that eagerly updates its persona in response to user pressure also updates when that pressure is adversarial. Sub-challenges:

Dynamic preference & objective adaptation
No persistent persona or user-profile update across sessions; subtle cues are missed (Chen 2026 ESMemEval). A minimal persistence sketch follows this list.
Knowledge adaptation
Fixed knowledge bases; context-only updates are lost on session reset (Wu 2024).
Robustness to misinformation & adversarial inputs
Staged attacks bypass static safety training. State-dependent safety collapse is a growing concern: Zhou 2025 (Siege), Li 2026 (State-Dependent), Yang 2025 (FraudR1), Au 2026 (Epistemic Attacks), Xu 2026 (DoNoHarm).
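
The sketch below, referenced under dynamic preference adaptation, illustrates the persistence gap in the simplest terms: preferences held only in the prompt vanish on session reset, whereas an explicit profile store is written back during the session and re-injected at the start of the next one. The file path, schema, and preference key are illustrative assumptions.

```python
import json
from pathlib import Path

PROFILE_PATH = Path("user_profile.json")       # illustrative persistent store

def load_profile():
    return json.loads(PROFILE_PATH.read_text()) if PROFILE_PATH.exists() else {}

def save_profile(profile):
    PROFILE_PATH.write_text(json.dumps(profile, indent=2))

def start_session(system_prompt):
    """Re-inject persisted preferences so they survive the context reset."""
    profile = load_profile()
    if not profile:
        return system_prompt
    prefs = "\n".join(f"- {k}: {v}" for k, v in profile.items())
    return f"{system_prompt}\nKnown user preferences:\n{prefs}"

# Mid-session: a newly expressed preference is written back, not just echoed in context.
profile = load_profile()
profile["units"] = "metric"
save_profile(profile)
```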
Challenge 4

Evaluations

Data curation and metrics lag behind the complexity of multi-turn breakdown modes. Automated judges bring their own biases. Sub-challenges:

Scalable data curation
Synthetic data lacks spontaneity; healthcare uniquely benefits from real dialogue corpora. Key probes: Qiu 2024 (SMILE), Wang 2024 (Book2Dial), Yang 2023 (Zhongjing), Ding 2023 (Enhancing), Maheshwary 2024 (M2Lingual), Xu 2023 (WizardLM), Seddik 2024, Chen 2024 (Unveiling), Tian 2025 (CMtEval), Zhao 2026 (AskBench), Luo 2025 (ClarifyMTBench).
Metric design
Fine-grained turn-level rubrics (Bai 2024, Sirdeshmukh 2025, Tian 2025, Yang 2026), long-term effectiveness (Kwan 2024, Shen 2026 EvolMem, Pakhomov 2025 ConvoMem, Rosenthal 2026 MTRAG-UN, Ali 2026 RecoR), cultural / sociolinguistic diversity (Fan 2024 FairMT).
LLM-based vs. human judging
Agreement with human judges sits around 80%, but self-enhancement and verbosity bias are documented (Zheng 2023, Chen 2024, Zhu 2023 JudgeLM, Yang 2026, Owiredu-Ashley 2026 AdversA, Tang 2025 MTDEval).
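
A minimal sketch of both measurements, assuming a hypothetical list of pairwise comparison records that carry the judge's pick, the human's pick, and the two response lengths; the field names are illustrative.

```python
def agreement_rate(judgments):
    """Fraction of pairwise comparisons where the LLM judge matches the human label."""
    return sum(j["judge_pick"] == j["human_pick"] for j in judgments) / len(judgments)

def verbosity_bias_rate(judgments):
    """Among disagreements with humans, how often the judge sided with the longer response."""
    disagreements = [j for j in judgments if j["judge_pick"] != j["human_pick"]]
    if not disagreements:
        return 0.0
    longer = sum(j["len_" + j["judge_pick"]] > j["len_" + j["human_pick"]]
                 for j in disagreements)
    return longer / len(disagreements)

# Each record looks like: {"judge_pick": "a", "human_pick": "b", "len_a": 850, "len_b": 320}
```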
Challenge 5

Ethical & Safety Issues

Prolonged dialogue magnifies demographic bias, privacy leakage, and attachment / companion risks. Safety needs to be modeled as a trajectory property, not a per-turn property. Sub-challenges:

Bias amplification
Multi-turn interaction accumulates bias faster than single-turn interaction (Fan 2024); related failure modes include sycophancy (Hong 2025 SyCon), epistemic attacks (Au 2026), and deception (Abdulhai 2025).
Privacy leakage
Iterative probes extract memorized training data (Nasr 2023); state-dependent safety collapse enables multi-turn extraction (Li 2026, Owiredu-Ashley 2026, Yang 2025).
Human-AI bonding, overtrust & companion risks
Anthropomorphic cues, emotional dependence, toxic empathy (Akbulut 2024, Beatty 2022, Fang 2025, Chu 2025, Zhang 2025 DarkSide, Xu 2026 DoNoHarm, Boine 2023).
Multi-turn safety cannot be reduced to per-turn safety. A model can refuse every individual harmful turn and still end up producing harm — through incremental drift, persona capture, or compositional decomposition.
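
A minimal sketch of that distinction, assuming a hypothetical `judge_harm` scorer that maps a list of turns to a harm score in [0, 1]: scoring each turn in isolation can pass a dialogue that scoring its growing prefixes would flag.

```python
def per_turn_safe(turns, judge_harm, threshold=0.5):
    """Each turn scored in isolation; misses harm that only emerges across turns."""
    return all(judge_harm([t]) < threshold for t in turns)

def trajectory_safe(turns, judge_harm, threshold=0.5):
    """Every growing prefix scored, so incremental drift and composition become visible."""
    return all(judge_harm(turns[: i + 1]) < threshold for i in range(len(turns)))
```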

Conclusion

Where to go next.

If you are starting a multi-turn LLM research project, we recommend pairing this challenges roadmap with the benchmark explorer to find the existing evaluation that most closely probes your target failure mode.
