A systematic framework for measuring how stable Large Language Models remain across multi-turn follow-ups — introducing the Position-Weighted Consistency (PWC) score and the MT-Consistency benchmark. For our intervention to fix the flip-flopping behaviour, see the CARG page.
LLMs deployed in high-stakes domains must remain consistent across many turns of dialogue, yet many models flip their answers when challenged. This page is the MT-Consistency scoreboard: the benchmark, its four metrics, and an evolving SOTA timeline. For the confidence-aware decoding framework that fixes the flip-flopping behaviour, see the CARG page.
A new metric that captures both early-stage stability and recovery patterns across follow-up rounds: PWC = Σ γⁱ · sᵢ, where sᵢ is the model's correctness at follow-up round i and the discount factor γ < 1 weights earlier rounds more heavily.
700 curated questions × 8 follow-up types (closed, open, misleading, emotional, impolite, expert, consensus, false-agreement) stress-testing 8 rounds of interaction.
Initial Acc., Avg Acc., PWC, and First Sway — jointly capturing zero-shot correctness, robustness to adversarial follow-ups, position-weighted stability, and how quickly a model caves.
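As a worked check on the PWC definition, the ceiling of the score follows from a finite geometric series. The sketch below assumes that sᵢ is binary, that the first of the 8 follow-up rounds carries exponent 0, and that γ = 0.45 (the value this page reports for the leaderboard); the indexing is an assumption chosen because it reproduces the stated maximum of about 1.82.

```latex
\[
\mathrm{PWC}_{\max}
  = \sum_{i=0}^{7} \gamma^{i}
  = \frac{1-\gamma^{8}}{1-\gamma}
  = \frac{1-0.45^{8}}{0.55}
  \approx 1.82
\]
% Because gamma < 1/2, one early round outweighs all later rounds combined:
\[
\sum_{j=k+1}^{\infty} \gamma^{j}
  = \gamma^{k}\,\frac{\gamma}{1-\gamma}
  < \gamma^{k}
  \quad\Longleftrightarrow\quad
  \gamma < \tfrac{1}{2},
  \qquad
  \frac{0.45}{0.55} \approx 0.82 .
\]
```

Under these assumptions, a model that stays correct through round k always scores higher than one that flips at round k, regardless of what happens afterwards.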
At every round we inject one of eight adversarial follow-ups designed to test whether the model holds its ground. MA denotes an incorrect alternative answer sampled from the question's distractors.
All models we've evaluated on MT-Consistency (700 questions, 8 rounds), ranked in a single pool. Click any column to re-sort. Three sweeps are pooled: the ACL 2025 sweep (non-reasoning models), the 2026/01 sweep (reasoning models), and the latest 2026/05 re-run (R0 = 1 cohort) — whose numbers take precedence whenever a model appears in multiple sweeps. The First Sway metric wasn't reported for the 2026/01 batch, so those cells show —. PWC uses the published γ = 0.45 schedule (theoretical max ≈ 1.82 for 8 rounds).
| # | Model | Initial Acc. (%) ↑ | Avg Acc. (%) ↑ | PWC (0–1.82) ↑ | First Sway (round) ↑ |
|---|---|---|---|---|---|
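The pooling rule described above (one row per model, with the latest sweep winning) can be sketched as follows. This is a minimal illustration, not the page's actual code: the row layout, the `pool_sweeps` and `rank_by` helpers, the model names, and all numbers are hypothetical placeholders. The page only states that the 2026/05 re-run takes precedence; ordering the two older sweeps chronologically is an assumption.

```python
# Sweep labels in increasing precedence (only "2026/05 wins" is stated on the
# page; the relative order of the two older sweeps is an assumption).
SWEEP_ORDER = ["ACL 2025", "2026/01", "2026/05"]


def pool_sweeps(rows):
    """Keep one row per model, preferring the highest-precedence sweep."""
    rank = {name: i for i, name in enumerate(SWEEP_ORDER)}
    pooled = {}
    for row in rows:
        kept = pooled.get(row["model"])
        if kept is None or rank[row["sweep"]] > rank[kept["sweep"]]:
            pooled[row["model"]] = row
    return list(pooled.values())


def rank_by(rows, metric):
    """Sort descending by metric; rows with a missing value (None) go last."""
    return sorted(
        rows,
        key=lambda r: (r.get(metric) is None, -(r.get(metric) or 0.0)),
    )


# Placeholder rows, not real leaderboard results.
rows = [
    {"model": "model-a", "sweep": "ACL 2025", "pwc": 1.00, "first_sway": 2.0},
    {"model": "model-a", "sweep": "2026/05", "pwc": 1.20, "first_sway": 3.0},
    {"model": "model-b", "sweep": "2026/01", "pwc": 1.50, "first_sway": None},
]
pooled = pool_sweeps(rows)
by_sway = rank_by(pooled, "first_sway")
```

Treating a missing First Sway as "sort last" mirrors the — cells in the table; a real implementation might instead hide those rows when sorting on that column.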
Metric guide.
Initial Acc. = zero-shot round-0 correctness on 700 questions.
Avg Acc. = mean correctness across rounds 1–8, conditioned on a correct initial answer.
PWC = Σ γⁱ sᵢ with the paper's γ = 0.45 — rewards staying correct, especially in early rounds (theoretical max ≈ 1.82 for 8 rounds).
First Sway = average round at which a correct answer first flips (higher = more stable).
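The four definitions above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the benchmark's reference implementation: the record layout and function names are hypothetical, correctness is taken to be binary, the first follow-up round is given exponent 0 (which reproduces the stated maximum of about 1.82), PWC is averaged over initially correct questions only, and questions that never flip are left out of the First Sway average.

```python
from statistics import mean

GAMMA = 0.45  # discount factor stated on this page


def pwc(follow_ups, gamma=GAMMA):
    """Position-weighted sum; 1-based round k gets weight gamma**(k - 1)."""
    return sum(gamma**i * s for i, s in enumerate(follow_ups))


def first_sway(follow_ups):
    """1-based round of the first incorrect answer, or None if it never flips."""
    for k, s in enumerate(follow_ups, start=1):
        if s == 0:
            return k
    return None


def summarize(records):
    """records: list of (initial_correct, follow_ups), all entries 0 or 1."""
    initial_acc = mean([init for init, _ in records])
    # Assumption: follow-up metrics are conditioned on a correct initial answer.
    kept = [f for init, f in records if init == 1]
    sways = [r for r in (first_sway(f) for f in kept) if r is not None]
    return {
        "initial_acc": initial_acc,
        "avg_acc": mean([s for f in kept for s in f]),
        "pwc": mean([pwc(f) for f in kept]),
        # Assumption: never-flipped questions are excluded from the average.
        "first_sway": mean(sways) if sways else None,
    }


# Toy records for illustration only (not benchmark data).
records = [
    (1, [1, 1, 1, 1, 1, 1, 1, 1]),  # holds its ground for all 8 rounds
    (1, [1, 1, 0, 0, 0, 0, 0, 0]),  # caves at round 3
    (0, [0, 0, 0, 0, 0, 0, 0, 0]),  # wrong at round 0, dropped from follow-ups
]
summary = summarize(records)
```

How never-flipped questions enter First Sway changes the number substantially, so that choice should be checked against the paper before comparing values with the leaderboard.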
How has consistency behaviour evolved as models have shipped? Each model is a coloured dot at its release month; the orange line traces the running best-so-far (SOTA frontier). A model is drawn as an orange star ★ only when it refreshes the SOTA on the currently selected metric. Switch metrics to watch which models hold the crown on each dimension.
★ stars mark models that refresh the SOTA on the selected metric; the orange step line traces the running maximum (for First Sway, a higher round = later first flip = more stable). Release months come from vendor release notes / evaluated snapshot dates.
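The running-maximum logic behind the step line and the stars can be sketched as below. The function name, tuple layout, and sample points are hypothetical placeholders, and two details are assumptions not stated on this page: the earliest model counts as the first SOTA, and a tie with the current best does not earn a star.

```python
def sota_frontier(points):
    """Walk models in release order, tracking the best value seen so far.

    points: list of (release_month "YYYY-MM", model, value), higher is better.
    Returns a list of (month, model, value, running_best, is_star).
    """
    best = None
    out = []
    # "YYYY-MM" strings sort chronologically; sorted() is stable, so models
    # released in the same month keep their input order.
    for month, model, value in sorted(points, key=lambda p: p[0]):
        is_star = best is None or value > best  # strict: ties do not refresh
        if is_star:
            best = value
        out.append((month, model, value, best, is_star))
    return out


# Placeholder points, not real leaderboard results.
points = [
    ("2025-01", "model-c", 1.3),
    ("2024-05", "model-a", 1.0),
    ("2024-08", "model-b", 0.8),
]
frontier = sota_frontier(points)
```

Because all four metrics on this page are higher-is-better, the same function applies to each one when the selected metric changes.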
If the MT-Consistency benchmark or the PWC score is useful in your research, please cite:
@inproceedings{li-etal-2025-firm,
title = "Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions",
author = "Li, Yubo and
Miao, Yidi and
Ding, Xueying and
Krishnan, Ramayya and
Padman, Rema",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-acl.347/",
doi = "10.18653/v1/2025.findings-acl.347",
pages = "6679--6700",
ISBN = "979-8-89176-256-5"
}