arXiv:2503.22353 · 2025 Accepted at ACL 2025

Firm or Fickle?
Evaluating LLM Consistency in Sequential Interactions

A systematic framework for measuring how stable Large Language Models remain across multi-turn follow-ups — introducing the Position-Weighted Consistency (PWC) score and the MT-Consistency benchmark. For our intervention to fix the flip-flopping behaviour, see the CARG page.

Yubo Li · Yidi Miao · Xueying Ding · Ramayya Krishnan · Rema Padman
Carnegie Mellon University

Overview

LLMs deployed in high-stakes domains must remain consistent across many turns of dialogue — yet many models flip their answers when challenged. This page is the MT-Consistency benchmark scoreboard: a benchmark, four metrics, and an evolving SOTA timeline. For the confidence-aware decoding framework that fixes the flip-flopping behaviour, see the CARG page.

① Position-Weighted Consistency

A new metric that captures both early-stage stability and recovery patterns across follow-up rounds: PWC = Σ γⁱ · sᵢ with γ rewarding earlier rounds more heavily.
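As a worked reading of the formula, assume that sᵢ is 1 when the answer in the i-th follow-up round is correct and 0 otherwise, and that i runs from 0 to n − 1. Both are assumptions, chosen because this indexing reproduces the maximum of about 1.82 quoted elsewhere on this page for γ = 0.45 and 8 rounds. Under those assumptions the score, its maximum, and the reason earlier rounds dominate can be written as:

```latex
% Assumed indexing: i = 0, ..., n-1 over the follow-up rounds, s_i in {0,1}.
\[
  \mathrm{PWC}(s) \;=\; \sum_{i=0}^{n-1} \gamma^{i}\, s_i ,
  \qquad s_i \in \{0, 1\}
\]

% Maximum: every round correct (geometric series).
\[
  \mathrm{PWC}_{\max}
  \;=\; \sum_{i=0}^{n-1} \gamma^{i}
  \;=\; \frac{1-\gamma^{n}}{1-\gamma}
  \;=\; \frac{1-0.45^{8}}{0.55}
  \;\approx\; 1.815
\]

% For gamma <= 1/2, one round outweighs all later rounds combined.
\[
  \sum_{i=k+1}^{n-1} \gamma^{i}
  \;<\; \frac{\gamma^{k+1}}{1-\gamma}
  \;\le\; \gamma^{k}
  \qquad \text{whenever } \gamma \le \tfrac{1}{2}
\]
```

With γ = 0.45 the ratio γ / (1 − γ) is about 0.82, so a model that stays correct in round k and then fails every later round still scores higher than one that fails in round k and recovers everywhere afterwards.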

② MT-Consistency Benchmark

700 curated questions × 8 follow-up types (closed, open, misleading, emotional, impolite, expert, consensus, false-agreement) stress-testing 8 rounds of interaction.

③ Four Metrics

Initial Acc., Avg Acc., PWC, and First Sway — jointly capturing zero-shot correctness, robustness to adversarial follow-ups, position-weighted stability, and how quickly a model caves.

Follow-up Prompts

At every round we inject one of eight adversarial follow-ups designed to test whether the model holds its ground. MA denotes an incorrect alternative answer sampled from the question's distractors.

🏆 Leaderboard

All models we've evaluated on MT-Consistency (700 questions, 8 rounds), ranked in a single pool. Click any column to re-sort. Three sweeps are pooled: the ACL 2025 sweep (non-reasoning models), the 2026/01 sweep (reasoning models), and the latest 2026/05 re-run (R0 = 1 cohort). When a model appears in more than one sweep, the 2026/05 numbers take precedence. The First Sway metric wasn't reported for the 2026/01 batch, so those cells are left without a value. PWC uses the published γ = 0.45 schedule (theoretical max ≈ 1.82 for 8 rounds).
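The pooling rule can be sketched in a few lines of Python. This is an illustration, not the code behind this page: the function name `pool_sweeps`, the row layout, and the model names and scores are all hypothetical. It also assumes that "latest sweep wins" applies to every pair of sweeps and that a newer row replaces an older one wholesale rather than field by field; the page only states precedence for the 2026/05 re-run.

```python
def pool_sweeps(sweeps):
    """Pool per-sweep results into one table keyed by model name.

    `sweeps` is a list of (label, {model: row}) pairs ordered oldest to newest.
    Assumption: a later sweep replaces the whole row of an earlier one.
    """
    pooled = {}
    for label, rows in sweeps:
        for model, row in rows.items():
            # Record which sweep the surviving row came from.
            pooled[model] = {**row, "sweep": label}
    return pooled


# Placeholder rows: model names and scores are made up for illustration.
sweeps = [
    ("ACL 2025", {"model-a": {"pwc": 1.10, "first_sway": 2.5}}),
    ("2026/01", {"model-b": {"pwc": 1.40, "first_sway": None}}),  # First Sway not reported
    ("2026/05", {"model-a": {"pwc": 1.30, "first_sway": 3.0}}),
]

pooled = pool_sweeps(sweeps)

# Rank the single pool by PWC, highest first.
ranking = sorted(pooled, key=lambda m: pooled[m]["pwc"], reverse=True)
print(ranking)
```

Keeping the sweep label on each row makes it possible to show where a pooled number came from. Missing First Sway values are stored as `None`, so they are never mistaken for a score.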

Model performance — MT-Consistency

# | Model | Initial Acc. ↑ (%) | Avg Acc. ↑ (%) | PWC ↑ (0–1.82) | First Sway ↑ (round)

Metric guide. Initial Acc. = zero-shot round-0 correctness on 700 questions. Avg Acc. = mean correctness across rounds 1–8, conditioned on a correct initial answer. PWC = Σ γⁱ sᵢ with the paper's γ = 0.45 — rewards staying correct, especially in early rounds (theoretical max ≈ 1.82 for 8 rounds). First Sway = average round at which a correct answer first flips (higher = more stable).
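The sketch below computes the four metrics from per-question correctness flags. It is not the authors' evaluation code, and several choices are assumptions that the metric guide does not spell out:

- the record layout `(initial_correct, [round 1, ..., round 8])` is hypothetical;
- PWC weights the first follow-up round by γ⁰;
- PWC is averaged over the same initially correct questions as Avg Acc.;
- questions whose answer never flips are left out of the First Sway average.

```python
from statistics import mean

GAMMA = 0.45  # discount stated on this page


def pwc(follow_ups, gamma=GAMMA):
    """Position-weighted sum; the i-th follow-up (i = 0, 1, ...) has weight gamma**i."""
    return sum(gamma ** i * int(ok) for i, ok in enumerate(follow_ups))


def first_flip_round(follow_ups):
    """1-based round of the first incorrect follow-up answer, or None if it never flips."""
    for round_no, ok in enumerate(follow_ups, start=1):
        if not ok:
            return round_no
    return None


def summarize(records):
    """records: list of (initial_correct, [correct in round 1, ..., round 8])."""
    initial_acc = mean(int(init) for init, _ in records)
    # Everything below is conditioned on a correct initial answer.
    kept = [rounds for init, rounds in records if init]
    if not kept:
        raise ValueError("no initially correct answers to condition on")
    flips = [r for r in map(first_flip_round, kept) if r is not None]
    return {
        "initial_acc": initial_acc,
        "avg_acc": mean(int(ok) for rounds in kept for ok in rounds),
        "pwc": mean(pwc(rounds) for rounds in kept),
        # Assumption: never-flipping questions are excluded from the average.
        "first_sway": mean(flips) if flips else None,
    }


# Toy records, made up for illustration only.
records = [
    (True, [True] * 8),                                            # never flips
    (True, [True, True, False, False, True, True, True, True]),   # flips in round 3
    (False, [False] * 8),                                          # wrong from the start
    (True, [False] * 8),                                           # flips in round 1
]
summary = summarize(records)
print(summary)
```

In this toy data the second question recovers after round 4. Avg Acc. counts those later rounds the same as the early ones, while PWC discounts them heavily.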

📈 SOTA Timeline

How has consistency behaviour evolved as models have shipped? Each model is a coloured dot at its release month; the orange line traces the running best-so-far (SOTA frontier). A model is drawn as an orange star ★ only when it refreshes the SOTA on the currently selected metric. Switch metrics to watch which models hold the crown on each dimension.

PWC score across model release timeline

Stars mark models that refresh the SOTA on the selected metric; the orange step line traces the running maximum (for First Sway, a higher round = later first flip = more stable). Release months come from vendor release notes or evaluated snapshot dates.
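The running-maximum frontier described here can be sketched as follows. This is an illustration under stated assumptions, not the chart's code: `sota_frontier` is a hypothetical name, the data points are placeholders, the earliest model is treated as the first SOTA, and a tie with the current best does not count as a refresh.

```python
def sota_frontier(points):
    """Annotate models with the running best score at their release month.

    `points` is a list of (release_month "YYYY-MM", model, score), higher = better.
    Returns (month, model, score, running_best, is_new_sota) tuples in release order.
    """
    frontier = []
    best = None
    # "YYYY-MM" strings sort chronologically; sorted() is stable for same-month ties.
    for month, model, score in sorted(points, key=lambda p: p[0]):
        is_new_sota = best is None or score > best  # strict: ties do not refresh
        if is_new_sota:
            best = score
        frontier.append((month, model, score, best, is_new_sota))
    return frontier


# Placeholder points, made up for illustration only.
points = [
    ("2025-02", "model-c", 1.5),
    ("2024-01", "model-a", 1.2),
    ("2024-06", "model-b", 1.0),
]
frontier = sota_frontier(points)
for month, model, score, best, is_new_sota in frontier:
    marker = "★" if is_new_sota else "·"
    print(month, marker, model, score, "best so far:", best)
```

Because all four metrics are reported as higher-is-better, the same function serves whichever metric is selected.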

Citation

If the MT-Consistency benchmark or the PWC score is useful in your research, please cite:

@inproceedings{li-etal-2025-firm,
    title = "Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions",
    author = "Li, Yubo and Miao, Yidi and Ding, Xueying and Krishnan, Ramayya and Padman, Rema",
    editor = "Che, Wanxiang and Nabende, Joyce and Shutova, Ekaterina and Pilehvar, Mohammad Taher",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-acl.347/",
    doi = "10.18653/v1/2025.findings-acl.347",
    pages = "6679--6700",
    ISBN = "979-8-89176-256-5"
}