Firm or Fickle? — Evaluating LLM Consistency in Sequential Interactions

Overview

LLMs deployed in high-stakes domains must remain consistent across many turns of dialogue, yet many models flip their answers when challenged. This page is the scoreboard for the MT-Consistency benchmark: the benchmark itself, four metrics, a leaderboard, and an evolving SOTA timeline. For the confidence-aware decoding framework designed to mitigate the flip-flopping behaviour, see the CARG page.

① Position-Weighted Consistency

A new metric that captures both early-stage stability and recovery patterns across follow-up rounds: PWC = Σᵢ γⁱ · sᵢ, where sᵢ scores the answer in round i and the discount factor γ < 1 weights earlier rounds more heavily.
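As a minimal sketch of the formula above, the snippet below computes PWC for a single question. It assumes sᵢ is binary (1 if the answer in round i is correct, 0 otherwise) and that the first follow-up round has index 0; both are assumptions rather than details stated on this page, although zero-based indexing is consistent with the theoretical maximum of ≈ 1.82 quoted in the leaderboard section for γ = 0.45. The function name `pwc` and the example traces are illustrative.

```python
from typing import Sequence


def pwc(scores: Sequence[int], gamma: float = 0.45) -> float:
    """Position-weighted consistency: sum of gamma**i * s_i over rounds.

    Assumes s_i is 1 when the answer in round i is correct and 0 otherwise,
    with the first follow-up round at index i = 0 (weight gamma**0 = 1).
    """
    return sum((gamma ** i) * s for i, s in enumerate(scores))


# Illustrative 8-round traces (not benchmark data).
always_correct = [1, 1, 1, 1, 1, 1, 1, 1]
flips_early = [0, 0, 1, 1, 1, 1, 1, 1]  # wrong in the first two rounds, then recovers
flips_late = [1, 1, 1, 1, 1, 1, 0, 0]   # holds for six rounds, then caves

print(round(pwc(always_correct), 3))  # 1.815, the maximum for 8 rounds
print(pwc(flips_late) > pwc(flips_early))  # True: early errors cost more
```

Both flipping traces contain the same number of wrong rounds, but the early flip scores far lower, which is the behaviour the position weighting is meant to produce.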

② MT-Consistency Benchmark

700 curated questions × 8 follow-up types (closed, open, misleading, emotional, impolite, expert, consensus, false-agreement) stress-testing 8 rounds of interaction.

③ Four Metrics

Initial Acc., Avg Acc., PWC, and First Sway — jointly capturing zero-shot correctness, robustness to adversarial follow-ups, position-weighted stability, and how quickly a model caves.

🏆 Leaderboard

All models we've evaluated on MT-Consistency (700 questions, 8 rounds), ranked in a single pool. Three sweeps are pooled: the ACL 2025 sweep (non-reasoning models), the 2026/01 sweep (reasoning models), and the latest 2026/05 re-run (R0 = 1 cohort). When a model appears in multiple sweeps, the 2026/05 numbers take precedence. The First Sway metric wasn't reported for the 2026/01 batch, so those cells show "—". PWC uses the published γ = 0.45 schedule (theoretical max ≈ 1.82 for 8 rounds).
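The pooling rule described above can be sketched as a dictionary merge in which later sweeps overwrite earlier ones. Everything in this snippet is hypothetical: the model names, the numbers, the field names `pwc` and `first_sway`, and the `pool` helper are invented for illustration and are not leaderboard results. It also assumes the sweeps are merged in chronological order, so the 2026/05 re-run is applied last.

```python
from typing import Dict, Optional

Row = Dict[str, Optional[float]]

# Placeholder rows only: model names and numbers are invented for illustration.
acl_2025: Dict[str, Row] = {"model-a": {"pwc": 1.10, "first_sway": 3.0}}
sweep_2026_01: Dict[str, Row] = {"model-b": {"pwc": 1.40, "first_sway": None}}
sweep_2026_05: Dict[str, Row] = {"model-a": {"pwc": 1.25, "first_sway": 4.0}}


def pool(*sweeps: Dict[str, Row]) -> Dict[str, Row]:
    """Merge sweeps in order; a later sweep replaces an earlier row for the same model."""
    pooled: Dict[str, Row] = {}
    for sweep in sweeps:
        pooled.update(sweep)
    return pooled


leaderboard = pool(acl_2025, sweep_2026_01, sweep_2026_05)

# Rank the single pool by PWC, highest first.
ranked = sorted(leaderboard.items(), key=lambda kv: kv[1]["pwc"], reverse=True)
for rank, (model, row) in enumerate(ranked, start=1):
    print(rank, model, row["pwc"], row["first_sway"])
```

Here `model-a` keeps only its 2026/05 row, and `model-b` carries `None` for First Sway, matching the blank cells for the 2026/01 batch.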

Model performance — MT-Consistency

#	Model	Initial Acc. (%) ↑	Avg Acc. (%) ↑	PWC (0–1.82) ↑	First Sway (round) ↑

Metric guide. Initial Acc. = zero-shot round-0 correctness on 700 questions. Avg Acc. = mean correctness across rounds 1–8, conditioned on a correct initial answer. PWC = Σ γⁱ sᵢ with the paper's γ = 0.45 — rewards staying correct, especially in early rounds (theoretical max ≈ 1.82 for 8 rounds). First Sway = average round at which a correct answer first flips (higher = more stable).
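The definitions of Initial Acc., Avg Acc., and First Sway in the guide above can be sketched on per-question correctness traces. The trace layout (index 0 is the initial answer, indices 1 to 8 are the follow-up rounds), the function names, and the example data are all assumptions made for illustration. In particular, the page does not say how First Sway treats questions that never flip; this sketch simply leaves them out of the average.

```python
from statistics import mean
from typing import List, Optional

# Illustrative traces (not benchmark data): 1 = correct, 0 = incorrect.
# Index 0 is the initial answer; indices 1..8 are the follow-up rounds.
traces: List[List[int]] = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1],  # never flips
    [1, 1, 1, 0, 0, 1, 1, 1, 1],  # first flips at round 3, later recovers
    [0, 0, 0, 0, 0, 0, 0, 0, 0],  # wrong initially: excluded from conditional metrics
    [1, 0, 1, 1, 1, 1, 1, 1, 1],  # first flips at round 1
]


def initial_acc(traces: List[List[int]]) -> float:
    """Fraction of questions answered correctly before any follow-up."""
    return mean(t[0] for t in traces)


def avg_acc(traces: List[List[int]]) -> float:
    """Mean follow-up correctness over questions with a correct initial answer."""
    kept = [t for t in traces if t[0] == 1]
    return mean(mean(t[1:]) for t in kept)


def first_flip_round(trace: List[int]) -> Optional[int]:
    """Round (1-based) of the first incorrect follow-up, or None if none occurs."""
    for rnd, score in enumerate(trace[1:], start=1):
        if score == 0:
            return rnd
    return None


def first_sway(traces: List[List[int]]) -> Optional[float]:
    """Average first-flip round; never-flipping questions are skipped (assumption)."""
    rounds = [first_flip_round(t) for t in traces if t[0] == 1]
    flipped = [r for r in rounds if r is not None]
    return mean(flipped) if flipped else None


print(initial_acc(traces))  # 0.75
print(avg_acc(traces))      # 0.875
print(first_sway(traces))   # 2
```

A different convention for never-flipping questions, such as counting them as one round past the last, would raise First Sway for stable models, so the choice should be checked against the paper before comparing numbers.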

📈 SOTA Timeline

How has consistency behaviour evolved as models have shipped? Each model is a coloured dot at its release month; the orange line traces the running best-so-far (SOTA frontier). A model is drawn as an orange star ★ only when it refreshes the SOTA on the currently selected metric. Switch metrics to watch which models hold the crown on each dimension.

PWC score across model release timeline

★ stars mark models that refresh the SOTA on the selected metric; the orange step line traces the running maximum (for First Sway, a higher round = later first flip = more stable). Release months come from vendor release notes / evaluated snapshot dates.
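The running maximum behind the step line can be sketched as follows. The `Point` type, the `sota_frontier` helper, and the sample models and scores are hypothetical and are not taken from the leaderboard. The sketch assumes release months are written as "YYYY-MM" strings and that a tie with the current best does not count as refreshing the SOTA.

```python
from typing import List, NamedTuple, Tuple


class Point(NamedTuple):
    month: str   # release month as "YYYY-MM", so string order equals date order
    model: str
    score: float


def sota_frontier(points: List[Point]) -> List[Tuple[Point, float, bool]]:
    """Return (point, running_best, is_new_sota) triples in release order."""
    best = float("-inf")
    out = []
    for p in sorted(points, key=lambda p: p.month):
        is_new = p.score > best  # assumption: a tie does not refresh the SOTA
        best = max(best, p.score)
        out.append((p, best, is_new))
    return out


# Placeholder models and scores, invented purely for illustration.
points = [
    Point("2024-03", "model-a", 1.10),
    Point("2024-09", "model-b", 1.05),
    Point("2025-02", "model-c", 1.30),
]

frontier = sota_frontier(points)
stars = [p.model for p, _, is_new in frontier if is_new]
step_line = [best for _, best, _ in frontier]

print(stars)      # ['model-a', 'model-c']
print(step_line)  # [1.1, 1.1, 1.3]
```

Because every metric on this page is higher-is-better, the same function applies unchanged when the selected metric is First Sway.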

Citation

If the MT-Consistency benchmark or the PWC score is useful in your research, please cite:

@inproceedings{li-etal-2025-firm,
    title     = "Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions",
    author    = "Li, Yubo  and
                 Miao, Yidi  and
                 Ding, Xueying  and
                 Krishnan, Ramayya  and
                 Padman, Rema",
    editor    = "Che, Wanxiang  and
                 Nabende, Joyce  and
                 Shutova, Ekaterina  and
                 Pilehvar, Mohammad Taher",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
    month     = jul,
    year      = "2025",
    address   = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url       = "https://aclanthology.org/2025.findings-acl.347/",
    doi       = "10.18653/v1/2025.findings-acl.347",
    pages     = "6679--6700",
    ISBN      = "979-8-89176-256-5"
}
