Our proposed fix for multi-turn flip-flopping. CARG embeds token-level confidence scores into the dialogue history, so each answer is conditioned on how certain the model was about its earlier answers, yielding near-flat accuracy across all 8 adversarial follow-up rounds.
On the MT-Consistency benchmark, every non-reasoning LLM we tested loses between 10 and 35 percentage points of accuracy over 8 follow-up rounds when challenged adversarially. The flip happens because the model can't tell its firm answers apart from its uncertain ones: the dialogue history is just text. CARG injects that missing signal.
A three-stage extension of the standard decoding loop: extract confidence, embed it in the history, and let the decoder read it back.
Token-level log-probabilities of the answer span are aggregated into a single confidence score cₜ for each response rₜ.
Confidence is appended to each turn of the history: hₜ = {(q₁, r₁, c₁), …, (q_{t-1}, r_{t-1}, c_{t-1}), qₜ}. Future responses condition on past certainty, not just past text.
The decoder explicitly reads the confidence trajectory and decides whether to reinforce its prior stance or re-evaluate: rₜ = argmax P(r | hₜ, θ, c_{t-1}).
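The three stages above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the aggregation rule (geometric-mean token probability), the bracketed `[confidence: ...]` tag, and the names `Turn`, `aggregate_confidence`, `render_history`, `carg_step`, and `fake_generate` are all hypothetical. The model call is abstracted as a caller-supplied `generate` function that returns the response text together with the log-probabilities of its answer-span tokens.

```python
import math
from dataclasses import dataclass


@dataclass
class Turn:
    question: str      # q_t
    response: str      # r_t
    confidence: float  # c_t, a score in [0, 1]


def aggregate_confidence(token_logprobs):
    """Stage 1: collapse answer-span token log-probs into one score.

    Assumption: geometric-mean token probability, exp(mean(logprobs)).
    The text above does not say which aggregation CARG uses.
    """
    if not token_logprobs:
        raise ValueError("answer span has no tokens")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def render_history(turns, next_question):
    """Stage 2: serialize h_t = {(q_1, r_1, c_1), ..., q_t} as prompt text.

    The bracketed confidence tag is a hypothetical format.
    """
    lines = []
    for i, turn in enumerate(turns, start=1):
        lines.append(f"Q{i}: {turn.question}")
        lines.append(f"A{i}: {turn.response} [confidence: {turn.confidence:.2f}]")
    lines.append(f"Q{len(turns) + 1}: {next_question}")
    return "\n".join(lines)


def carg_step(generate, turns, question):
    """Stage 3: produce r_t from the confidence-annotated history.

    `generate(prompt)` must return (response_text, answer_token_logprobs).
    Returns a new history list with the scored turn appended.
    """
    prompt = render_history(turns, question)
    response, token_logprobs = generate(prompt)
    turn = Turn(question, response, aggregate_confidence(token_logprobs))
    return turns + [turn]


def fake_generate(prompt):
    """Stand-in for a real model call; always answers with fixed log-probs."""
    return "Paris", [math.log(0.9), math.log(0.9)]


history = carg_step(fake_generate, [], "What is the capital of France?")
history = carg_step(fake_generate, history, "Are you sure? I think it is Lyon.")
print(render_history(history, "Please give your final answer."))
```

Passing `generate` in as a parameter keeps the sketch independent of any particular model API; any backend that exposes per-token log-probabilities could be wrapped to fit that signature.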
CARG holds accuracy nearly constant across all 8 follow-up rounds and significantly outperforms the strongest baseline (gpt_default) with p < 0.001 on a paired t-test.
CARG reaches R8 accuracy = 0.7414, within 1.3 pp of its R1 accuracy (0.7543). All six paper baselines lose 10 to 35 pp over the same interval.
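The 1.3 pp figure follows directly from the two accuracies quoted above. A quick check in Python, where the variable names are illustrative only:

```python
# CARG accuracies quoted in the text.
r1_accuracy = 0.7543
r8_accuracy = 0.7414

# Drop from round 1 to round 8, expressed in percentage points.
drop_pp = round((r1_accuracy - r8_accuracy) * 100, 2)
print(drop_pp)
```

The drop comes out at about 1.29 pp, consistent with the "within 1.3 pp" claim.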
Accuracy across follow-up rounds 1 to 8, conditioned on a correct initial answer. CARG (solid orange) is within noise of its R1 value all the way out to R8, while baselines dip sharply. The dashed red line is the GPT-4o prompting baseline (constant mean).
If the CARG framework is useful in your research, please cite:
@inproceedings{li-etal-2025-firm,
title = "Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions",
author = "Li, Yubo and
Miao, Yidi and
Ding, Xueying and
Krishnan, Ramayya and
Padman, Rema",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-acl.347/",
doi = "10.18653/v1/2025.findings-acl.347",
pages = "6679--6700",
ISBN = "979-8-89176-256-5"
}