Intervention ยท Confidence-aware decoding From ACL 2025 Findings

CARG
Confidence-Aware Response Generation

Our proposed fix for multi-turn flip-flopping. CARG embeds token-level confidence scores into the dialogue history, so each answer is conditioned on how certain the model was about its earlier answers โ€” yielding near-flat accuracy across all 8 adversarial follow-up rounds.

Yubo Li ยท Yidi Miao ยท Xueying Ding ยท Ramayya Krishnan ยท Rema Padman
Carnegie Mellon University

Why CARG?

On the MT-Consistency benchmark, every non-reasoning LLM we tested loses between 10 and 35 percentage points of accuracy over 8 follow-up rounds when challenged adversarially. The flip happens because the model can't tell its firm answers apart from its uncertain ones โ€” the dialogue history is just text. CARG injects that missing signal.

โš™๏ธ How CARG works

A three-stage extension of the standard decoding loop: extract confidence, embed it in the history, and let the decoder read it back.

01

Confidence Extraction

Token-level log-probabilities of the answer span are aggregated into a single confidence score cโ‚œ for each response rโ‚œ.

02

Confidence Embedding

Confidence is appended to each turn of the history: hโ‚œ = {(qโ‚,rโ‚,cโ‚),โ€ฆ,(q_{t-1},r_{t-1},c_{t-1}),qโ‚œ}. Future responses condition on past certainty, not just past text.

03

Confidence-Guided Generation

The decoder explicitly reads the confidence trajectory and decides whether to reinforce its prior stance or re-evaluate โ€” rโ‚œ = argmax P(r | hโ‚œ, ฮธ, c_{t-1}).

Headline result

CARG maintains remarkably stable accuracy across all 8 follow-up rounds and significantly outperforms the strongest baseline (gpt_default) with p < 0.001 on a paired t-test.

CARG mean acc
0.7482
ฯƒ = 0.0058 ยท R1โ†’R8 stable
GPT-4o baseline
0.7134
ฯƒ = 0.0157

CARG reaches R8 accuracy = 0.7414, within 1.3 pp of its R1 accuracy (0.7543). All six paper baselines lose 10โ€“35 pp over the same interval.

๐Ÿ“‰ Round-by-Round Accuracy

Accuracy across follow-up rounds 1 โ†’ 8, conditioned on a correct initial answer. CARG (solid orange) is within noise of its R1 value all the way out to R8, while baselines dip sharply. The dashed red line is the GPT-4o prompting baseline (constant mean).

Accuracy across 8 follow-up rounds

Click any legend entry to toggle a series. Mistral / Llama-3.3 / Qwen-2.5 / Llama-4 are hidden by default โ€” click to reveal.

Citation

If the CARG framework is useful in your research, please cite:

@inproceedings{li-etal-2025-firm, title = "Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions", author = "Li, Yubo and Miao, Yidi and Ding, Xueying and Krishnan, Ramayya and Padman, Rema", editor = "Che, Wanxiang and Nabende, Joyce and Shutova, Ekaterina and Pilehvar, Mohammad Taher", booktitle = "Findings of the Association for Computational Linguistics: ACL 2025", month = jul, year = "2025", address = "Vienna, Austria", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2025.findings-acl.347/", doi = "10.18653/v1/2025.findings-acl.347", pages = "6679--6700", ISBN = "979-8-89176-256-5" }