CARG — Confidence-Aware Response Generation

Why CARG?

On the MT-Consistency benchmark, every non-reasoning LLM we tested loses between 10 and 35 percentage points of accuracy over 8 follow-up rounds when challenged adversarially. These answer flips happen because the model cannot tell its firm answers from its uncertain ones: the dialogue history is plain text and carries no record of how confident each earlier answer was. CARG injects that missing signal.

⚙️ How CARG works

CARG is a three-stage extension of the standard decoding loop: extract confidence, embed it in the history, and let the decoder read it back.

Confidence Extraction

Token-level log-probabilities of the answer span are aggregated into a single confidence score cₜ for each response rₜ.
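
As a rough illustration of this step, the sketch below collapses an answer span's token log-probabilities into one score. The aggregation rule is an assumption: this page does not say which one CARG uses, so the example takes the length-normalised (geometric-mean) token probability. The function name and the sample log-probabilities are hypothetical.

```python
import math
from typing import Sequence


def answer_confidence(token_logprobs: Sequence[float]) -> float:
    """Collapse the answer span's token log-probabilities into one score in (0, 1].

    Assumption: geometric-mean token probability, exp(mean(log p)).
    The aggregation actually used by CARG is not specified on this page.
    """
    if not token_logprobs:
        raise ValueError("answer span has no tokens")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


# Hypothetical log-probabilities for a firm and a hesitant answer span.
firm = answer_confidence([-0.01, -0.02, -0.03])
hesitant = answer_confidence([-0.9, -1.2, -0.6])
print(f"firm={firm:.3f} hesitant={hesitant:.3f}")
```

Normalising by span length keeps long answers from being penalised simply for containing more tokens; a plain product of token probabilities would shrink with every extra token.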

Confidence Embedding

Confidence is appended to each turn of the history: hₜ = {(q₁, r₁, c₁), …, (qₜ₋₁, rₜ₋₁, cₜ₋₁), qₜ}. Future responses condition on past certainty, not just past text.
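
One way to picture the history hₜ is as a list of (question, response, confidence) triples that is serialised together with the current question. The sketch below is a minimal illustration only: the `Turn` class, the `render_history` helper, and the `[confidence: ...]` tag format are all hypothetical choices, not the encoding CARG itself uses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    question: str      # q_i
    response: str      # r_i
    confidence: float  # c_i, assumed to lie in [0, 1]


def render_history(past: List[Turn], question: str) -> str:
    """Serialise h_t: every past (q, r, c) triple followed by the current question.

    The "[confidence: ...]" tag is a hypothetical format chosen for illustration.
    """
    lines = []
    for turn in past:
        lines.append(f"User: {turn.question}")
        lines.append(f"Assistant: {turn.response} [confidence: {turn.confidence:.2f}]")
    lines.append(f"User: {question}")
    return "\n".join(lines)


past = [Turn("What is 7 * 8?", "56", 0.97)]
print(render_history(past, "Are you sure? I think it is 54."))
```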

Confidence-Guided Generation

The decoder explicitly reads the confidence trajectory and decides whether to reinforce its prior stance or re-evaluate it: rₜ = argmaxᵣ P(r | hₜ, θ, cₜ₋₁).
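
The loop below sketches how a single turn might be wired together under stated assumptions. The `generate` callable is a hypothetical model interface that returns a response plus the log-probabilities of its answer tokens, the instruction text and the `[confidence: ...]` tag are invented for illustration, and the confidence is aggregated as a geometric-mean token probability. A stub stands in for the LLM so the example runs offline; it does not reproduce the real decoder.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical model interface: prompt in, (response text, answer-token log-probs) out.
Generate = Callable[[str], Tuple[str, Sequence[float]]]
History = List[Tuple[str, str, float]]  # (q_i, r_i, c_i)


def carg_step(history: History, question: str, generate: Generate) -> History:
    """Run one turn: condition on past confidences, respond, then score the response."""
    lines = [
        "Each earlier answer carries the confidence it was given with. "
        "Keep high-confidence answers unless the user supplies new evidence; "
        "re-evaluate low-confidence ones."
    ]
    for q, r, c in history:
        lines += [f"User: {q}", f"Assistant: {r} [confidence: {c:.2f}]"]
    lines.append(f"User: {question}")
    response, logprobs = generate("\n".join(lines))
    # Assumed aggregation: geometric-mean token probability.
    confidence = math.exp(sum(logprobs) / len(logprobs))
    return history + [(question, response, confidence)]


def stub_model(prompt: str) -> Tuple[str, Sequence[float]]:
    """Stand-in for an LLM so the sketch runs offline; always answers "56"."""
    return "56", [-0.05, -0.05]


history: History = []
for q in ["What is 7 * 8?", "Are you sure? I think it is 54."]:
    history = carg_step(history, q, stub_model)
print(history)
```

In this sketch the reinforce-or-re-evaluate decision is left to the model, which sees the confidence tags in its prompt; nothing in the code hard-codes a threshold.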

Headline result

CARG holds accuracy nearly constant across all 8 follow-up rounds and significantly outperforms the strongest baseline (gpt_default), with p < 0.001 on a paired t-test.

CARG mean accuracy: 0.7482 (σ = 0.0058; stable from R1 to R8)

GPT-4o baseline: 0.7134 (σ = 0.0157)

CARG reaches R8 accuracy = 0.7414, within 1.3 pp of its R1 accuracy (0.7543). All six paper baselines lose 10–35 pp over the same interval.
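
The percentage-point figures can be checked directly from the numbers quoted on this page. The snippet below is only a small arithmetic check; the variable names and the `to_pp` helper are introduced here for illustration.

```python
# Accuracy figures quoted on this page.
carg_r1, carg_r8 = 0.7543, 0.7414
carg_mean, gpt4o_mean = 0.7482, 0.7134


def to_pp(delta: float) -> float:
    """Convert an accuracy difference on a 0-1 scale to percentage points."""
    return round(delta * 100, 2)


carg_drop_pp = to_pp(carg_r1 - carg_r8)      # CARG's loss from R1 to R8
mean_gap_pp = to_pp(carg_mean - gpt4o_mean)  # CARG's lead over the GPT-4o baseline mean
print(carg_drop_pp, mean_gap_pp)
```

CARG's R1 to R8 drop comes to about 1.29 pp, consistent with the "within 1.3 pp" statement, and its mean accuracy sits about 3.48 pp above the GPT-4o baseline mean.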

📉 Round-by-Round Accuracy

Accuracy across follow-up rounds 1 → 8, conditioned on a correct initial answer. CARG (solid orange) is within noise of its R1 value all the way out to R8, while baselines dip sharply. The dashed red line is the GPT-4o prompting baseline (constant mean).

[Figure: Accuracy across 8 follow-up rounds. Plotted series include CARG, the GPT-4o baseline, Mistral, Llama-3.3, Qwen-2.5, and Llama-4.]

Citation

If the CARG framework is useful in your research, please cite:

@inproceedings{li-etal-2025-firm,
    title     = "Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions",
    author    = "Li, Yubo  and
                 Miao, Yidi  and
                 Ding, Xueying  and
                 Krishnan, Ramayya  and
                 Padman, Rema",
    editor    = "Che, Wanxiang  and
                 Nabende, Joyce  and
                 Shutova, Ekaterina  and
                 Pilehvar, Mohammad Taher",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
    month     = jul,
    year      = "2025",
    address   = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url       = "https://aclanthology.org/2025.findings-acl.347/",
    doi       = "10.18653/v1/2025.findings-acl.347",
    pages     = "6679--6700",
    ISBN      = "979-8-89176-256-5"
}