Confidence-Based Risk Management

Detection, coverage, and abstention for reliable LLM deployment (§7)

Overview

Deployment requires managing risk: detecting when models are wrong or hallucinating, quantifying uncertainty with formal guarantees, and deciding when to abstain rather than guess. Confidence is the primary signal for all three.

Three complementary objectives drive this section: (1) calibration — confidence should reflect true accuracy, (2) selective reliability — using confidence to flag uncertain predictions, and (3) conformal coverage — formal guarantees that predictions or abstentions cover the truth.

Actionable Signals

Before detecting failures, we must first elicit reliable confidence signals from models.

Kadavath et al. · 2022
Asks models directly for P(True) or P(IK) ("I know"), using self-reported confidence as a reliability signal.
Self Unit: answer/query Role: elicit Access: Pr
Tian et al. · EMNLP 2023
Elicits verbal confidence statements; calibrates them to predict accuracy on held-out tasks.
Self Unit: answer Role: elicit, calibrate Access: Pr
Ji et al. · 2025
Detects mismatch between verbalized uncertainty and semantic representations in hidden states.
Mechanistic Unit: hidden state Role: detect, steer Access: WB
Joo et al. · 2025
Measures consistency between uncertain expressions in text and token-level confidence signals.
Self Unit: answer Role: detect Access: BB
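Once elicited, these confidence signals can be checked against accuracy on held-out data. As a toy illustration (not any one paper's protocol), a minimal expected calibration error (ECE) computation over elicited confidences:

```python
# Minimal expected calibration error (ECE) sketch for elicited confidences.
# All data below is toy; in practice the pairs come from a held-out set of
# model answers with graded correctness.

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence; weight |accuracy - avg confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # clamp c == 1.0 into last bin
        bins[idx].append((c, y))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(acc - avg_conf)
    return ece

# Perfectly calibrated toy case: 80% confidence, 4/5 correct.
confs = [0.8, 0.8, 0.8, 0.8, 0.8]
hits = [1, 1, 1, 1, 0]
print(round(expected_calibration_error(confs, hits), 3))  # 0.0
```

A low ECE means stated confidence tracks empirical accuracy, which is exactly what calibration-oriented elicitation methods optimize for.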

Hallucination Detection

Using confidence-based signals to detect when models generate false or unsupported claims.

Manakul et al. · EMNLP 2023
Samples multiple responses and flags sentences where agreement is low as potentially hallucinated.
Self Unit: sentence Role: detect Access: BB, MS
Farquhar et al. · Nature 2024
Clusters semantic meanings of samples and uses entropy over clusters to detect hallucinations.
Self Unit: answer set Role: detect Access: MS
Han et al. · ICML Workshop 2024
Uses mechanistic probes to compute semantic entropy without sampling; detects uncertainty efficiently.
Mechanistic Unit: hidden state Role: detect Access: WB
Chen et al. · ICLR 2024
Uses eigenvalue-based scoring of activation matrices to detect hallucinations without sampling.
Mechanistic Unit: hidden state Role: detect Access: WB
Azaria & Mitchell · EMNLP Findings 2023
Trains probes on internal states to detect whether models believe their own statements (truthfulness).
Mechanistic Unit: statement Role: detect Access: WB
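As an illustrative sketch of the sampling-based detectors above, the following groups sampled answers into clusters and scores entropy over the cluster distribution. The papers cluster by bidirectional entailment; exact string matching stands in here for simplicity, and all samples are toy values:

```python
import math
from collections import Counter

# Cluster-entropy sketch in the spirit of semantic-entropy detection:
# sample several answers, group semantically equivalent ones, and compute
# entropy over the resulting cluster distribution. High entropy suggests
# the model has no stable answer, i.e. a likely hallucination.

def cluster_entropy(samples):
    counts = Counter(samples)  # exact-match stand-in for semantic clustering
    n = len(samples)
    probs = [c / n for c in counts.values()]
    return -sum(p * math.log(p) for p in probs)

# Consistent answers -> low entropy; scattered answers -> high entropy.
print(cluster_entropy(["Paris"] * 5) == 0.0)                       # True
print(cluster_entropy(["Paris", "Lyon", "Nice", "Paris", "Rome"]) > 1.0)  # True
```

Flagging outputs whose entropy exceeds a tuned threshold recovers the basic detection loop these methods share; the probe-based variants estimate the same quantity without drawing samples.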

Conformal Guarantees

Formal coverage guarantees: confidence-based prediction sets or filtering that maintain user-specified error rates.

Kumar et al. · 2023
Applies conformal prediction to multiple-choice QA; outputs sets of answers with coverage guarantees.
Self Unit: answer set Role: set-predict Access: CP
Quach et al. · ICLR 2024
Uses conformal prediction with sampling and rejection to construct output sets with formal coverage.
Self Unit: output set Role: set-predict, filter Access: MS, CP
Mohri & Hashimoto · ICML 2024
Filters claims using conformal entailment scores; guarantees factuality at desired coverage level.
Hybrid Unit: claim Role: filter, back off Access: CP
Cherian et al. · NeurIPS 2024
Extends conformal prediction with conditional guarantees on claim-level factuality.
Hybrid Unit: claim Role: filter Access: CP
Rubin-Toles et al. · ICLR 2025
Applies conformal prediction to claim graphs, maintaining coherence and coverage guarantees.
Hybrid Unit: claim graph Role: filter Access: CP
Su et al. · EMNLP Findings 2024
Implements conformal prediction using only API access by leveraging sample frequency and similarity.
Hybrid Unit: answer set Role: set-predict Access: API, MS, CP
Xi et al. · TMLR 2025
Calibrates temperature scaling via conformal methods, ensuring prediction set coverage.
Self Unit: prediction set Role: calibrate Access: CP
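The common recipe behind these methods is split conformal prediction: calibrate a nonconformity threshold on held-out data, then include every candidate that scores within it. A minimal sketch with toy scores (not any specific paper's scoring function):

```python
import math

# Split conformal prediction sketch for multiple-choice QA. Nonconformity is
# 1 - (model probability of the true option) on a calibration set; at test
# time, include every option whose nonconformity is within the calibrated
# quantile. All scores below are toy values.

def conformal_quantile(cal_scores, alpha=0.1):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank needed for coverage >= 1 - alpha
    return sorted(cal_scores)[min(k, n) - 1]

def prediction_set(option_probs, qhat):
    """All options whose nonconformity (1 - prob) is within the threshold."""
    return {opt for opt, p in option_probs.items() if 1 - p <= qhat}

# Calibration: nonconformity of the correct option on held-out examples.
cal = [0.1, 0.2, 0.15, 0.3, 0.25, 0.05, 0.4, 0.35, 0.2, 0.1]
qhat = conformal_quantile(cal, alpha=0.2)

test_probs = {"A": 0.70, "B": 0.20, "C": 0.07, "D": 0.03}
print(prediction_set(test_probs, qhat))  # {'A'}
```

Coverage holds regardless of how good the underlying probabilities are, provided calibration and test examples are exchangeable; better scores simply yield smaller sets.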

Abstention & Alignment

Training or adapting models to abstain when uncertain, improving reliability and alignment with human expectations.

Ren et al. · NeurIPS Workshop 2023
Models learn to evaluate their own answers token-by-token, signaling abstention when unsure.
Self Unit: answer Role: abstain, select Access: Pr
Chen et al. · EMNLP Findings 2023
Fine-tunes models to improve self-evaluation signals specific to a task, enabling better abstention.
Self Unit: answer Role: abstain Access: PE
Cole et al. · EMNLP 2023
Uses repetition and multi-sample agreement to detect ambiguous questions; abstains for uncertain cases.
Self Unit: query/answer set Role: abstain Access: MS
Zhang et al. · NAACL 2024
Fine-tunes models to recognize knowledge boundaries; abstains for out-of-distribution queries.
Self Unit: query/answer Role: abstain, align Access: FT
Tjandra et al. · NeurIPS Workshop 2024
Fine-tunes models to produce high semantic entropy when uncertain, enabling abstention.
Self Unit: answer Role: abstain, align Access: FT, MS
Li et al. · NeurIPS 2025
Calibrates verbal confidence via token-level Brier score loss, improving alignment with true accuracy.
Self Unit: answer Role: calibrate Access: FT
An & Xu · 2025
Uses RL with semantic clustering rewards to train models that abstain when sample consensus is low.
Self Unit: answer sample Role: abstain, align Access: RL, MS
Huang et al. · UncertaiNLP 2025
Uses feed-forward network activation patterns to detect uncertainty; enables abstention without retraining.
Mechanistic Unit: answer Role: abstain Access: WB
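A minimal sketch of the abstention mechanics these methods share: threshold a confidence score and report coverage (fraction answered) alongside selective accuracy (accuracy on the answered subset). Confidence/correctness pairs are toy values, not any paper's scores:

```python
# Selective prediction: abstain when confidence falls below a threshold,
# then measure what fraction of queries we still answer (coverage) and how
# accurate we are on that answered subset (selective accuracy).

def selective_stats(confidences, correct, threshold):
    answered = [(c, y) for c, y in zip(confidences, correct) if c >= threshold]
    coverage = len(answered) / len(confidences)
    accuracy = (sum(y for _, y in answered) / len(answered)) if answered else None
    return coverage, accuracy

confs = [0.95, 0.90, 0.60, 0.55, 0.30]
hits = [1, 1, 1, 0, 0]
print(selective_stats(confs, hits, threshold=0.8))  # (0.4, 1.0)
```

The fine-tuning approaches above effectively train the model to make this confidence score honest, so that a fixed threshold separates answerable from unanswerable queries.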

Summary Table

Method | Source | Signal | Unit | Role | Access
Actionable Signals
P(True) / P(IK) | Self | Verbal confidence | Answer/query | Elicit | Pr
Just Ask | Self | Verbal confidence | Answer | Elicit, calibrate | Pr
VUF / MUC | Mechanistic | Verbal-semantic mismatch | Hidden state | Detect, steer | WB
UExpr Consistency | Self | Expression consistency | Answer | Detect | BB
Hallucination Detection
SelfCheckGPT | Self | Sample agreement | Sentence | Detect | BB, MS
Semantic Entropy | Self | Semantic clustering entropy | Answer set | Detect | MS
SEPs | Mechanistic | Probed entropy | Hidden state | Detect | WB
INSIDE | Mechanistic | EigenScore | Hidden state | Detect | WB
Internal State | Mechanistic | Truth probe | Statement | Detect | WB
Conformal Guarantees
MCQA CP | Self | Option nonconformity | Answer set | Set-predict | CP
Conformal LM | Self | Sampling + rejection | Output set | Set-predict, filter | MS, CP
Conformal Factuality | Hybrid | Entailment scores | Claim | Filter, back off | CP
Enhanced CP | Hybrid | Conditional factuality | Claim | Filter | CP
Coherent Factuality | Hybrid | Graph conformal | Claim graph | Filter | CP
API-Only CP | Hybrid | Sample frequency | Answer set | Set-predict | API, MS, CP
ConfTS | Self | Conformal calibration | Prediction set | Calibrate | CP
Abstention & Alignment
Self-Evaluation | Self | Token self-eval | Answer | Abstain, select | Pr
Adapt with Self-Eval | Self | Adapted self-eval | Answer | Abstain | PE
Selective Ambiguous QA | Self | Repetition/agreement | Query/answer set | Abstain | MS
R-Tuning | Self | Knowledge boundary | Query/answer | Abstain, align | FT
SemEnt FT | Self | Semantic entropy | Answer | Abstain, align | FT, MS
ConfTuner | Self | Verbal confidence | Answer | Calibrate | FT
FISCORE | Self | Semantic consensus | Answer sample | Abstain, align | RL, MS
Activation-Based Abstention | Mechanistic | FFN activation | Answer | Abstain | WB

Access legend: Pr = prompt, BB = black-box, PE = prompt engineering, FT = fine-tuned, WB = white-box, MS = multiple samples, CP = conformal prediction, RL = reinforcement learning, API = API access

Discussion

Risk management with confidence has three complementary objectives that are often in tension:

  • Calibration: Confidence should match true accuracy. Methods like Just Ask, ConfTuner, and ConfTS aim for this directly. But calibration alone doesn't prevent mistakes on OOD data.
  • Selective reliability: Flag uncertain predictions and abstain on them. SelfCheckGPT, Semantic Entropy, and R-Tuning excel here. But flagging too many cases reduces utility.
  • Conformal coverage: Formal guarantees (e.g., "this set contains the truth with probability ≥ 0.9") that are distribution-free, assuming calibration and test data are exchangeable. Conformal methods provide this. But the guarantee is only useful when the nonconformity scores are informative enough to keep prediction sets small.

Best practice combines these: (1) calibrate confidence via fine-tuning or prompt-based elicitation, (2) detect hallucinations via semantic entropy or probes, (3) apply conformal filtering for formal coverage guarantees, and (4) train models to abstain gracefully when uncertain. No single confidence source handles all three objectives equally well.
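The tension between selective reliability and utility can be made concrete by sweeping the abstention threshold and tracing the resulting risk-coverage tradeoff; a toy sketch (values are illustrative, not from any benchmark):

```python
# Toy risk-coverage sweep: as the abstention threshold rises, coverage drops
# and selective risk (error rate on the answered subset) typically falls,
# illustrating the reliability/utility tension described above.

def risk_coverage(confidences, correct, thresholds):
    curve = []
    for t in thresholds:
        answered = [y for c, y in zip(confidences, correct) if c >= t]
        cov = len(answered) / len(confidences)
        risk = 1 - sum(answered) / len(answered) if answered else 0.0
        curve.append((t, round(cov, 2), round(risk, 2)))
    return curve

confs = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]
hits = [1, 1, 1, 0, 1, 0, 0, 0]
for t, cov, risk in risk_coverage(confs, hits, [0.0, 0.5, 0.75]):
    print(f"threshold={t} coverage={cov} risk={risk}")
```

Deployments typically pick the operating point on this curve that meets a target risk, then hand the abstained cases to a fallback (a larger model or a human).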