Confidence-Based Risk Management

Detection, coverage, and abstention for reliable LLM deployment (§7)

Overview

Deployment requires managing risk: detecting when models are wrong or hallucinating, quantifying uncertainty with formal guarantees, and deciding when to abstain rather than guess. Confidence is the primary signal for all three.

Three complementary objectives drive this section: (1) calibration — confidence should reflect true accuracy, (2) selective reliability — using confidence to flag uncertain predictions, and (3) conformal coverage — formal guarantees that predictions or abstentions cover the truth.

Actionable Signals

Before detecting failures, we must first elicit reliable confidence signals from models.

Kadavath et al. · 2022
Asks models directly for P(True) or P(IK) ("I know"), using self-reported confidence as a reliability signal.
Self Unit: answer/query Role: elicit Access: Pr
Tian et al. · EMNLP 2023
Elicits verbal confidence statements; calibrates them to predict accuracy on held-out tasks.
Self Unit: answer Role: elicit, calibrate Access: Pr
Ji et al. · 2025
Detects mismatch between verbalized uncertainty and semantic representations in hidden states.
Mechanistic Unit: hidden state Role: detect, steer Access: WB
Joo et al. · 2025
Measures consistency between uncertain expressions in text and token-level confidence signals.
Self Unit: answer Role: detect Access: BB
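Once elicited, these confidence signals can be checked against accuracy on held-out data. As a toy illustration (not any one paper's protocol), a minimal expected calibration error (ECE) computation over elicited confidences:

```python
# Minimal expected calibration error (ECE) sketch for elicited confidences.
# All data below is toy; in practice the pairs come from a held-out set of
# model answers with graded correctness.

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence; weight |accuracy - avg confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # clamp c == 1.0 into last bin
        bins[idx].append((c, y))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(acc - avg_conf)
    return ece

# Perfectly calibrated toy case: 80% confidence, 4/5 correct.
confs = [0.8, 0.8, 0.8, 0.8, 0.8]
hits = [1, 1, 1, 1, 0]
print(round(expected_calibration_error(confs, hits), 3))  # 0.0
```

A low ECE means stated confidence tracks empirical accuracy, which is exactly what calibration-oriented elicitation methods optimize for.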

Hallucination Detection

Using confidence-based signals to detect when models generate false or unsupported claims.

Manakul et al. · EMNLP 2023
Samples multiple responses and flags sentences where agreement is low as potentially hallucinated.
Self Unit: sentence Role: detect Access: BB, MS
Farquhar et al. · Nature 2024
Clusters semantic meanings of samples and uses entropy over clusters to detect hallucinations.
Self Unit: answer set Role: detect Access: MS
Han et al. · ICML Workshop 2024
Uses mechanistic probes to compute semantic entropy without sampling; detects uncertainty efficiently.
Mechanistic Unit: hidden state Role: detect Access: WB
Chen et al. · ICLR 2024
Uses eigenvalue-based scoring of activation matrices to detect hallucinations without sampling.
Mechanistic Unit: hidden state Role: detect Access: WB
Azaria & Mitchell · EMNLP Findings 2023
Trains probes on internal states to detect whether models believe their own statements (truthfulness).
Mechanistic Unit: statement Role: detect Access: WB
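As an illustrative sketch of the sampling-based detectors above, the following groups sampled answers into clusters and scores entropy over the cluster distribution. The papers cluster by bidirectional entailment; exact string matching stands in here for simplicity, and all samples are toy values:

```python
import math
from collections import Counter

# Cluster-entropy sketch in the spirit of semantic-entropy detection:
# sample several answers, group semantically equivalent ones, and compute
# entropy over the resulting cluster distribution. High entropy suggests
# the model has no stable answer, i.e. a likely hallucination.

def cluster_entropy(samples):
    counts = Counter(samples)  # exact-match stand-in for semantic clustering
    n = len(samples)
    probs = [c / n for c in counts.values()]
    return -sum(p * math.log(p) for p in probs)

# Consistent answers -> low entropy; scattered answers -> high entropy.
print(cluster_entropy(["Paris"] * 5) == 0.0)                       # True
print(cluster_entropy(["Paris", "Lyon", "Nice", "Paris", "Rome"]) > 1.0)  # True
```

Flagging outputs whose entropy exceeds a tuned threshold recovers the basic detection loop these methods share; the probe-based variants estimate the same quantity without drawing samples.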

Conformal Guarantees

Formal coverage guarantees: confidence-based prediction sets or filtering that maintain user-specified error rates.

Kumar et al. · 2023
Applies conformal prediction to multiple-choice QA; outputs sets of answers with coverage guarantees.
Self Unit: answer set Role: set-predict Access: CP
Quach et al. · ICLR 2024
Uses conformal prediction with sampling and rejection to construct output sets with formal coverage.
Self Unit: output set Role: set-predict, filter Access: MS, CP
Mohri & Hashimoto · ICML 2024
Filters claims using conformal entailment scores; guarantees factuality at desired coverage level.
Hybrid Unit: claim Role: filter, back off Access: CP
Cherian et al. · NeurIPS 2024
Extends conformal prediction with conditional guarantees on claim-level factuality.
Hybrid Unit: claim Role: filter Access: CP
Rubin-Toles et al. · ICLR 2025
Applies conformal prediction to claim graphs, maintaining coherence and coverage guarantees.
Hybrid Unit: claim graph Role: filter Access: CP
Su et al. · EMNLP Findings 2024
Implements conformal prediction using only API access by leveraging sample frequency and similarity.
Hybrid Unit: answer set Role: set-predict Access: API, MS, CP
Xi et al. · TMLR 2025
Calibrates temperature scaling via conformal methods, ensuring prediction set coverage.
Self Unit: prediction set Role: calibrate Access: CP
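The common recipe behind these methods is split conformal prediction: calibrate a nonconformity threshold on held-out data, then include every candidate that scores within it. A minimal sketch with toy scores (not any specific paper's scoring function):

```python
import math

# Split conformal prediction sketch for multiple-choice QA. Nonconformity is
# 1 - (model probability of the true option) on a calibration set; at test
# time, include every option whose nonconformity is within the calibrated
# quantile. All scores below are toy values.

def conformal_quantile(cal_scores, alpha=0.1):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank needed for coverage >= 1 - alpha
    return sorted(cal_scores)[min(k, n) - 1]

def prediction_set(option_probs, qhat):
    """All options whose nonconformity (1 - prob) is within the threshold."""
    return {opt for opt, p in option_probs.items() if 1 - p <= qhat}

# Calibration: nonconformity of the correct option on held-out examples.
cal = [0.1, 0.2, 0.15, 0.3, 0.25, 0.05, 0.4, 0.35, 0.2, 0.1]
qhat = conformal_quantile(cal, alpha=0.2)

test_probs = {"A": 0.70, "B": 0.20, "C": 0.07, "D": 0.03}
print(prediction_set(test_probs, qhat))  # {'A'}
```

Coverage holds regardless of how good the underlying probabilities are, provided calibration and test examples are exchangeable; better scores simply yield smaller sets.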

Abstention & Alignment

Training or adapting models to abstain when uncertain, improving reliability and alignment with human expectations.

Ren et al. · NeurIPS Workshop 2023
Models learn to evaluate their own answers token-by-token, signaling abstention when unsure.
Self Unit: answer Role: abstain, select Access: Pr
Chen et al. · EMNLP Findings 2023
Fine-tunes models to improve self-evaluation signals specific to a task, enabling better abstention.
Self Unit: answer Role: abstain Access: PE
Cole et al. · EMNLP 2023
Uses repetition and multi-sample agreement to detect ambiguous questions; abstains for uncertain cases.
Self Unit: query/answer set Role: abstain Access: MS
Zhang et al. · NAACL 2024
Fine-tunes models to recognize knowledge boundaries; abstains for out-of-distribution queries.
Self Unit: query/answer Role: abstain, align Access: FT
Tjandra et al. · NeurIPS Workshop 2024
Fine-tunes models to produce high semantic entropy when uncertain, enabling abstention.
Self Unit: answer Role: abstain, align Access: FT, MS
Li et al. · NeurIPS 2025
Calibrates verbal confidence via token-level Brier score loss, improving alignment with true accuracy.
Self Unit: answer Role: calibrate Access: FT
An & Xu · 2025
Uses RL with semantic clustering rewards to train models that abstain when sample consensus is low.
Self Unit: answer sample Role: abstain, align Access: RL, MS
Huang et al. · UncertaiNLP 2025
Uses feed-forward network activation patterns to detect uncertainty; enables abstention without retraining.
Mechanistic Unit: answer Role: abstain Access: WB
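A minimal sketch of the abstention mechanics these methods share: threshold a confidence score and report coverage (fraction answered) alongside selective accuracy (accuracy on the answered subset). Confidence/correctness pairs are toy values, not any paper's scores:

```python
# Selective prediction: abstain when confidence falls below a threshold,
# then measure what fraction of queries we still answer (coverage) and how
# accurate we are on that answered subset (selective accuracy).

def selective_stats(confidences, correct, threshold):
    answered = [(c, y) for c, y in zip(confidences, correct) if c >= threshold]
    coverage = len(answered) / len(confidences)
    accuracy = (sum(y for _, y in answered) / len(answered)) if answered else None
    return coverage, accuracy

confs = [0.95, 0.90, 0.60, 0.55, 0.30]
hits = [1, 1, 1, 0, 0]
print(selective_stats(confs, hits, threshold=0.8))  # (0.4, 1.0)
```

The fine-tuning approaches above effectively train the model to make this confidence score honest, so that a fixed threshold separates answerable from unanswerable queries.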

Summary Table

Method | Source | Signal | Unit | Role | Access
Actionable Signals
P(True) / P(IK) | Self | Verbal confidence | Answer/query | Elicit | Pr
Just Ask | Self | Verbal confidence | Answer | Elicit, calibrate | Pr
VUF / MUC | Mechanistic | Verbal-semantic mismatch | Hidden state | Detect, steer | WB
UExpr Consistency | Self | Expression consistency | Answer | Detect | BB
Hallucination Detection
SelfCheckGPT | Self | Sample agreement | Sentence | Detect | BB, MS
Semantic Entropy | Self | Semantic clustering entropy | Answer set | Detect | MS
SEPs | Mechanistic | Probed entropy | Hidden state | Detect | WB
INSIDE | Mechanistic | EigenScore | Hidden state | Detect | WB
Internal State | Mechanistic | Truth probe | Statement | Detect | WB
Conformal Guarantees
MCQA CP | Self | Option nonconformity | Answer set | Set-predict | CP
Conformal LM | Self | Sampling + rejection | Output set | Set-predict, filter | MS, CP
Conformal Factuality | Hybrid | Entailment scores | Claim | Filter, back off | CP
Enhanced CP | Hybrid | Conditional factuality | Claim | Filter | CP
Coherent Factuality | Hybrid | Graph conformal | Claim graph | Filter | CP
API-Only CP | Hybrid | Sample frequency | Answer set | Set-predict | API, MS, CP
ConfTS | Self | Conformal calibration | Prediction set | Calibrate | CP
Abstention & Alignment
Self-Evaluation | Self | Token self-eval | Answer | Abstain, select | Pr
Adapt with Self-Eval | Self | Adapted self-eval | Answer | Abstain | PE
Selective Ambiguous QA | Self | Repetition/agreement | Query/answer set | Abstain | MS
R-Tuning | Self | Knowledge boundary | Query/answer | Abstain, align | FT
SemEnt FT | Self | Semantic entropy | Answer | Abstain, align | FT, MS
ConfTuner | Self | Verbal confidence | Answer | Calibrate | FT
FISCORE | Self | Semantic consensus | Answer sample | Abstain, align | RL, MS
Activation-Based Abstention | Mechanistic | FFN activation | Answer | Abstain | WB

Access legend: Pr = prompt, BB = black-box, PE = prompt engineering, FT = fine-tuned, WB = white-box, MS = multiple samples, CP = conformal prediction, RL = reinforcement learning, API = API access

Discussion

Risk management with confidence has three complementary objectives that are often in tension:

  • Calibration: Confidence should match true accuracy. Methods like Just Ask, ConfTuner, and ConfTS aim for this directly. But calibration alone doesn't prevent mistakes on OOD data.
  • Selective reliability: Flag uncertain predictions and abstain on them. SelfCheckGPT, Semantic Entropy, and R-Tuning excel here. But flagging too many cases reduces utility.
  • Conformal coverage: Formal guarantees (e.g., "this set contains the truth with probability ≥ 0.9") that are distribution-free, assuming calibration and test data are exchangeable. Conformal methods provide this. But the guarantee is only useful when the nonconformity scores are informative enough to keep prediction sets small.

Best practice combines these: (1) calibrate confidence via fine-tuning or prompt-based elicitation, (2) detect hallucinations via semantic entropy or probes, (3) apply conformal filtering for formal coverage guarantees, and (4) train models to abstain gracefully when uncertain. No single confidence source handles all three objectives equally well.
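The tension between selective reliability and utility can be made concrete by sweeping the abstention threshold and tracing the resulting risk-coverage tradeoff; a toy sketch (values are illustrative, not from any benchmark):

```python
# Toy risk-coverage sweep: as the abstention threshold rises, coverage drops
# and selective risk (error rate on the answered subset) typically falls,
# illustrating the reliability/utility tension described above.

def risk_coverage(confidences, correct, thresholds):
    curve = []
    for t in thresholds:
        answered = [y for c, y in zip(confidences, correct) if c >= t]
        cov = len(answered) / len(confidences)
        risk = 1 - sum(answered) / len(answered) if answered else 0.0
        curve.append((t, round(cov, 2), round(risk, 2)))
    return curve

confs = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]
hits = [1, 1, 1, 0, 1, 0, 0, 0]
for t, cov, risk in risk_coverage(confs, hits, [0.0, 0.5, 0.75]):
    print(f"threshold={t} coverage={cov} risk={risk}")
```

Deployments typically pick the operating point on this curve that meets a target risk, then hand the abstained cases to a fallback (a larger model or a human).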