Overview
Deployment requires managing risk: detecting when models are wrong or hallucinating, quantifying uncertainty with formal guarantees, and deciding when to abstain rather than guess. Confidence is the primary signal for all three.
Three complementary objectives drive this section: (1) calibration — confidence should reflect true accuracy, (2) selective reliability — using confidence to flag uncertain predictions, and (3) conformal coverage — formal guarantees that predictions or abstentions cover the truth.
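To make objective (1) concrete, here is a minimal sketch of expected calibration error (ECE), a standard way to measure the gap between stated confidence and true accuracy. The equal-width binning and bin count are conventional choices for illustration, not prescribed by any method in this section:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then take the weighted
    average gap between each bin's accuracy and its mean confidence."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Clamp so that conf == 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(h for _, h in members) / len(members)
            ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece

# A model that is 90% confident but only 50% accurate is miscalibrated.
print(expected_calibration_error([0.9, 0.9], [1, 0]))  # 0.4
```

A perfectly calibrated model scores 0; the other two objectives below then build on this signal.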
Actionable Signals
Before detecting failures, we must first elicit reliable confidence signals from models.
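For instance, P(True)-style elicitation appends a self-evaluation question and reads off the probability mass the model puts on "True". The prompt wording and the log-prob dictionary below are illustrative stand-ins, not any particular model API:

```python
import math

def p_true_prompt(question, proposed_answer):
    """Self-evaluation prompt in the spirit of P(True); exact wording varies."""
    return (
        f"Question: {question}\n"
        f"Proposed answer: {proposed_answer}\n"
        "Is the proposed answer correct? Reply 'True' or 'False': "
    )

def p_true(next_token_logprobs):
    """Confidence = normalized probability mass on 'True' vs 'False',
    given next-token log-probs (a stand-in for real model output)."""
    p_t = math.exp(next_token_logprobs.get("True", float("-inf")))
    p_f = math.exp(next_token_logprobs.get("False", float("-inf")))
    return p_t / (p_t + p_f)

print(p_true({"True": math.log(0.6), "False": math.log(0.2)}))  # ≈ 0.75
```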
Hallucination Detection
Using confidence-based signals to detect when models generate false or unsupported claims.
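A minimal sketch of the sample-agreement idea behind methods like SelfCheckGPT: resample several answers and score each claim by how consistently the samples support it. The word-overlap scorer below is a crude stand-in for the entailment/NLI model a real system would use:

```python
def support(sentence, sample):
    """Crude stand-in for an entailment scorer: fraction of the
    sentence's words that also appear in the resampled text."""
    words = {w.lower().strip(".,") for w in sentence.split()}
    sample_words = {w.lower().strip(".,") for w in sample.split()}
    return len(words & sample_words) / max(len(words), 1)

def hallucination_score(sentence, samples):
    """High score = the claim is not supported by most resampled
    generations, so it is likely hallucinated."""
    return 1.0 - sum(support(sentence, s) for s in samples) / len(samples)

samples = [
    "Marie Curie won two Nobel Prizes.",
    "Curie received Nobel Prizes in physics and chemistry.",
    "Marie Curie was awarded two Nobel Prizes.",
]
supported = hallucination_score("Marie Curie won two Nobel Prizes.", samples)
unsupported = hallucination_score("Marie Curie was born in Paris.", samples)
# The unsupported claim receives a clearly higher hallucination score.
```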
Conformal Guarantees
Formal coverage guarantees: confidence-based prediction sets or filtering that maintain user-specified error rates.
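The core recipe behind these guarantees is split conformal prediction: calibrate a nonconformity threshold on held-out labeled data so that prediction sets contain the truth at rate at least 1 − α. A minimal classification sketch, using the simple 1 − p score (the methods cited below use richer nonconformity measures):

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split conformal: nonconformity = 1 - probability assigned to the
    true label; threshold = the ceil((n+1)(1-alpha))-th smallest score."""
    scores = sorted(1.0 - probs[y] for probs, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = min(math.ceil((n + 1) * (1 - alpha)), n)
    return scores[k - 1]

def prediction_set(probs, threshold):
    """Keep every label whose nonconformity is within the threshold;
    coverage >= 1 - alpha holds for exchangeable data."""
    return {y for y, p in enumerate(probs) if 1.0 - p <= threshold}

cal_probs = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.3, 0.7]]
cal_labels = [0, 0, 0, 1]
t = conformal_threshold(cal_probs, cal_labels, alpha=0.1)
print(t, prediction_set([0.7, 0.3], t))  # 0.4 {0}
```

Note that the guarantee is marginal over the calibration and test data; better nonconformity scores shrink the sets but are not needed for coverage itself.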
Abstention & Alignment
Training or adapting models to abstain when uncertain, improving reliability and alignment with human expectations.
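A simple version of this behavior can also be bolted on post hoc: pick a confidence threshold on validation data so that answered queries stay under a target error rate, and abstain below it. The tuning rule here is illustrative; the fine-tuning methods in this section instead build abstention into the model itself:

```python
def tune_threshold(confidences, correct, target_risk=0.05):
    """Smallest threshold whose accepted subset has error <= target_risk
    on validation data; returns 1.0 (always abstain) if none qualifies."""
    for t in sorted(set(confidences)):
        kept = [h for c, h in zip(confidences, correct) if c >= t]
        if kept and 1.0 - sum(kept) / len(kept) <= target_risk:
            return t
    return 1.0

def answer_or_abstain(answer, confidence, threshold):
    """Abstain (return None) rather than guess below the threshold."""
    return answer if confidence >= threshold else None

t = tune_threshold([0.9, 0.8, 0.6, 0.4], [1, 1, 1, 0])
print(t, answer_or_abstain("Paris", 0.5, t))  # 0.6 None
```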
Summary Table
| Method | Source | Signal | Unit | Role | Access |
|---|---|---|---|---|---|
| **Actionable Signals** | | | | | |
| P(True) / P(IK) | Self | Verbal confidence | Answer/query | Elicit | Pr |
| Just Ask | Self | Verbal confidence | Answer | Elicit, calibrate | Pr |
| VUF / MUC | Mechanistic | Verbal-semantic mismatch | Hidden state | Detect, steer | WB |
| UExpr Consistency | Self | Expression consistency | Answer | Detect | BB |
| **Hallucination Detection** | | | | | |
| SelfCheckGPT | Self | Sample agreement | Sentence | Detect | BB, MS |
| Semantic Entropy | Self | Semantic clustering entropy | Answer set | Detect | MS |
| SEPs | Mechanistic | Probed entropy | Hidden state | Detect | WB |
| INSIDE | Mechanistic | EigenScore | Hidden state | Detect | WB |
| Internal State | Mechanistic | Truth probe | Statement | Detect | WB |
| **Conformal Guarantees** | | | | | |
| MCQA CP | Self | Option nonconformity | Answer set | Set-predict | CP |
| Conformal LM | Self | Sampling + rejection | Output set | Set-predict, filter | MS, CP |
| Conformal Factuality | Hybrid | Entailment scores | Claim | Filter, back off | CP |
| Enhanced CP | Hybrid | Conditional factuality | Claim | Filter | CP |
| Coherent Factuality | Hybrid | Graph conformal | Claim graph | Filter | CP |
| API-Only CP | Hybrid | Sample frequency | Answer set | Set-predict | API, MS, CP |
| ConfTS | Self | Conformal calibration | Prediction set | Calibrate | CP |
| **Abstention & Alignment** | | | | | |
| Self-Evaluation | Self | Token self-eval | Answer | Abstain, select | Pr |
| Adapt with Self-Eval | Self | Adapted self-eval | Answer | Abstain | PE |
| Selective Ambiguous QA | Self | Repetition/agreement | Query/answer set | Abstain | MS |
| R-Tuning | Self | Knowledge boundary | Query/answer | Abstain, align | FT |
| SemEnt FT | Self | Semantic entropy | Answer | Abstain, align | FT, MS |
| ConfTuner | Self | Verbal confidence | Answer | Calibrate | FT |
| FISCORE | Self | Semantic consensus | Answer sample | Abstain, align | RL, MS |
| Activation-Based Abstention | Mechanistic | FFN activation | Answer | Abstain | WB |
Access legend: Pr = prompt, BB = black-box, PE = prompt engineering, FT = fine-tuned, WB = white-box, MS = multiple samples, CP = conformal prediction, RL = reinforcement learning, API = API access
Discussion
Risk management with confidence has three complementary objectives that are often in tension:
- Calibration: Confidence should match true accuracy. Methods like Just Ask, ConfTuner, and ConfTS aim for this directly. But calibration alone doesn't prevent mistakes on out-of-distribution (OOD) data.
- Selective reliability: Flag uncertain predictions and abstain on them. SelfCheckGPT, Semantic Entropy, and R-Tuning excel here. But flagging too many cases reduces utility.
- Conformal coverage: Formal guarantees (e.g., "this set contains the truth with probability ≥ 0.9") that hold without distributional assumptions beyond exchangeability. Conformal methods provide this. But uninformative nonconformity scores yield large, unhelpful prediction sets.
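The selective-reliability signal behind Semantic Entropy can be sketched compactly: cluster resampled answers by meaning and take the entropy over cluster frequencies. The case-insensitive equality check below is a stand-in for the bidirectional-entailment test a real implementation uses:

```python
from math import log

def semantic_entropy(samples, same_meaning):
    """Greedily cluster samples by meaning, then compute entropy over
    cluster frequencies; high entropy = semantically unstable answer."""
    clusters = []
    for s in samples:
        for cluster in clusters:
            if same_meaning(s, cluster[0]):
                cluster.append(s)
                break
        else:
            clusters.append([s])
    n = len(samples)
    return sum(-(len(c) / n) * log(len(c) / n) for c in clusters)

same = lambda a, b: a.lower() == b.lower()
print(semantic_entropy(["Paris", "paris", "Paris"], same))  # 0.0 (stable)
print(semantic_entropy(["Paris", "London", "Rome"], same))  # ~1.1 (unstable)
```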
Best practice combines these: (1) calibrate confidence via fine-tuning or prompt-based elicitation, (2) detect hallucinations via semantic entropy or probes, (3) apply conformal filtering for formal coverage guarantees, and (4) train models to abstain gracefully when uncertain. No single confidence source handles all three objectives equally.
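This four-step recipe can be glued together in a few lines. Every callable below is a stand-in for one of the methods in the table, and the gates are placeholders to tune:

```python
def risk_managed_answer(query, generate, calibrate, entropy, keep_claim,
                        conf_gate=0.7, entropy_gate=1.0):
    """Illustrative composition of the four steps; all components
    are stand-ins for real methods."""
    answer, raw_conf = generate(query)
    conf = calibrate(raw_conf)  # (1) calibrated confidence
    # (2) hallucination check + (4) graceful abstention when uncertain.
    if conf < conf_gate or entropy(query, answer) > entropy_gate:
        return "I'm not sure."
    # (3) filter individual claims, e.g. via conformal factuality.
    claims = [c.strip() for c in answer.split(".") if c.strip()]
    kept = [c for c in claims if keep_claim(c, conf)]
    return ". ".join(kept) + "." if kept else "I'm not sure."
```

The point of the composition is that each stage compensates for another's blind spot: calibration alone doesn't filter claims, and conformal filtering alone doesn't abstain.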