Overview
At inference time, confidence shapes three critical decisions: (1) which candidate response to return (output selection), (2) when to stop generating or searching (adaptive stopping), and (3) how to shape the token distribution during decoding (decoding control).
Unlike training, where decisions are amortized across many examples, inference-time decisions are made online and immediately affect user experience. Confidence signals must therefore be fast, interpretable, and reliable.
Output Selection
When multiple candidate responses are available (from beam search, sampling, or multi-turn generation), confidence scores decide which one to present to the user.
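The simplest output-selection rule is self-consistency: sample several reasoning paths, extract each path's final answer, and return the most frequent one. A minimal sketch (the `self_consistency_vote` helper and its inputs are illustrative, not from any particular implementation):

```python
from collections import Counter

def self_consistency_vote(answers):
    """Majority vote over final answers extracted from independently
    sampled reasoning paths. Returns the winning answer and its
    agreement rate, which doubles as a rough confidence estimate."""
    counts = Counter(answers)
    best, freq = counts.most_common(1)[0]
    return best, freq / len(answers)
```

The agreement rate is only a heuristic confidence signal; as the summary table notes, methods like CISC refine it by weighting each vote.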
Adaptive Stopping & Search
Instead of generating a fixed number of samples or taking a fixed number of search steps, confidence signals can decide when enough exploration has been done and it is safe to commit to a solution.
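A simplified stopping rule in this spirit: draw samples one at a time and commit once the leading answer's empirical share crosses a threshold. This is a stand-in for Adaptive-Consistency's actual Dirichlet-based stopping criterion; the function name and threshold are illustrative.

```python
from collections import Counter

def sample_until_stable(sample_fn, max_samples=16, threshold=0.8):
    """Sample answers one at a time via `sample_fn` and stop early
    once the leading answer holds at least `threshold` of the votes
    (checked only after a minimum of 3 samples)."""
    counts = Counter()
    for n in range(1, max_samples + 1):
        counts[sample_fn()] += 1
        answer, freq = counts.most_common(1)[0]
        if n >= 3 and freq / n >= threshold:
            return answer, n  # committed early
    return counts.most_common(1)[0][0], max_samples
```

On easy questions where samples agree, this spends 3 samples instead of 16; on contested questions it falls back to the full budget.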
Decoding Control
Rather than selecting finished outputs, these methods reshape the token distribution during generation based on confidence signals.
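Contrastive decoding is the canonical example: score each token by the gap between an expert model's and an amateur model's log-probabilities, restricted to tokens the expert finds plausible. A toy-vocabulary sketch under those assumptions (plain lists stand in for real logit tensors):

```python
import math

def contrastive_scores(expert_logits, amateur_logits, alpha=0.1):
    """Contrastive decoding over a toy vocabulary: keep only tokens
    whose expert probability is within a factor `alpha` of the max
    (the plausibility mask), then score survivors by the
    expert-minus-amateur log-probability gap."""
    def log_softmax(logits):
        m = max(logits)
        z = math.log(sum(math.exp(x - m) for x in logits)) + m
        return [x - z for x in logits]

    exp_lp = log_softmax(expert_logits)
    ama_lp = log_softmax(amateur_logits)
    cutoff = math.log(alpha) + max(exp_lp)  # plausibility threshold
    return [exp_lp[i] - ama_lp[i] if exp_lp[i] >= cutoff else float("-inf")
            for i in range(len(exp_lp))]
```

DoLa and Delta-CD follow the same reweighting template but obtain the "amateur" distribution from early layers or a masked context of the same model, so only one model is needed.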
Summary Table
| Method | Source | Signal | Unit | Role | Access |
|---|---|---|---|---|---|
| **Output Selection** | | | | | |
| Self-Consistency | Self | Answer agreement | Candidate | Vote | MS |
| CISC | Self | Path P(true) | Path | Weighted vote | MS |
| Universal SC | Self | Self-selection | Response | Select | MS |
| ACR | Self | SC + ambiguity | Answer | Trigger, rescore | MS |
| PRM | Auxiliary | Step rewards | Step/solution | Rerank | AV, MS |
| Math-Shepherd | Auxiliary | Process labels | Step/solution | Rerank | AV, MS |
| SteerConf | Self | Verbal confidence | Answer | Calibrate, select | BB |
| **Adaptive Stopping & Search** | | | | | |
| Adaptive-Consistency | Self | Answer stability | Answer set | Stop | MS |
| Efficient TTS | Self | Response confidence | Candidate | Vote, stop | MS, FT |
| DeepConf | Self | Low-prob tokens | Token group | Filter, stop | WB, MS |
| Firm-or-Fickle | Self | Answer vs. token prob | Response/turn | Maintain, revise | WB |
| ToT | Self | State value/vote | Thought state | Expand, prune | MS |
| ConCISE | Self | Reflection confidence | Step/trajectory | Stop, compress | FT |
| **Decoding Control** | | | | | |
| Contrastive Decoding | Hybrid | Expert-amateur gap | Token | Reweight | 2M |
| DoLa | Self | Layer contrast | Token | Reweight | WB |
| Delta-CD | Self | Masked-context gap | Token | Reweight | WB |
| CoCoA | Hybrid | Prior-context conflict | Token | Blend, reweight | WB |
| ActLCD | Self | Learned trigger | Token | Gate, reweight | WB, FT |
| CABS | Self | Sub-structure confidence | Sub-structure | Beam rerank | WB, FT |
Access legend: MS = multiple samples, WB = white-box (internal states), FT = fine-tuned, AV = auxiliary verifier, BB = black-box, 2M = two models (expert and amateur)
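To make the "weighted vote" role concrete, a CISC-style aggregation can be sketched as summing each answer's confidence mass rather than counting raw votes (a simplified illustration; the function name and inputs are hypothetical):

```python
def confidence_weighted_vote(paths):
    """Aggregate (answer, confidence) pairs by summing each answer's
    confidence mass, in the spirit of confidence-weighted
    self-consistency. Ties and calibration are ignored for brevity."""
    mass = {}
    for answer, conf in paths:
        mass[answer] = mass.get(answer, 0.0) + conf
    return max(mass, key=mass.get)
```

Note how this can disagree with a plain majority vote: one high-confidence path can outweigh two low-confidence ones.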
Discussion
Inference confidence operates at three distinct time scales:
- Candidate → Select/Aggregate. Given multiple complete outputs (from sampling or beam search), confidence decides which to return or how to aggregate. Self-consistency and process rewards both operate here.
- State → Continue/Stop/Revise. At intermediate points (after each token, thought, or step), confidence decides whether to commit (stop), continue exploring (expand), or reconsider (revise). Tree of Thoughts and adaptive consistency exemplify this layer.
- Token → Reshape Distribution. Within a single forward pass, confidence reshapes logit distributions to boost high-confidence tokens and suppress hallucinations. Contrastive decoding and DoLa operate here.
These signals are not interchangeable. A high self-consistency score does not imply high logit probability. A low-confidence token can still occur in a high-confidence reasoning path. The best systems leverage signals at all three levels, with careful calibration to avoid redundancy and propagation of errors.
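A minimal sketch of such a multi-level combination, in the spirit of DeepConf's token-level filtering followed by sample-level voting (the trace format and threshold are assumptions for illustration):

```python
from collections import Counter

def filter_then_vote(traces, min_conf=-1.0):
    """Two-level aggregation: drop traces whose mean token log-prob
    falls below `min_conf` (token-level filter), then majority-vote
    the surviving answers (sample-level aggregation). Falls back to
    voting over all traces if the filter removes everything."""
    kept = [t["answer"] for t in traces
            if sum(t["token_logprobs"]) / len(t["token_logprobs"]) >= min_conf]
    if not kept:
        kept = [t["answer"] for t in traces]
    return Counter(kept).most_common(1)[0][0]
```

Here the token-level signal vetoes traces that the sample-level vote would otherwise trust, illustrating why the two levels are complementary rather than redundant.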