Confidence-Driven Inference

Online confidence control at candidate, state, and token levels (§4)

Overview

At inference time, confidence shapes three critical decisions: (1) which candidate response to return (output selection), (2) when to stop generating or searching (adaptive stopping), and (3) how to shape the token distribution during decoding (decoding control).

Unlike training, where confidence signals amortize across many examples, inference makes decisions online, where they immediately affect the user experience. Confidence signals must therefore be fast, interpretable, and reliable.

Output Selection

When multiple candidate responses are available (from beam search, sampling, or multi-turn generation), confidence scores decide which one to present to the user.

Wang et al. · ICLR 2023
Samples multiple reasoning paths and selects the most frequently occurring answer, using agreement as confidence.
Source: Self · Unit: candidate · Role: vote · Access: MS
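The core of self-consistency selection fits in a few lines. Here `answers` stands for the final answers parsed from independently sampled reasoning paths; this is an illustrative sketch, not the paper's implementation:

```python
from collections import Counter

def self_consistency_select(answers):
    """Return the majority answer and its vote share as a confidence proxy."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

# 5 sampled paths; 3 agree, so "42" wins with confidence 0.6
ans, conf = self_consistency_select(["42", "17", "42", "42", "19"])
```

The vote share doubles as an (uncalibrated) confidence score that downstream methods can threshold or reweight.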
Taubenfeld et al. · ACL Findings 2025
Weights votes by the model's implicit confidence P(true) in each reasoning path, improving selection accuracy.
Source: Self · Unit: path · Role: weighted vote · Access: MS
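Confidence-weighted voting is a small change to the same aggregation: each path's vote is scaled by its confidence. A hypothetical sketch where `paths` holds (answer, P(true)) pairs:

```python
def weighted_vote(paths):
    """Sum each answer's per-path confidences and return the top scorer."""
    scores = {}
    for answer, confidence in paths:
        scores[answer] = scores.get(answer, 0.0) + confidence
    return max(scores, key=scores.get)

# "B" has two votes to "A"'s one, but "A"'s single
# high-confidence path (0.9 > 0.4 + 0.3) flips the outcome
best = weighted_vote([("A", 0.9), ("B", 0.4), ("B", 0.3)])
```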
Chen et al. · 2023
Prompts the model to select its own most consistent response without external voting, internalizing confidence-based selection.
Source: Self · Unit: response · Role: select · Access: MS
Jeong & Choi · 2025
Combines self-consistency frequency with ambiguity detection to re-score answers, improving selection robustness.
Source: Self · Unit: answer · Role: trigger, rescore · Access: MS
Lightman et al. · ICLR 2024
Auxiliary reward model that scores each step in a reasoning path, then reranks complete solutions by accumulated rewards.
Source: Auxiliary · Unit: step/solution · Role: rerank · Access: AV, MS
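A process-reward reranker can be sketched as follows. `step_scorer` is a hypothetical stand-in for the auxiliary reward model, and mean aggregation is one common choice (min and product are others):

```python
def prm_rerank(solutions, step_scorer):
    """Pick the solution whose steps earn the highest mean process reward."""
    def mean_reward(steps):
        rewards = [step_scorer(step) for step in steps]
        return sum(rewards) / len(rewards)
    return max(solutions, key=mean_reward)

# Toy steps carry precomputed rewards in place of real PRM calls
sols = [
    [{"text": "step a1", "reward": 0.9}, {"text": "step a2", "reward": 0.2}],
    [{"text": "step b1", "reward": 0.7}, {"text": "step b2", "reward": 0.8}],
]
best = prm_rerank(sols, lambda step: step["reward"])
```

Note how the uniformly strong solution (mean 0.75) beats the one with a single impressive but unreliable step (mean 0.55).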
Wang et al. · ACL 2024
Automatically generates step-level labels for solutions, enabling training of process rewards for selection.
Source: Auxiliary · Unit: step/solution · Role: rerank · Access: AV, MS
Zhou et al. · NeurIPS 2025
Elicits and calibrates verbal confidence from models, using it to select best responses in diverse settings.
Source: Self · Unit: answer · Role: calibrate, select · Access: BB

Adaptive Stopping & Search

Instead of generating a fixed number of samples or search steps, confidence can decide when enough exploration has been done and it is safe to commit to a solution.

Aggarwal et al. · EMNLP 2023
Stops generating additional reasoning paths once the majority answer achieves stable agreement.
Source: Self · Unit: answer set · Role: stop · Access: MS
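A minimal version of this idea stops sampling once the leading answer is a fixed margin ahead of the runner-up; the paper itself uses a Dirichlet-based stopping criterion, so treat the margin rule below as an illustrative simplification:

```python
from collections import Counter

def sample_until_stable(sample_fn, max_samples=40, margin=3):
    """Draw answers one at a time; stop when the top answer leads by `margin`."""
    counts = Counter()
    for n in range(1, max_samples + 1):
        counts[sample_fn()] += 1
        ranked = counts.most_common(2)
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        if ranked[0][1] - runner_up >= margin:
            break
    return ranked[0][0], n

# Deterministic toy sampler standing in for temperature sampling
answers = iter(["A", "B", "A", "A", "A"])
ans, n_used = sample_until_stable(lambda: next(answers))
```

On easy questions where samples agree quickly, this commits after a handful of draws instead of a fixed large budget.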
Huang et al. · 2025
Uses calibrated confidence estimates to determine when to stop voting and commit to the selected answer.
Source: Self · Unit: candidate · Role: vote, stop · Access: MS, FT
Fu et al. · 2025
Monitors the confidence of low-probability tokens in generation; stops or filters when confidence drops below threshold.
Source: Self · Unit: token group · Role: filter, stop · Access: WB, MS
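One way to realize this is to score each trace by the mean log-probability of its least confident tokens, dropping (or early-stopping) traces that fall below a threshold. The group size and threshold here are illustrative, not the paper's values:

```python
def trace_confidence(token_logprobs, k=5):
    """Mean log-probability of the k least confident tokens in a trace."""
    worst = sorted(token_logprobs)[:k]
    return sum(worst) / len(worst)

def keep_trace(token_logprobs, threshold=-2.5, k=5):
    """Filter rule: keep a trace only if its weakest tokens stay confident."""
    return trace_confidence(token_logprobs, k) >= threshold
```

Focusing on the weakest tokens, rather than the average over all tokens, keeps a few very uncertain steps from hiding inside an otherwise fluent generation.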
Li et al. · ACL Findings 2025
Compares answer-level vs. token-level probabilities to decide whether to maintain the answer or revise it.
Source: Self · Unit: response/turn · Role: maintain, revise · Access: WB
Yao et al. · NeurIPS 2023
Scores intermediate reasoning states with value or voting confidence; expands promising branches and prunes low-confidence ones.
Source: Self · Unit: thought state · Role: expand, prune · Access: MS
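The expand-and-prune loop can be sketched as a beam search over thought states. `expand` and `value` stand in for the LLM's proposal and evaluation prompts (hypothetical interfaces); the usage example uses plain numbers as toy states:

```python
def search_thoughts(root, expand, value, beam_width=2, depth=3):
    """BFS over thought states: expand the frontier, keep the top-valued few."""
    frontier = [root]
    for _ in range(depth):
        children = [child for state in frontier for child in expand(state)]
        if not children:
            break
        # Prune: keep only the beam_width most promising branches
        frontier = sorted(children, key=value, reverse=True)[:beam_width]
    return max(frontier, key=value)

# Toy problem: states are numbers, expansion adds 1 or 2, value is the state
best = search_thoughts(0, lambda s: [s + 1, s + 2], lambda s: s)
```

Pruning is what makes the search tractable: the frontier stays at `beam_width` states regardless of depth, so the confidence signal directly controls compute.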
Qiao et al. · EMNLP 2025
Uses confidence in self-reflection steps to decide whether to stop iterating or compress the trajectory.
Source: Self · Unit: step/trajectory · Role: stop, compress · Access: FT

Decoding Control

Rather than selecting finished outputs, these methods reshape the token distribution during generation based on confidence signals.

Li et al. · ACL 2023
Reweights tokens by the gap between expert and amateur model logits, boosting confident expert predictions.
Source: Hybrid · Unit: token · Role: reweight · Access: 2M
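A self-contained sketch of the scoring rule, with raw logit lists standing in for two model calls; `alpha` implements the paper's plausibility constraint (only tokens within an `alpha` factor of the expert's top probability stay eligible):

```python
import math

def contrastive_scores(expert_logits, amateur_logits, alpha=0.1):
    """Score tokens by expert-minus-amateur log-prob, masking implausible ones."""
    def log_softmax(logits):
        m = max(logits)
        z = m + math.log(sum(math.exp(l - m) for l in logits))
        return [l - z for l in logits]

    expert_lp = log_softmax(expert_logits)
    amateur_lp = log_softmax(amateur_logits)
    cutoff = math.log(alpha) + max(expert_lp)  # plausibility threshold
    return [
        e - a if e >= cutoff else float("-inf")
        for e, a in zip(expert_lp, amateur_lp)
    ]

# With a flat amateur, the expert's top token wins; the long-tail
# third token falls below the plausibility cutoff and is masked out.
scores = contrastive_scores([2.0, 1.0, -3.0], [0.0, 0.0, 0.0])
```

The plausibility mask is essential: without it, the expert-minus-amateur gap can promote tokens that both models consider nearly impossible.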
Chuang et al. · ICLR 2024
Contrasts predictions from early and late layers to identify confident, factual tokens.
Source: Self · Unit: token · Role: reweight · Access: WB
Huang & Chen · 2025
Contrasts model behavior with/without context masking to detect factual (context-dependent) tokens.
Source: Self · Unit: token · Role: reweight · Access: WB
Khandelwal et al. · EMNLP 2025
Identifies tokens where prior and context disagree, blending confidences to improve factuality.
Source: Hybrid · Unit: token · Role: blend, reweight · Access: WB
Zhang et al. · EMNLP 2025
Trains a gate to selectively apply contrastive decoding when model confidence is low.
Source: Self · Unit: token · Role: gate, reweight · Access: WB, FT
Wei et al. · 2024
Scores sub-structures (phrases, clauses) by confidence and reranks beam hypotheses accordingly.
Source: Self · Unit: sub-structure · Role: beam rerank · Access: WB, FT

Summary Table

| Method | Source | Signal | Unit | Role | Access |
| --- | --- | --- | --- | --- | --- |
| **Output Selection** | | | | | |
| Self-Consistency | Self | Answer agreement | Candidate | Vote | MS |
| CISC | Self | Path P(true) | Path | Weighted vote | MS |
| Universal SC | Self | Self-selection | Response | Select | MS |
| ACR | Self | SC + ambiguity | Answer | Trigger, rescore | MS |
| PRM | Auxiliary | Step rewards | Step/solution | Rerank | AV, MS |
| Math-Shepherd | Auxiliary | Process labels | Step/solution | Rerank | AV, MS |
| SteerConf | Self | Verbal confidence | Answer | Calibrate, select | BB |
| **Adaptive Stopping & Search** | | | | | |
| Adaptive-Consistency | Self | Answer stability | Answer set | Stop | MS |
| Efficient TTS | Self | Response confidence | Candidate | Vote, stop | MS, FT |
| DeepConf | Self | Low-prob tokens | Token group | Filter, stop | WB, MS |
| Firm-or-Fickle | Self | Answer vs. token prob | Response/turn | Maintain, revise | WB |
| ToT | Self | State value/vote | Thought state | Expand, prune | MS |
| ConCISE | Self | Reflection confidence | Step/trajectory | Stop, compress | FT |
| **Decoding Control** | | | | | |
| Contrastive Decoding | Hybrid | Expert-amateur gap | Token | Reweight | 2M |
| DoLa | Self | Layer contrast | Token | Reweight | WB |
| Delta-CD | Self | Masked-context gap | Token | Reweight | WB |
| CoCoA | Hybrid | Prior-context conflict | Token | Blend, reweight | WB |
| ActLCD | Self | Learned trigger | Token | Gate, reweight | WB, FT |
| CABS | Self | Sub-structure confidence | Sub-structure | Beam rerank | WB, FT |

Access legend: MS = multiple samples, WB = white-box (internal states), FT = fine-tuned, AV = auxiliary verifier, BB = black-box, 2M = two models (e.g., expert + amateur)

Discussion

Inference confidence operates at three distinct time scales:

  • Candidate → Select/Aggregate. Given multiple complete outputs (from sampling or beam search), confidence decides which to return or how to aggregate. Self-consistency and process rewards both operate here.
  • State → Continue/Stop/Revise. At intermediate points (after each token, thought, or step), confidence decides whether to commit (stop), continue exploring (expand), or reconsider (revise). Tree-of-Thought and adaptive consistency exemplify this layer.
  • Token → Reshape Distribution. Within a single forward pass, confidence reshapes logit distributions to boost high-confidence tokens and suppress hallucinations. Contrastive decoding and DoLa operate here.

These signals are not interchangeable. A high self-consistency score does not imply high logit probability. A low-confidence token can still occur in a high-confidence reasoning path. The best systems leverage signals at all three levels, with careful calibration to avoid redundancy and propagation of errors.