Overview
Training fundamentally depends on quality signals. Rather than treating all data equally, confidence-aware training uses reliability estimates to allocate gradient mass: curate which examples to train on, weight how much each example influences the loss, and guide preference optimization away from regions the model is uncertain about.
Three core mechanisms emerge: (1) data selection filters low-quality or high-uncertainty instances before training, (2) fine-tuning & distillation use confidence to weight token-level or example-level losses, and (3) preference optimization & RL use confidence to shape rewards and improve sample efficiency.
Data Selection
Low-confidence examples (uncertain predictions, large performance gaps) are often the most valuable to learn from. These papers use various confidence signals to score and filter training data.
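One representative signal is the likelihood gap exploited by Cherry-style selection: compare the model's answer loss with and without the instruction in context, and keep examples where the instruction helps least (i.e., the hardest ones). The sketch below is illustrative, assuming precomputed average log-likelihoods; the field names and threshold are hypothetical, not from any specific implementation.

```python
# Minimal sketch of likelihood-gap data selection (Cherry-style IFD scoring).
# Assumes per-example answer log-likelihoods are already computed; the
# threshold and dict keys are illustrative.

def ifd_score(logp_answer_given_instruction: float, logp_answer_alone: float) -> float:
    """Instruction-following difficulty: ratio of conditioned to unconditioned
    answer loss. Higher means the instruction helped less, i.e. the example
    is harder and more informative to train on."""
    loss_cond = -logp_answer_given_instruction    # per-token loss with instruction
    loss_uncond = -logp_answer_alone              # per-token loss without it
    return loss_cond / loss_uncond

def select_examples(examples, threshold=0.9):
    """Keep examples whose IFD score exceeds the threshold (hard examples)."""
    return [ex for ex in examples
            if ifd_score(ex["logp_cond"], ex["logp_uncond"]) > threshold]

data = [
    {"id": 0, "logp_cond": -0.2, "logp_uncond": -2.0},  # instruction helps a lot -> easy, drop
    {"id": 1, "logp_cond": -1.9, "logp_uncond": -2.0},  # instruction barely helps -> hard, keep
]
kept = select_examples(data)
```

Filtering on this ratio rather than raw loss separates "hard because informative" from "hard because the answer is intrinsically long or rare".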
Fine-Tuning & Distillation
Rather than binary selection, these methods weight or gate the training loss by token-level or example-level confidence, allowing the model to learn from all data but with adaptive emphasis.
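The general shape of such a loss can be sketched in a few lines. The focal-loss-style gate `(1 - p)^gamma` below is one illustrative choice of weighting function, not the scheme of any particular paper; each method differs in how it maps token confidence to a loss weight.

```python
# Minimal sketch of a confidence-weighted token loss: down-weight tokens the
# model already predicts confidently so gradient mass flows to uncertain ones.
# The (1 - p)^gamma gate is a focal-loss-style choice used here for illustration.
import math

def weighted_token_loss(token_probs, gamma=2.0):
    """token_probs: model probability assigned to each gold token.
    Returns the mean of (1 - p)^gamma * (-log p) over tokens."""
    losses = []
    for p in token_probs:
        ce = -math.log(p)             # standard cross-entropy for this token
        weight = (1.0 - p) ** gamma   # confident tokens (p near 1) get ~0 weight
        losses.append(weight * ce)
    return sum(losses) / len(losses)
```

Because the weight vanishes as p approaches 1, well-learned tokens contribute almost no gradient, while uncertain tokens dominate the update.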
Preference Optimization & Reinforcement Learning
Modern alignment relies on pairwise preferences and reward shaping. These papers use confidence to decide which pairs are trained on and how strongly they are weighted, or to shape intrinsic rewards from model uncertainty.
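A simple way to see pair-level weighting is to scale a DPO-style logistic loss by a per-pair confidence. The sketch below is a hedged illustration: the dict keys, the weighting rule, and the normalization are assumptions, not the formulation of any specific variant in the table.

```python
# Minimal sketch of confidence-weighted preference optimization: a DPO-style
# logistic loss on the chosen/rejected log-prob margin, scaled per pair by a
# confidence weight in [0, 1]. Names and the weighting rule are illustrative.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def weighted_dpo_loss(pairs, beta=0.1):
    """pairs: dicts with policy/reference log-probs for the chosen and
    rejected responses, plus a per-pair confidence weight."""
    total, weight_sum = 0.0, 0.0
    for p in pairs:
        margin = (p["logp_chosen"] - p["ref_logp_chosen"]) \
               - (p["logp_rejected"] - p["ref_logp_rejected"])
        loss = -math.log(sigmoid(beta * margin))   # standard DPO pair loss
        total += p["confidence"] * loss            # confident pairs count more
        weight_sum += p["confidence"]
    return total / weight_sum

good = [{"logp_chosen": -1.0, "ref_logp_chosen": -1.5,
         "logp_rejected": -2.0, "ref_logp_rejected": -1.5, "confidence": 1.0}]
bad = [{"logp_chosen": -2.0, "ref_logp_chosen": -1.5,
        "logp_rejected": -1.0, "ref_logp_rejected": -1.5, "confidence": 1.0}]
```

Setting a pair's confidence near zero effectively removes it from the batch, so hard filtering is recovered as a special case of weighting.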
DPO Variants and Extensions
Reinforcement Learning with Confidence Rewards
RL with Reward Uncertainty
Summary Table
| Method | Source | Signal | Unit | Training Action |
|---|---|---|---|---|
| Data Curation | | | | |
| Cherry | Self | Answer likelihood gap | Example | Filter |
| Superfiltering | Self | Weak model likelihood | Example | Transfer filter |
| UniMax | Hybrid | Uncertainty + influence | Example | Score, select |
| Active-Pref | Hybrid | Entropy + preference cert. | Pair | Select |
| SelectIT | Hybrid | Self-reflection confidence | Example | Select |
| CLEAR | Hybrid | Evaluator score | Example | Filter, rectify |
| Ask-LLM | Auxiliary | Judge P(useful) | Example | Select |
| Fine-Tuning & Distillation | | | | |
| UA-CLM | Self | Token probability | Token | Weight loss |
| US-Tuning | Self | Context sufficiency | QA pair | Gate signal |
| C-LoRA | Self | Example confidence | Example | Gate adapter |
| SelecTKD | Auxiliary | Teacher token confidence | Token | Selective distill |
| ATKD | Auxiliary | Teacher uncertainty | Token | Split supervision |
| SRD | Self | Difficulty + compatibility | Example | Prioritize |
| Preference Optimization & RL | | | | |
| DPO | Self | Log prob ratio | Pair | Optimize |
| SimPO | Self | Avg log prob | Pair | Optimize |
| β-DPO | Self | Pair confidence | Pair | Adaptive weight |
| ConfPO | Self | Token probability | Token | Select tokens |
| CAPO | Self | Confidence | Pair | Modulate margin |
| CGPO | Self | Token confidence | Step | Find split point |
| CRew | Self | Answer confidence | Response | Intrinsic reward |
| RLSC | Self | Self-confidence | Response | Reward shape |
| RENT | Self | Negative entropy | Trajectory | Intrinsic reward |
| UP-RLHF | Auxiliary | Reward ensemble variance | Response | Uncertainty penalty |
| UA-RLHF | Auxiliary | Reward variance | Response | Conservative constraint |
| Taming-OC | Hybrid | Calibration + confidence | Response | Correct reward |
| BCHRL | Self | Confidence | Response/claim | Abstention policy |
Discussion
Training uses confidence to allocate gradient mass. Three patterns dominate:
- Low self-confidence = high learning value. Cherry, ConfPO, and CGPO all target examples or tokens where the model expresses uncertainty. The intuition: if the model is already confident, little is gained from further optimization; if it is uncertain, the gradient is most valuable.
- High auxiliary confidence = trustworthy supervision. SelecTKD, ATKD, and CLEAR use teacher or judge confidence to decide which supervision signals to use. This prevents learning from noisy or low-confidence teacher outputs.
- High reward uncertainty = suppress optimization. UP-RLHF, UA-RLHF, and Taming-OC all recognize that when reward models disagree or express low confidence, the RL gradient may be misleading. Conservative measures protect the learned policy.
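The third pattern is the easiest to make concrete. A common conservative device (in the spirit of UP-RLHF's ensemble penalty) scores a response with several reward models and subtracts a multiple of their disagreement; the penalty coefficient and helper names below are illustrative assumptions.

```python
# Minimal sketch of an uncertainty-penalized reward: score a response with an
# ensemble of reward models and subtract a multiple of the ensemble's standard
# deviation, so disagreement suppresses the reward signal. The penalty
# coefficient and example scores are illustrative.
import statistics

def penalized_reward(ensemble_scores, penalty=1.0):
    """ensemble_scores: rewards from independent reward models for one response."""
    mean = statistics.mean(ensemble_scores)
    std = statistics.stdev(ensemble_scores) if len(ensemble_scores) > 1 else 0.0
    return mean - penalty * std   # conservative: high disagreement -> low reward

agree = penalized_reward([0.8, 0.82, 0.78])   # low variance: reward stays near 0.8
disagree = penalized_reward([0.8, 1.4, 0.2])  # same mean, high variance: reward drops
```

Even though both ensembles have the same mean score, the disagreeing one yields a much lower effective reward, which blunts reward hacking in regions the reward models do not jointly understand.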
The key insight: confidence is not just a metric; it is a control signal that decides which parameters to update and by how much. The next generation of training methods will likely integrate confidence-aware mechanisms more deeply into the learning dynamics itself.