Confidence-Aware Training

How confidence governs data curation, fine-tuning, and alignment (§3)

Overview

Training fundamentally depends on quality signals. Rather than treating all data equally, confidence-aware training uses reliability estimates to allocate gradient mass: curate which examples to train on, weight how much each example influences the loss, and guide preference optimization away from regions the model is uncertain about.

Three core mechanisms emerge: (1) data selection filters low-quality or high-uncertainty instances before training, (2) fine-tuning & distillation use confidence to weight token-level or example-level losses, and (3) preference optimization & RL use confidence to shape rewards and improve sample efficiency.

Data Selection

Low-confidence examples (uncertain predictions, large performance gaps) are often the most valuable to learn from. These papers use various confidence signals to score and filter training data.

Li et al. · NAACL 2024
Identifies false documents via the gap between answer likelihood in relevant vs. irrelevant passages. Selects high-uncertainty QA pairs for training.
Self Unit: example
Li et al. · 2024
Transfers weak model confidence to filter data for strong model fine-tuning. Leverages likelihood gaps across model sizes.
Self Unit: example
Han et al. · WWW 2025
Combines model uncertainty with data influence scores on a graph to select high-value training examples.
Hybrid Unit: example
Muldrew et al. · 2024
Selects preference pairs using model uncertainty and preference confidence to maximize learning value.
Hybrid Unit: pair
Liu et al. · 2024
Uses model confidence in self-reflection to select informative training examples about its own mistakes.
Hybrid Unit: example
Chen & Mueller · 2024
An auxiliary confidence evaluator scores and filters noisy labels, rectifying low-confidence annotations.
Hybrid Unit: example
Sachdeva et al. · 2024
Queries an auxiliary judge model for confidence that an example is useful for training, selecting high-confidence instances.
Auxiliary Unit: example
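A minimal sketch of the selection pattern these methods share: score each candidate by a confidence signal, then keep the least-confident (highest-value) examples. The field names and the likelihood-gap scorer are illustrative, not taken from any specific paper above.

```python
def likelihood_gap(logp_with_evidence: float, logp_without_evidence: float) -> float:
    """Confidence signal in the style of likelihood-gap selection:
    how much more likely the answer becomes when the model sees
    relevant evidence versus none."""
    return logp_with_evidence - logp_without_evidence

def select_uncertain(pool: list[dict], k: int) -> list[dict]:
    """Keep the k examples with the lowest answer log-likelihood,
    i.e. where the model is least confident and has the most to learn."""
    return sorted(pool, key=lambda ex: ex["logp_answer"])[:k]

pool = [
    {"id": "easy",   "logp_answer": -0.1},  # model already confident
    {"id": "hard",   "logp_answer": -3.2},  # uncertain: high learning value
    {"id": "medium", "logp_answer": -1.5},
]
chosen = select_uncertain(pool, 2)  # "hard" and "medium" survive the filter
```

In practice the score would come from a forward pass over the training pool; the sorting-and-truncating step is the part all of these methods share.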

Fine-Tuning & Distillation

Rather than binary selection, these methods weight or gate the training loss by token-level or example-level confidence, allowing the model to learn from all data but with adaptive emphasis.

Krishnan et al. · 2024
Modifies the causal language modeling loss to down-weight tokens the model is already confident about.
Self Unit: token
Li et al. · ACL Findings 2025
Fine-tunes models to predict whether context is sufficient, gating the training signal for answer quality.
Self Unit: QA pair
Rahmati et al. · 2025
Uses low-rank adapters gated by example-level confidence to enable adaptive fine-tuning for different data qualities.
Self Unit: example
Huang et al. · 2025
The teacher model's confidence in intermediate tokens determines which tokens are distilled to the student.
Auxiliary Unit: token
Zhong et al. · 2024
Splits training into easy tokens (supervised directly) and hard tokens (teacher-guided), using teacher uncertainty.
Auxiliary Unit: token
Li et al. · ACL Findings 2024
Scores training data by student difficulty and teacher-student compatibility, prioritizing hard-but-compatible examples.
Self Unit: example
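The token-level weighting idea reduces to a cross-entropy whose per-token weight shrinks with the model's confidence. A minimal sketch; the (1 - p) weighting is illustrative, not the exact formulation of any paper above:

```python
import math

def confidence_weighted_nll(token_probs: list[float]) -> float:
    """Cross-entropy with each token's loss scaled by (1 - p):
    tokens the model already predicts confidently contribute
    almost no gradient, while uncertain tokens dominate."""
    weights = [1.0 - p for p in token_probs]
    losses = [-math.log(p) for p in token_probs]
    total = sum(weights)
    if total == 0.0:  # every token fully confident: nothing left to learn
        return 0.0
    return sum(w * l for w, l in zip(weights, losses)) / total

# An uncertain token (p=0.2) outweighs a confident one (p=0.99).
mixed = confidence_weighted_nll([0.2, 0.99])
calm = confidence_weighted_nll([0.9, 0.99])
```

The same scaffold covers example-level gating: replace the per-token weight with a single per-example confidence score.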

Preference Optimization & Reinforcement Learning

Modern alignment relies on pairwise preferences and reward shaping. These papers use confidence to weight which pairs are trained on and how strongly, or to shape intrinsic rewards based on model uncertainty.

DPO Variants and Extensions

Rafailov et al. · NeurIPS 2023
Direct Preference Optimization using log-probability ratios. Foundational baseline for confidence-aware variants.
Self Unit: pair
Meng et al. · NeurIPS 2024
Simplifies DPO using length-averaged log probabilities, removing the reference-model ratio. A standard baseline for modern methods.
Self Unit: pair
Wu et al. · NeurIPS 2024
Dynamically adjusts the scaling parameter beta based on preference confidence, moderating optimization for uncertain pairs.
Self Unit: batch/pair
Yoon et al. · ICML 2025
Focuses preference optimization on tokens where the model is uncertain, maximizing learning value.
Self Unit: token
Pokharel et al. · 2025
Modulates reward margin in preference pairs based on confidence, suppressing optimization for ambiguous preferences.
Self Unit: pair
Lu et al. · 2025
Uses token-level confidence to identify optimal divergence points in preference pairs, improving sample efficiency.
Self Unit: step
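To ground the variants above, here is the plain DPO pairwise loss alongside one illustrative confidence-aware twist: scaling each pair's loss by how decisive its preference label is. The weighting scheme is a sketch of the general pattern, not the exact β-DPO or CAPO formulation.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair:
    -log sigmoid(beta * implicit reward margin), where the implicit
    reward is the policy-vs-reference log-probability ratio."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

def confidence_weighted_dpo(pairs, beta=0.1):
    """Down-weight pairs whose preference label is ambiguous
    (confidence near 0.5), so uncertain comparisons move the policy less."""
    total = 0.0
    for p in pairs:
        w = 2.0 * abs(p["pref_confidence"] - 0.5)  # 0 for a tie, 1 for a sure label
        total += w * dpo_loss(p["logp_w"], p["logp_l"],
                              p["ref_logp_w"], p["ref_logp_l"], beta)
    return total / max(len(pairs), 1)
```

With a zero margin the loss is ln 2; a positive margin (chosen response gains probability relative to the reference) drives it toward zero, and a confidence of exactly 0.5 removes the pair from the gradient entirely.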

Reinforcement Learning with Confidence Rewards

Du et al. · 2025
Uses the model's own confidence in final answers as an intrinsic reward signal for RL training.
Self Unit: response
Li et al. · 2025
Shapes RL rewards by penalizing low-confidence predictions, encouraging the model to express uncertainty appropriately.
Self Unit: response
Prabhudesai et al. · 2025
Uses negative entropy over action distributions as an intrinsic reward, encouraging confident, decisive trajectories.
Self Unit: trajectory
Zhou et al. · ICML Workshop 2025
Weights preference pairs by sequence-level confidence and difficulty, balancing learning across pairs.
Hybrid Unit: trajectory
Liu et al. · 2025
Ensures trajectory-level calibration during sequence RL, preventing overconfident optimization.
Self Unit: trajectory
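The entropy-as-reward idea reduces to a one-liner: reward steps whose action distributions are peaked. A minimal sketch of the signal (natural-log entropy), not any paper's full training loop:

```python
import math

def negative_entropy_reward(action_probs: list[float]) -> float:
    """Intrinsic reward = -H(p). Peaked (confident) distributions
    earn higher reward than diffuse ones; a deterministic choice
    earns the maximum reward of 0."""
    return sum(p * math.log(p) for p in action_probs if p > 0.0)

confident = negative_entropy_reward([0.9, 0.1])  # close to 0
diffuse = negative_entropy_reward([0.5, 0.5])    # -ln 2
```

Summing this over a rollout yields a trajectory-level bonus for decisive behavior; the calibration-focused methods above add constraints so this bonus does not simply reward overconfidence.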

RL with Reward Uncertainty

Zhai et al. · 2024
Uses disagreement among reward models as uncertainty, penalizing optimization in uncertain regions.
Auxiliary Unit: response
Banerjee & Gopalan · 2024
Monitors reward variance and applies conservative constraints when uncertainty is high.
Auxiliary Unit: response
Leng et al. · 2024
Detects and corrects overconfident reward models, preventing misaligned RL training.
Hybrid Unit: response
Wu et al. · 2025
Trains models to abstain when confidence is low, preventing risky RL updates in uncertain regions.
Self Unit: response/claim
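The disagreement-penalty pattern these methods share can be sketched as a reward that shrinks when an ensemble of reward models conflicts. The linear mean-minus-std form and the `lam` coefficient are illustrative assumptions:

```python
import statistics

def uncertainty_penalized_reward(ensemble_scores: list[float], lam: float = 1.0) -> float:
    """Effective reward = ensemble mean minus lam * disagreement
    (population std dev). When reward models conflict, the effective
    reward shrinks and the policy update is damped."""
    return statistics.mean(ensemble_scores) - lam * statistics.pstdev(ensemble_scores)

agree = uncertainty_penalized_reward([1.0, 1.0, 1.0])     # unanimous: full reward
disagree = uncertainty_penalized_reward([0.0, 1.0, 2.0])  # same mean, heavy penalty
```

Tuning `lam` trades off reward exploitation against conservatism: a high value effectively freezes the policy wherever the reward models cannot agree.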

Summary Table

| Method | Source | Signal | Unit | Training Action |
|---|---|---|---|---|
| Data Curation |  |  |  |  |
| Cherry | Self | Answer likelihood gap | Example | Filter |
| Superfiltering | Self | Weak model likelihood | Example | Transfer filter |
| UniMax | Hybrid | Uncertainty + influence | Example | Score, select |
| Active-Pref | Hybrid | Entropy + preference cert. | Pair | Select |
| SelectIT | Hybrid | Self-reflection confidence | Example | Select |
| CLEAR | Hybrid | Evaluator score | Example | Filter, rectify |
| Ask-LLM | Auxiliary | Judge P(useful) | Example | Select |
| Fine-Tuning & Distillation |  |  |  |  |
| UA-CLM | Self | Token probability | Token | Weight loss |
| US-Tuning | Self | Context sufficiency | QA pair | Gate signal |
| C-LoRA | Self | Example confidence | Example | Gate adapter |
| SelecTKD | Auxiliary | Teacher token confidence | Token | Selective distill |
| ATKD | Auxiliary | Teacher uncertainty | Token | Split supervision |
| SRD | Self | Difficulty + compatibility | Example | Prioritize |
| Preference Optimization & RL |  |  |  |  |
| DPO | Self | Log prob ratio | Pair | Optimize |
| SimPO | Self | Avg log prob | Pair | Optimize |
| β-DPO | Self | Pair confidence | Pair | Adaptive weight |
| ConfPO | Self | Token probability | Token | Select tokens |
| CAPO | Self | Confidence | Pair | Modulate margin |
| CGPO | Self | Token confidence | Step | Find split point |
| CRew | Self | Answer confidence | Response | Intrinsic reward |
| RLSC | Self | Self-confidence | Response | Reward shape |
| RENT | Self | Negative entropy | Trajectory | Intrinsic reward |
| UP-RLHF | Auxiliary | Reward ensemble variance | Response | Uncertainty penalty |
| UA-RLHF | Auxiliary | Reward variance | Response | Conservative constraint |
| Taming-OC | Hybrid | Calibration + confidence | Response | Correct reward |
| BCHRL | Self | Confidence | Response/claim | Abstention policy |

Discussion

Training uses confidence to allocate gradient mass. Three patterns dominate:

  • Low self-confidence = high learning value. Cherry, ConfPO, and CGPO all target examples or tokens where the model expresses uncertainty. The intuition: if the model is already confident, further optimization gains little; if it is uncertain, the gradient is most valuable.
  • High auxiliary confidence = trustworthy supervision. SelecTKD, ATKD, and CLEAR use teacher or judge confidence to decide which supervision signals to use. This prevents learning from noisy or low-confidence teacher outputs.
  • High reward uncertainty = suppress optimization. UP-RLHF, UA-RLHF, and Taming-OC all recognize that when reward models disagree or express low confidence, the RL gradient may be misleading. Conservative measures protect the learned policy.

The key insight: confidence is not just a metric; it is a control signal that decides which parameters to update and by how much. The next generation of training methods will likely integrate confidence-aware mechanisms more deeply into the learning dynamics themselves.