Overview
Agentic systems compose multiple steps, decisions, and agents over trajectories that can extend for minutes or hours. Confidence at individual steps becomes a critical control signal for deciding whether to continue, replan, escalate, search deeper, or defer to humans.
Four patterns emerge: (1) selective escalation — trigger expensive operations (debates, deeper search, human input) when confidence is low, (2) self-correction — detect and fix errors within a trajectory, (3) verifier-guided search — use reward or process models to guide exploration (MCTS, RL), and (4) multi-agent deliberation — aggregate confidence across agents to reach collective decisions.
Selective Escalation
Use confidence to decide when to invoke expensive or escalated mechanisms: multi-agent debate, longer searches, additional compute, or human intervention.
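A minimal sketch of the pattern, assuming caller-supplied `cheap_model` and `expensive_model` callables that return an (answer, confidence) pair; the names and threshold are illustrative, not a specific paper's API:

```python
def selective_escalation(query, cheap_model, expensive_model, threshold=0.7):
    """Escalate to the expensive mechanism only when the cheap path is
    unsure. `cheap_model` and `expensive_model` are caller-supplied
    callables returning (answer, confidence in [0, 1]); both names are
    illustrative stand-ins for model or human-in-the-loop calls."""
    answer, conf = cheap_model(query)
    if conf >= threshold:
        return answer, "cheap"
    # Low confidence: invoke the expensive mechanism (multi-agent debate,
    # deeper search, a larger model, or a human) instead of answering.
    answer, _ = expensive_model(query)
    return answer, "escalated"
```

The threshold is the key tuning knob: it trades the cost of false escalations against the risk of confidently wrong cheap answers.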
Self-Correction
Rather than accepting the first output, these systems use confidence signals to detect errors and refine trajectories.
Related baselines (behavioral): Self-Refine (NeurIPS 2023), Reflexion (NeurIPS 2023), ASTRO (2025), Self-Backtracking (2025). These methods are effective but rely on less structured confidence signals.
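The locate-and-refine loop common to these methods can be sketched as follows; `generate`, `step_confidence`, and `revise_step` are hypothetical caller-supplied callables, not any cited system's interface:

```python
def self_correct(query, generate, step_confidence, revise_step,
                 threshold=0.6, max_rounds=3):
    """Confidence-guided refinement: find the least-confident step in a
    trajectory and revise it, repeating until every step clears the
    threshold or the round budget is exhausted. All three callables are
    illustrative stand-ins for model calls."""
    steps = generate(query)  # initial trajectory as a list of steps
    for _ in range(max_rounds):
        confs = [step_confidence(s) for s in steps]
        worst = min(range(len(steps)), key=confs.__getitem__)
        if confs[worst] >= threshold:
            break  # all steps look reliable; stop refining
        steps[worst] = revise_step(steps[worst])  # targeted revision
    return steps
```

Targeting the lowest-confidence step, rather than regenerating the whole trajectory, is what distinguishes confidence-guided refinement from blind resampling.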
Verifier-Guided Search
Use verifier or process rewards as confidence signals to guide tree search (MCTS) or beam search, directing computational budget toward promising branches.
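A simplified beam-search variant of this idea, assuming a hypothetical `prm_score` verifier and `expand` step generator (both placeholders for model calls, not a specific method's API):

```python
import heapq

def prm_beam_search(expand, prm_score, init, beam_width=3, depth=4):
    """Beam search guided by a process reward model (PRM): at each depth,
    keep only the `beam_width` prefixes the PRM scores highest.
    `expand(prefix)` yields candidate next steps; `prm_score(prefix)`
    returns a scalar confidence for a prefix."""
    beam = [init]
    for _ in range(depth):
        candidates = [prefix + [step]
                      for prefix in beam
                      for step in expand(prefix)]
        if not candidates:
            break
        # Direct the compute budget toward branches the verifier trusts.
        beam = heapq.nlargest(beam_width, candidates, key=prm_score)
    return max(beam, key=prm_score)
```

MCTS-based methods replace the fixed-width pruning with selection and backup over a tree, but the control signal is the same: verifier confidence decides which branches receive further expansion.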
Multi-Agent Deliberation
When multiple agents or samples are available, aggregate their confidence signals to reach collective decisions.
Related diagnostic works: Can LLMs Debate? (2025), Voting or Consensus (ACL Findings 2025), Free-MAD (2025). These analyze multi-agent behaviors but focus on mechanics rather than confidence-based control.
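One simple aggregation rule, in the spirit of ReConcile-style confidence-weighted voting (the exact weighting schemes differ across papers; this sketch is illustrative):

```python
from collections import defaultdict

def confidence_weighted_vote(agent_outputs):
    """Aggregate (answer, confidence) pairs from multiple agents into a
    collective decision: each agent votes with its confidence, and the
    answer with the highest total wins. Returns the winner and its share
    of the total confidence mass as a crude group-level confidence."""
    scores = defaultdict(float)
    for answer, conf in agent_outputs:
        scores[answer] += conf
    winner = max(scores, key=scores.get)
    total = sum(scores.values())
    return winner, scores[winner] / total
```

Note that two moderately confident agents can outvote one highly confident agent, which is exactly the behavior debate-style systems then probe in further rounds.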
Summary Table
| Method | Source | Signal | Unit | Role | Access |
|---|---|---|---|---|---|
| Selective Escalation | | | | | |
| iMAD | Self | Hesitation features | Query | Trigger debate | Pr, FT, MA |
| PRM Calibration | Auxiliary | Calibrated PRM | Prefix/traj | Allocate budget | Aux, MS, FT |
| Scaling TTC | Hybrid | Difficulty + verifier | Query/traj | Allocate compute | Aux, MS |
| VeriMAP | Auxiliary | Planner verification | Subtask/plan | Retry, replan | Env, Aux |
| Self-Correction | | | | | |
| ReVISE | Self | Self-verification | Step/traj | Revise, stop | FT |
| SSR | Self | Step confidence | Step | Locate, refine | MS, Pr |
| BacktrackAgent | Auxiliary | Verifier signals | Page/action | Detect, backtrack | Env, Aux |
| Verifier-Guided Search | | | | | |
| LATS | Hybrid | Value + reflection | Node/traj | Expand, select | Env, MS |
| ReST-MCTS | Auxiliary | Process reward | Step/traj | Expand, filter | MS, FT |
| UATS | Auxiliary | PRM uncertainty | Step/node | Guide search | Aux, MS, FT |
| AgentRM | Auxiliary | Trajectory reward | Step/traj | Guide search | Aux, FT |
| RewardAgent | Hybrid | RM + verifier | Response/traj | Select, supervise | Aux, FT |
| Multi-Agent Deliberation | | | | | |
| ReConcile | Peer | Calibrated confidence | Answer/round | Aggregate, update | MA |
| ConfMAD | Peer | LN + verbal conf | Response/turn | Aggregate, revise | MA |
Access legend: Pr = prompt, FT = fine-tuned, Aux = auxiliary model, Env = environment, MA = multi-agent, MS = multiple samples
Discussion
Agentic systems fundamentally extend the scope of confidence control. Single-decision inference operates in milliseconds; agentic loops operate over minutes or hours with dozens of intermediate steps. Confidence signals must be reliable across this extended horizon.
Four open questions dominate:
- Composition: How do confidence signals compose across steps? Low confidence in step 1 does not directly tell us whether the entire trajectory will succeed. Methods like LATS and ReST-MCTS use process or trajectory rewards, but these require auxiliary training.
- Signal alignment: Self-verification (ReVISE), process rewards (LATS), and verifier judging (VeriMAP) all claim to detect errors. When they disagree, which should we trust? Best practice layers multiple signals (ReConcile, ConfMAD) but this increases complexity.
- Compute budgeting: Scaling TTC and PRM Calibration allocate compute based on confidence, but the optimal allocation is unclear. Should hard queries get more samples or deeper search?
- Human escalation: Most papers assume escalation means handing off to a larger model. When confidence is low, the right action may instead be to ask a human for clarification or approval; this direction is underexplored.
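A toy illustration of the composition problem: two equally natural ways to combine per-step confidences into a trajectory-level estimate diverge as trajectories grow, and neither is calibrated without the kind of auxiliary training noted above. The function and mode names here are illustrative:

```python
import math

def trajectory_confidence(step_confs, mode="product"):
    """Two naive compositions of per-step confidences. The product
    assumes independent steps and shrinks with trajectory length; the
    min ignores accumulated risk entirely. The gap between them is a
    concrete instance of the open composition question."""
    if mode == "product":
        return math.prod(step_confs)
    if mode == "min":
        return min(step_confs)
    raise ValueError(f"unknown mode: {mode}")
```

For ten steps at 0.9 confidence each, the min rule still reports 0.9 while the product rule reports about 0.35, so the two rules imply very different escalation decisions for the same trajectory.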
Key insight: Composition is the bottleneck. The survey has strong individual building blocks (confidence signals, verifiers, routing, RAG, risk control). Integrating them into reliable, composable agentic systems remains an open research frontier.