Confidence in Agentic Systems

From single decisions to composed action loops with layered control (§8)

Overview

Agentic systems compose multiple steps, decisions, and agents over trajectories that can extend for minutes or hours. Confidence at individual steps becomes a critical control signal for deciding whether to continue, replan, escalate, search deeper, or defer to humans.

Four patterns emerge: (1) selective escalation — trigger expensive operations (debates, deeper search, human input) when confidence is low, (2) self-correction — detect and fix errors within a trajectory, (3) verifier-guided search — use reward or process models to guide exploration (MCTS, RL), and (4) multi-agent deliberation — aggregate confidence across agents to reach collective decisions.

Selective Escalation

Use confidence to decide when to invoke expensive or escalated mechanisms: multi-agent debate, longer searches, additional compute, or human intervention.
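The common skeleton of these methods is an escalation ladder gated by thresholds on a step-level confidence score. A minimal sketch, with hypothetical `base_answer`, `debate`, and `ask_human` callables and illustrative threshold values (none of these names come from the papers above):

```python
from typing import Callable, Tuple

def escalate_if_uncertain(query: str,
                          base_answer: Callable[[str], Tuple[str, float]],
                          debate: Callable[[str], str],
                          ask_human: Callable[[str], str],
                          debate_threshold: float = 0.7,
                          human_threshold: float = 0.4) -> str:
    """Route a query up an escalation ladder based on confidence.

    base_answer returns (answer, confidence in [0, 1]); debate and
    ask_human stand in for progressively more expensive fallbacks.
    """
    answer, conf = base_answer(query)
    if conf >= debate_threshold:
        return answer            # cheap path: accept the first answer
    if conf >= human_threshold:
        return debate(query)     # mid path: e.g. multi-agent debate
    return ask_human(query)      # last resort: defer to a human
```

In practice the thresholds would be tuned on held-out data, and the rungs of the ladder can be any of the mechanisms above (deeper search, extra samples, larger models) rather than debate specifically.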

Fan et al. · 2025
Detects hesitation in model outputs; triggers multi-agent debate when confidence is low.
Source: Self · Unit: query · Role: trigger debate · Access: Pr, FT, MA
Park et al. · NeurIPS 2025
Calibrates process reward model confidence to allocate compute budget; low confidence → deeper search.
Source: Auxiliary · Unit: prefix/traj · Role: allocate budget · Access: Aux, MS, FT
Snell et al. · ICLR 2025 (Oral)
Predicts query difficulty and verifier value; allocates compute based on expected improvement from extra steps.
Source: Hybrid · Unit: query/traj · Role: allocate compute · Access: Aux, MS
Xu et al. · EACL 2026
Uses verifier signals to detect failed subtasks; triggers retry or replan with higher-level reasoning.
Source: Auxiliary · Unit: subtask/plan · Role: retry, replan · Access: Env, Aux

Self-Correction

Rather than accepting the first output, these systems use confidence signals to detect errors and refine trajectories.
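The shared control loop is verify-then-revise under a budget: keep revising while verification confidence is below a threshold. A minimal sketch, assuming hypothetical `verify` (returns a confidence in [0, 1]) and `revise` callables rather than any specific method's interface:

```python
from typing import Callable

def revise_until_confident(draft: str,
                           verify: Callable[[str], float],
                           revise: Callable[[str], str],
                           threshold: float = 0.8,
                           max_rounds: int = 3) -> str:
    """Iteratively revise a draft until verification confidence clears
    the threshold, or the revision budget is exhausted."""
    for _ in range(max_rounds):
        if verify(draft) >= threshold:
            break                # confident enough: stop revising
        draft = revise(draft)    # low confidence: attempt a fix
    return draft
```

The budget cap matters: without it, a miscalibrated verifier can trap the agent in an endless revision loop.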

Lee et al. · ICML 2025
Trains models to verify and revise their own steps; uses confidence in verification to decide whether to revise.
Source: Self · Unit: step/traj · Role: revise, stop · Access: FT
Shi et al. · 2025
Uses confidence in intermediate steps to locate errors; applies Socratic refinement to fix them.
Source: Self · Unit: step · Role: locate, refine · Access: MS, Pr
Wu et al. · EMNLP 2025
Uses verifier signals to detect action failures; backtracks and retries with alternative strategies.
Source: Auxiliary · Unit: page/action · Role: detect, backtrack · Access: Env, Aux

Related baselines (behavioral): Self-Refine (NeurIPS 2023), Reflexion (NeurIPS 2023), ASTRO (2025), Self-Backtracking (2025). These approaches are effective but rely on less structured confidence signals.

Verifier-Guided Search

Use verifier or process rewards as confidence signals to guide tree search (MCTS) or beam search, directing computational budget toward promising branches.
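As a simplified illustration of the idea (beam search rather than full MCTS, with hypothetical `expand`, `score`, and `is_final` callables standing in for the step generator, PRM, and termination check):

```python
import heapq
from typing import Callable, List

def prm_beam_search(expand: Callable[[List[str]], List[str]],
                    score: Callable[[List[str]], float],
                    is_final: Callable[[List[str]], bool],
                    beam_width: int = 4,
                    max_depth: int = 8) -> List[str]:
    """Beam search over reasoning steps, keeping only the prefixes a
    process reward model (PRM) scores highest at each depth."""
    beam: List[List[str]] = [[]]  # each candidate is a step prefix
    for _ in range(max_depth):
        candidates = [prefix + [step]
                      for prefix in beam
                      for step in expand(prefix)]
        if not candidates:
            break
        # Prune to the highest-scoring prefixes: this is where the
        # verifier signal directs the compute budget.
        beam = heapq.nlargest(beam_width, candidates, key=score)
        finished = [p for p in beam if is_final(p)]
        if finished:
            return max(finished, key=score)
    return max(beam, key=score)
```

MCTS-based methods like LATS additionally backpropagate the verifier's value estimates up the tree, but the core role of confidence is the same: rank partial trajectories and prune the rest.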

Zhou et al. · ICML 2024
Combines MCTS with value function and self-reflection; uses confidence to expand, select, and backpropagate.
Source: Hybrid · Unit: node/traj · Role: expand, select · Access: Env, MS
Zhang et al. · 2024
Uses process reward model confidence to guide MCTS expansion and filtering.
Source: Auxiliary · Unit: step/traj · Role: expand, filter · Access: MS, FT
Song et al. · 2026
Estimates PRM uncertainty via Monte Carlo dropout; uses uncertainty to guide search and allocate compute.
Source: Auxiliary · Unit: step/node · Role: guide search · Access: Aux, MS, FT
Xia et al. · ACL 2025
Trains trajectory-level reward models; uses confidence to guide search and select best trajectories.
Source: Auxiliary · Unit: step/traj · Role: guide search · Access: Aux, FT
Peng et al. · 2025
Combines reward model and verifier confidence; uses ensemble to select and supervise trajectories.
Source: Hybrid · Unit: response/traj · Role: select, supervise · Access: Aux, FT

Multi-Agent Deliberation

When multiple agents or samples are available, aggregate their confidence signals to reach collective decisions.
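The simplest form of this aggregation is confidence-weighted voting: each answer accumulates the confidence mass of the agents proposing it, and the heaviest answer wins. A minimal sketch (the function name and interface are illustrative, not from any paper above):

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def weighted_vote(agent_outputs: List[Tuple[str, float]]) -> str:
    """Aggregate (answer, confidence) pairs from multiple agents by
    summing confidence mass per answer and returning the heaviest."""
    mass: Dict[str, float] = defaultdict(float)
    for answer, conf in agent_outputs:
        mass[answer] += conf
    return max(mass, key=mass.get)
```

Methods like ReConcile run this aggregation over multiple rounds, letting agents see the weighted consensus and update their answers before the next vote; the weighting only helps insofar as the reported confidences are calibrated.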

Chen et al. · ACL 2024
Weights agent votes by calibrated confidence; iteratively refines consensus through weighted aggregation.
Source: Peer · Unit: answer/round · Role: aggregate, update · Access: MA
Lin & Hooi · EMNLP Findings 2025
Uses layer norm signals and verbal confidence to weight agent responses; aggregates and revises collectively.
Source: Peer · Unit: response/turn · Role: aggregate, revise · Access: MA

Related diagnostic works: Can LLMs Debate? (2025), Voting or Consensus (ACL Findings 2025), Free-MAD (2025). These analyze multi-agent behaviors but focus on mechanics rather than confidence-based control.

Summary Table

Method | Source | Signal | Unit | Role | Access
--- | --- | --- | --- | --- | ---
Selective Escalation | | | | |
iMAD | Self | Hesitation features | Query | Trigger debate | Pr, FT, MA
PRM Calibration | Auxiliary | Calibrated PRM | Prefix/traj | Allocate budget | Aux, MS, FT
Scaling TTC | Hybrid | Difficulty + verifier | Query/traj | Allocate compute | Aux, MS
VeriMAP | Auxiliary | Planner verification | Subtask/plan | Retry, replan | Env, Aux
Self-Correction | | | | |
ReVISE | Self | Self-verification | Step/traj | Revise, stop | FT
SSR | Self | Step confidence | Step | Locate, refine | MS, Pr
BacktrackAgent | Auxiliary | Verifier signals | Page/action | Detect, backtrack | Env, Aux
Verifier-Guided Search | | | | |
LATS | Hybrid | Value + reflection | Node/traj | Expand, select | Env, MS
ReST-MCTS | Auxiliary | Process reward | Step/traj | Expand, filter | MS, FT
UATS | Auxiliary | PRM uncertainty | Step/node | Guide search | Aux, MS, FT
AgentRM | Auxiliary | Trajectory reward | Step/traj | Guide search | Aux, FT
RewardAgent | Hybrid | RM + verifier | Response/traj | Select, supervise | Aux, FT
Multi-Agent Deliberation | | | | |
ReConcile | Peer | Calibrated confidence | Answer/round | Aggregate, update | MA
ConfMAD | Peer | LN + verbal conf | Response/turn | Aggregate, revise | MA

Access legend: Pr = prompt, FT = fine-tuned, Aux = auxiliary model, Env = environment, MA = multi-agent, MS = multiple samples

Discussion

Agentic systems fundamentally extend the scope of confidence control. Single-decision inference operates in milliseconds; agentic loops operate over minutes or hours with dozens of intermediate steps. Confidence signals must be reliable across this extended horizon.

Four open questions dominate:

  • Composition: How do confidence signals compose across steps? Low confidence in step 1 does not directly tell us whether the entire trajectory will succeed. Methods like LATS and ReST-MCTS use process or trajectory rewards, but these require auxiliary training.
  • Signal alignment: Self-verification (ReVISE), process rewards (LATS), and verifier judging (VeriMAP) all claim to detect errors. When they disagree, which should we trust? Best practice layers multiple signals (ReConcile, ConfMAD) but this increases complexity.
  • Compute budgeting: Scaling TTC and PRM Calibration allocate compute based on confidence, but the optimal allocation is unclear. Should hard queries get more samples or deeper search?
  • Human escalation: Most papers assume escalation is to a larger model or human. When confidence is low, the right action may be to ask a human for clarification or approval. This is underexplored.
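One naive answer to the compute-budgeting question is to split a fixed sample budget across queries in proportion to their uncertainty. A toy sketch (illustrative only; real allocators like Scaling TTC condition on predicted difficulty and expected improvement, and the rounding here is approximate):

```python
from typing import List

def allocate_samples(confidences: List[float],
                     total_budget: int,
                     min_samples: int = 1) -> List[int]:
    """Split a fixed sample budget across queries, giving
    low-confidence queries proportionally more samples."""
    need = [1.0 - c for c in confidences]  # uncertainty as demand
    total = sum(need) or 1.0               # avoid division by zero
    return [max(min_samples, round(total_budget * n / total))
            for n in need]
```

Even this crude scheme exposes the open trade-off: the extra budget for a hard query could instead buy deeper search on a single trajectory, and it is unclear which dominates.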

Key insight: Composition is the bottleneck. The survey has strong individual building blocks (confidence signals, verifiers, routing, RAG, risk control). Integrating them into reliable, composable agentic systems remains an open research frontier.