Overview
Agentic systems compose multiple steps, decisions, and agents over trajectories that can extend for minutes or hours. Confidence at individual steps becomes a critical control signal for deciding whether to continue, replan, escalate, search deeper, or defer to humans.
Four patterns emerge: (1) selective escalation — trigger expensive operations (debates, deeper search, human input) when confidence is low, (2) self-correction — detect and fix errors within a trajectory, (3) verifier-guided search — use reward or process models to guide exploration (MCTS, RL), and (4) multi-agent deliberation — aggregate confidence across agents to reach collective decisions.
Selective Escalation
Use confidence to decide when to invoke expensive or escalated mechanisms: multi-agent debate, longer searches, additional compute, or human intervention.
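A minimal sketch of the pattern, assuming caller-supplied `cheap_model` and `expensive_model` callables that return an (answer, confidence) pair; the names and threshold are illustrative, not a specific paper's API:

```python
def selective_escalation(query, cheap_model, expensive_model, threshold=0.7):
    """Escalate to the expensive mechanism only when the cheap path is
    unsure. `cheap_model` and `expensive_model` are caller-supplied
    callables returning (answer, confidence in [0, 1]); both names are
    illustrative stand-ins for model or human-in-the-loop calls."""
    answer, conf = cheap_model(query)
    if conf >= threshold:
        return answer, "cheap"
    # Low confidence: invoke the expensive mechanism (multi-agent debate,
    # deeper search, a larger model, or a human) instead of answering.
    answer, _ = expensive_model(query)
    return answer, "escalated"
```

The threshold is the key tuning knob: it trades the cost of false escalations against the risk of confidently wrong cheap answers.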
Self-Correction
Rather than accepting the first output, these systems use confidence signals to detect errors and refine trajectories.
Related baselines (behavioral): Self-Refine (NeurIPS 2023), Reflexion (NeurIPS 2023), ASTRO (2025), Self-Backtracking (2025). These methods are effective but rely on less structured confidence signals.
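The locate-and-refine loop common to these methods can be sketched as follows; `generate`, `step_confidence`, and `revise_step` are hypothetical caller-supplied callables, not any cited system's interface:

```python
def self_correct(query, generate, step_confidence, revise_step,
                 threshold=0.6, max_rounds=3):
    """Confidence-guided refinement: find the least-confident step in a
    trajectory and revise it, repeating until every step clears the
    threshold or the round budget is exhausted. All three callables are
    illustrative stand-ins for model calls."""
    steps = generate(query)  # initial trajectory as a list of steps
    for _ in range(max_rounds):
        confs = [step_confidence(s) for s in steps]
        worst = min(range(len(steps)), key=confs.__getitem__)
        if confs[worst] >= threshold:
            break  # all steps look reliable; stop refining
        steps[worst] = revise_step(steps[worst])  # targeted revision
    return steps
```

Targeting the lowest-confidence step, rather than regenerating the whole trajectory, is what distinguishes confidence-guided refinement from blind resampling.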
Verifier-Guided Search
Use verifier or process rewards as confidence signals to guide tree search (MCTS) or beam search, directing computational budget toward promising branches.
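A simplified beam-search variant of this idea, assuming a hypothetical `prm_score` verifier and `expand` step generator (both placeholders for model calls, not a specific method's API):

```python
import heapq

def prm_beam_search(expand, prm_score, init, beam_width=3, depth=4):
    """Beam search guided by a process reward model (PRM): at each depth,
    keep only the `beam_width` prefixes the PRM scores highest.
    `expand(prefix)` yields candidate next steps; `prm_score(prefix)`
    returns a scalar confidence for a prefix."""
    beam = [init]
    for _ in range(depth):
        candidates = [prefix + [step]
                      for prefix in beam
                      for step in expand(prefix)]
        if not candidates:
            break
        # Direct the compute budget toward branches the verifier trusts.
        beam = heapq.nlargest(beam_width, candidates, key=prm_score)
    return max(beam, key=prm_score)
```

MCTS-based methods replace the fixed-width pruning with selection and backup over a tree, but the control signal is the same: verifier confidence decides which branches receive further expansion.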
Multi-Agent Deliberation
When multiple agents or samples are available, aggregate their confidence signals to reach collective decisions.
Related diagnostic works: Can LLMs Debate? (2025), Voting or Consensus (ACL Findings 2025), Free-MAD (2025). These analyze multi-agent behaviors but focus on mechanics rather than confidence-based control.
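One simple aggregation rule, in the spirit of ReConcile-style confidence-weighted voting (the exact weighting schemes differ across papers; this sketch is illustrative):

```python
from collections import defaultdict

def confidence_weighted_vote(agent_outputs):
    """Aggregate (answer, confidence) pairs from multiple agents into a
    collective decision: each agent votes with its confidence, and the
    answer with the highest total wins. Returns the winner and its share
    of the total confidence mass as a crude group-level confidence."""
    scores = defaultdict(float)
    for answer, conf in agent_outputs:
        scores[answer] += conf
    winner = max(scores, key=scores.get)
    total = sum(scores.values())
    return winner, scores[winner] / total
```

Note that two moderately confident agents can outvote one highly confident agent, which is exactly the behavior debate-style systems then probe in further rounds.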
Summary Table
| Method | Source | Signal | Unit | Role | Access |
|---|---|---|---|---|---|
| Selective Escalation | | | | | |
| iMAD | Self | Hesitation features | Query | Trigger debate | Pr, FT, MA |
| PRM Calibration | Auxiliary | Calibrated PRM | Prefix/traj | Allocate budget | Aux, MS, FT |
| Scaling TTC | Hybrid | Difficulty + verifier | Query/traj | Allocate compute | Aux, MS |
| VeriMAP | Auxiliary | Planner verification | Subtask/plan | Retry, replan | Env, Aux |
| Self-Correction | | | | | |
| ReVISE | Self | Self-verification | Step/traj | Revise, stop | FT |
| SSR | Self | Step confidence | Step | Locate, refine | MS, Pr |
| BacktrackAgent | Auxiliary | Verifier signals | Page/action | Detect, backtrack | Env, Aux |
| Verifier-Guided Search | | | | | |
| LATS | Hybrid | Value + reflection | Node/traj | Expand, select | Env, MS |
| ReST-MCTS | Auxiliary | Process reward | Step/traj | Expand, filter | MS, FT |
| UATS | Auxiliary | PRM uncertainty | Step/node | Guide search | Aux, MS, FT |
| AgentRM | Auxiliary | Trajectory reward | Step/traj | Guide search | Aux, FT |
| RewardAgent | Hybrid | RM + verifier | Response/traj | Select, supervise | Aux, FT |
| Multi-Agent Deliberation | | | | | |
| ReConcile | Peer | Calibrated confidence | Answer/round | Aggregate, update | MA |
| ConfMAD | Peer | LN + verbal conf | Response/turn | Aggregate, revise | MA |
Access legend: Pr = prompt, FT = fine-tuned, Aux = auxiliary model, Env = environment, MA = multi-agent, MS = multiple samples
Discussion
Agentic systems fundamentally extend the scope of confidence control. Single-decision inference operates in milliseconds; agentic loops operate over minutes or hours with dozens of intermediate steps. Confidence signals must be reliable across this extended horizon.
Four open questions dominate:
- Composition: How do confidence signals compose across steps? Low confidence in step 1 does not directly tell us whether the entire trajectory will succeed. Methods like LATS and ReST-MCTS use process or trajectory rewards, but these require auxiliary training.
- Signal alignment: Self-verification (ReVISE), process rewards (LATS), and verifier judging (VeriMAP) all claim to detect errors. When they disagree, which should we trust? Best practice layers multiple signals (ReConcile, ConfMAD) but this increases complexity.
- Compute budgeting: Scaling TTC and PRM Calibration allocate compute based on confidence, but the optimal allocation is unclear. Should hard queries get more samples or deeper search?
- Human escalation: Most papers assume escalation means handing off to a larger model. When confidence is low, the right action may instead be to ask a human for clarification or approval; this direction is underexplored.
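A toy illustration of the composition problem: two equally natural ways to combine per-step confidences into a trajectory-level estimate diverge as trajectories grow, and neither is calibrated without the kind of auxiliary training noted above. The function and mode names here are illustrative:

```python
import math

def trajectory_confidence(step_confs, mode="product"):
    """Two naive compositions of per-step confidences. The product
    assumes independent steps and shrinks with trajectory length; the
    min ignores accumulated risk entirely. The gap between them is a
    concrete instance of the open composition question."""
    if mode == "product":
        return math.prod(step_confs)
    if mode == "min":
        return min(step_confs)
    raise ValueError(f"unknown mode: {mode}")
```

For ten steps at 0.9 confidence each, the min rule still reports 0.9 while the product rule reports about 0.35, so the two rules imply very different escalation decisions for the same trajectory.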
Key insight: Composition is the bottleneck. The survey has strong individual building blocks (confidence signals, verifiers, routing, RAG, risk control). Integrating them into reliable, composable agentic systems remains an open research frontier.