Confidence-Guided Model Selection

Routing, cascading, and deferral across model portfolios (§5)

Overview

Model selection across a portfolio (small+fast vs. large+accurate, or specialized experts) requires deciding when to escalate and to whom. Confidence provides the control signal: low confidence in a cheaper model's output triggers deferral to a larger one, while high confidence lets the system short-circuit and avoid the larger model's cost.

We distinguish three timing patterns: (1) sequential cascading — run a fast model first and defer to slower models when confidence is low; (2) pre-call routing — decide on the target model before calling any; and (3) hybrid systems — combine ex-ante pruning with post-hoc escalation.

Sequential Cascading

Start with a cheap or fast model. If confidence is low, defer the query to a larger or more specialized model.

Gupta et al. · 2024
Uses token-level uncertainty quantiles to decide when to cascade from small to larger models.
Source: Self · Unit: answer · Role: defer · Access: Post
Chen et al. · 2023
Orchestrates a portfolio of models (GPT-3, Davinci, etc.) with cost-aware cascading based on confidence.
Source: External · Unit: query/answer · Role: route, defer · Access: Post, Aux
Madaan et al. · NeurIPS 2024
Uses self-verification (entailment checks) and POMDP to decide whether to escalate or finalize an answer.
Source: Hybrid · Unit: answer · Role: defer · Access: Post, Aux
Rabanser et al. · NeurIPS 2025
Tunes confidence thresholds per model to optimize cost-accuracy tradeoffs in cascading.
Source: Self · Unit: answer · Role: defer · Access: Post, FT
Zellinger & Thomson · TMLR 2025
Models confidence dependencies across stages using copulas; optimizes stopping rules for cascades.
Source: Self · Unit: stage · Role: stop · Access: Post
Narasimhan et al. · 2024
Speculatively generates with small models, deferring uncertain tokens/steps to larger ones in mid-generation.
Source: Hybrid · Unit: token/stage · Role: handoff, stop · Access: Mid
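The shared skeleton of these cascades can be sketched in a few lines. Everything here is illustrative, not any cited system's API: the model callables, the (answer, confidence) return shape, and the per-stage thresholds are assumptions standing in for real model calls and calibrated scores.

```python
def cascade(query, models, thresholds):
    """Run models cheapest-first; escalate whenever confidence is low.

    models:     list of (name, call_fn) pairs, ordered cheap -> expensive,
                where call_fn(query) returns (answer, confidence in [0, 1]).
    thresholds: per-stage deferral cutoffs, one fewer than models
                (the final model always answers).
    """
    for (name, call), tau in zip(models, thresholds):
        answer, conf = call(query)
        if conf >= tau:  # confident enough: short-circuit, skip larger models
            return name, answer
    # All cheaper stages deferred; accept the final model's answer as-is.
    name, call = models[-1]
    answer, _ = call(query)
    return name, answer
```

Tuning the thresholds is where the cited works differ: per-model tuning (Gatekeeper), joint optimization across stages (Zellinger & Thomson), or learned escalation policies (AutoMix).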

Pre-Call Routing

Before calling any model, use predicted suitability confidence to route the query to the best-fit model.

Liu et al. · ICWS 2024
Bootstraps model suitability scores with uncertainty estimates to route queries before execution.
Source: External · Unit: query · Role: route · Access: Pre, Aux
Ong et al. · 2024
Trains routers on preference data to predict which model is best for a query.
Source: External · Unit: query · Role: route · Access: Pre, Aux
Ding et al. · ICLR 2024
Predicts the quality gap between models and uses it to route queries to the cheapest sufficient model.
Source: External · Unit: query · Role: route · Access: Pre, Aux
Barrak et al. · 2025
Uses judge/auxiliary confidence in score differences to route; applies tie-breaking rules for close calls.
Source: Auxiliary · Unit: query · Role: route · Access: Pre, Aux
Zhang et al. · 2025
Uses semantic entropy to detect when edge inference is uncertain, triggering offload to cloud model.
Source: Self · Unit: answer · Role: offload · Access: Post, MS
Chuang et al. · ICML 2025
Combines verbal confidence and token probabilities to route uncertain queries or suggest rejection.
Source: Self · Unit: answer · Role: route, reject · Access: Post, FT
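A cost-aware pre-call rule in the spirit of the quality-gap routers above can be sketched as "cheapest model predicted to be sufficient." The suitability predictor, candidate list, and quality bar are hypothetical placeholders; in the cited systems the predictor is trained on preference or outcome data.

```python
def route(query, candidates, predict_quality, quality_bar=0.8):
    """Pick a model before any call is made.

    candidates:      list of (model_name, cost) pairs.
    predict_quality: fn(query, model_name) -> estimated P(correct) in [0, 1].
    Rule: cheapest model whose predicted quality clears the bar; if none
    does, fall back to the model with the highest predicted quality.
    """
    scored = [(name, cost, predict_quality(query, name))
              for name, cost in candidates]
    for name, cost, q in sorted(scored, key=lambda t: t[1]):  # cheapest first
        if q >= quality_bar:
            return name
    # No candidate is predicted sufficient: take the best predictor.
    return max(scored, key=lambda t: t[2])[0]
```

Because the suitability estimate is itself uncertain, the bar effectively encodes the cost-accuracy tradeoff: raising it routes more traffic to expensive models.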

Hybrid Systems

Robust production systems combine ex-ante routing (pruning candidates before execution) with post-hoc escalation (based on observed answer confidence).

Dekoninck et al. · 2024
Integrates predicted query suitability (pre-call) with observed answer quality (post-hoc) for unified control.
Source: Hybrid · Unit: query/answer · Role: route, defer · Access: Pre, Post
Shah & Shridhar · EMNLP Industry 2025
First shortlists candidates via taxonomy matching, then ranks via judge agreement confidence.
Source: Hybrid · Unit: query/answer · Role: shortlist, defer · Access: Pre, Post, Aux
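The two-stage control loop these systems share — pre-call suitability to pick a starting point, post-hoc confidence to decide escalation — can be sketched as follows. All function names and thresholds are illustrative assumptions, not either paper's interface.

```python
def hybrid_answer(query, models, predict_quality, threshold=0.6):
    """Ex-ante routing plus post-hoc escalation over one cascade.

    models:          (name, call_fn) pairs ordered cheap -> expensive, where
                     call_fn(query) returns (answer, confidence in [0, 1]).
    predict_quality: fn(query, name) -> pre-call suitability estimate.
    """
    # Ex-ante: start at the cheapest model predicted to be adequate.
    # If none is predicted adequate, start from the bottom and rely on
    # post-hoc escalation to climb the cascade.
    start = 0
    for i, (name, _) in enumerate(models):
        if predict_quality(query, name) >= threshold:
            start = i
            break
    # Post-hoc: run, and escalate while observed confidence stays low.
    for name, call in models[start:]:
        answer, conf = call(query)
        if conf >= threshold:
            return name, answer
    return name, answer  # last model's answer, accepted even if low-confidence
```

The pre-call stage saves the cost of models that were never going to be adequate; the post-hoc stage catches the cases the predictor got wrong.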

Summary Table

Method | Source | Signal | Unit | Role | Access

Sequential Cascading
LM Cascades | Self | Token uncertainty quantiles | Answer | Defer | Post
FrugalGPT | External | Cost-aware confidence | Query/answer | Route, defer | Post, Aux
AutoMix | Hybrid | Self-verification | Answer | Defer | Post, Aux
Gatekeeper | Self | Tuned thresholds | Answer | Defer | Post, FT
Rational Tuning | Self | Copula confidence | Stage | Stop | Post
Speculative Cascading | Hybrid | Token/step confidence | Token/stage | Handoff, stop | Mid

Pre-Call Routing
OptLLM | External | Suitability uncertainty | Query | Route | Pre, Aux
RouteLLM | External | Router prediction | Query | Route | Pre, Aux
Hybrid LLM | External | Quality gap prediction | Query | Route | Pre, Aux
CARGO | Auxiliary | Score gap + tie-break | Query | Route | Pre, Aux
Semantic Entropy Routing | Self | Semantic entropy | Answer | Offload | Post, MS
Self-REF | Self | Confidence + token prob | Answer | Route, reject | Post, FT

Hybrid Systems
Unified Routing & Cascading | Hybrid | Ex-ante + post-hoc quality | Query/answer | Route, defer | Pre, Post
Select-Then-Route | Hybrid | Taxonomy + judge agreement | Query/answer | Shortlist, defer | Pre, Post, Aux

Discussion

Routing systems must account for timing, signal type, and cost tradeoffs:

  • Pre-call routing uses predicted suitability, which is inherently uncertain. Threshold tuning and cost-aware optimization become critical. Methods like OptLLM, RouteLLM, and Hybrid LLM all invest in training or bootstrapping better predictors.
  • Post-hoc cascading uses observed answer confidence, which is more direct but arrives only after the cost of the first model has been incurred. It works best when the cheap model succeeds often and escalation is the exception.
  • Mid-generation routing (speculative cascading) combines both: run a fast model with speculative execution, defer mid-stream if confidence drops. Balances latency and cost.
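The threshold tuning the first bullet calls critical can be made concrete with a validation sweep for a two-model cascade: pick the cheapest deferral threshold that keeps accuracy above a target. The data fields, cost constants, and grid are illustrative assumptions in the spirit of tuned-threshold methods like Gatekeeper, not a specific paper's procedure.

```python
def tune_threshold(val, small_cost=1.0, large_cost=10.0, target_acc=0.9):
    """Sweep the deferral threshold on held-out data.

    val: list of dicts with keys 'conf' (small model's confidence) and
    'small_ok' / 'large_ok' (1 if that model answered correctly, else 0).
    Returns (threshold, total_cost, accuracy) for the cheapest feasible
    setting, or None if no threshold meets the accuracy target.
    """
    best = None
    for tau in [i / 100 for i in range(101)]:
        correct = cost = 0.0
        for ex in val:
            if ex["conf"] >= tau:            # keep the small model's answer
                correct += ex["small_ok"]
                cost += small_cost
            else:                            # defer to the large model
                correct += ex["large_ok"]
                cost += small_cost + large_cost  # both models were run
        acc = correct / len(val)
        if acc >= target_acc and (best is None or cost < best[1]):
            best = (tau, cost, acc)
    return best
```

The same sweep generalizes to multi-stage cascades, though the works above note that per-stage thresholds interact, which is what motivates joint optimization.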

Best practice: Layer ex-ante pruning (filter out clearly unsuitable models before execution), post-hoc checking (catch failures from the chosen model), and selective escalation (defer to more capable models only when necessary).