Overview
Model selection across a portfolio (small+fast vs. large+accurate, or specialized experts) requires deciding when to escalate and to which model. Confidence provides the control signal: low confidence in a cheaper model's answer triggers deferral to a larger one; high confidence allows short-circuiting, avoiding the larger model's expense.
We distinguish three timing patterns: (1) sequential cascading — run a fast model, defer to slower models based on confidence, (2) pre-call routing — decide the target model before calling any, and (3) hybrid systems — combine both ex-ante pruning and post-hoc escalation.
Sequential Cascading
Start with a cheap or fast model. If confidence is low, defer the query to a larger or more specialized model.
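A minimal sketch of this pattern, assuming each model returns an answer together with a self-reported confidence in [0, 1]; the model stubs and the 0.8 threshold are illustrative stand-ins, not any specific system's API:

```python
def call_small(query):
    # Stand-in for a cheap, fast model: returns (answer, confidence).
    return f"small-answer:{query}", 0.4

def call_large(query):
    # Stand-in for a larger, more expensive model.
    return f"large-answer:{query}", 0.95

def cascade(query, threshold=0.8):
    """Run the cheap model; defer to the large one only on low confidence."""
    answer, conf = call_small(query)
    if conf >= threshold:
        return answer, "small"      # short-circuit: cheap model was confident
    answer, _ = call_large(query)   # low confidence -> defer
    return answer, "large"
```

The threshold is the key tuning knob: raising it improves quality at the cost of more escalations.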
Pre-Call Routing
Before calling any model, use predicted suitability confidence to route the query to the best-fit model.
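The same idea as a sketch, where a predictor estimates, before any model call, the probability that the cheap model suffices; the length heuristic below is a hypothetical stand-in for a trained router (e.g. a classifier over query embeddings):

```python
def predict_small_suffices(query):
    # Stand-in suitability predictor: a real router would be trained;
    # here, short queries are assumed "easy" for the cheap model.
    return 0.9 if len(query.split()) < 8 else 0.2

def route(query, threshold=0.5):
    """Pick the target model before calling any, based on predicted suitability."""
    return "small" if predict_small_suffices(query) >= threshold else "large"
```

Because no answer has been observed yet, the routing decision is only as good as the predictor, which is why these methods invest heavily in training it.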
Hybrid Systems
The best production systems combine ex-ante routing (before execution) with post-hoc escalation (based on observed answer confidence).
Summary Table
| Method | Source | Signal | Unit | Role | Access |
|---|---|---|---|---|---|
| **Sequential Cascading** | | | | | |
| LM Cascades | Self | Token uncertainty quantiles | Answer | Defer | Post |
| FrugalGPT | External | Cost-aware confidence | Query/answer | Route, defer | Post, Aux |
| AutoMix | Hybrid | Self-verification | Answer | Defer | Post, Aux |
| Gatekeeper | Self | Tuned thresholds | Answer | Defer | Post, FT |
| Rational Tuning | Self | Copula confidence | Stage | Stop | Post |
| Speculative Cascading | Hybrid | Token/step confidence | Token/stage | Handoff, stop | Mid |
| **Pre-Call Routing** | | | | | |
| OptLLM | External | Suitability uncertainty | Query | Route | Pre, Aux |
| RouteLLM | External | Router prediction | Query | Route | Pre, Aux |
| Hybrid LLM | External | Quality gap prediction | Query | Route | Pre, Aux |
| CARGO | Auxiliary | Score gap + tie-break | Query | Route | Pre, Aux |
| Semantic Entropy Routing | Self | Semantic entropy | Answer | Offload | Post, MS |
| Self-REF | Self | Confidence + token prob | Answer | Route, reject | Post, FT |
| **Hybrid Systems** | | | | | |
| Unified Routing & Cascading | Hybrid | Ex-ante + post-hoc quality | Query/answer | Route, defer | Pre, Post |
| Select-Then-Route | Hybrid | Taxonomy + judge agreement | Query/answer | Shortlist, defer | Pre, Post, Aux |
Discussion
Routing systems must account for timing, signal type, and cost tradeoffs:
- Pre-call routing uses predicted suitability, which is inherently uncertain. Threshold tuning and cost-aware optimization become critical. Methods like OptLLM, RouteLLM, and Hybrid LLM all invest in training or bootstrapping better predictors.
- Post-hoc cascading uses observed answer confidence, which is more direct but comes after the cost of the first model is already incurred. It works best when the cheap model succeeds often, so that escalation is the exception rather than the rule.
- Mid-generation routing (speculative cascading) combines both: run a fast model with speculative execution, and defer mid-stream if confidence drops. This balances latency and cost.
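The mid-generation case can be sketched as follows, assuming the fast model streams (token, confidence) pairs; the draft stream and the 0.6 floor are hypothetical stand-ins for a real drafter and a tuned threshold:

```python
def draft_stream(query):
    # Stand-in for a fast drafter yielding (token, confidence) pairs.
    yield "The", 0.95
    yield "answer", 0.9
    yield "is", 0.3     # confidence collapses mid-stream
    yield "42", 0.8

def generate_with_handoff(query, floor=0.6):
    """Accept draft tokens until confidence drops, then hand off mid-stream."""
    prefix = []
    for token, conf in draft_stream(query):
        if conf < floor:
            # Defer: a larger model would continue from the accepted prefix,
            # so the work already done is not discarded.
            return prefix, "handoff"
        prefix.append(token)
    return prefix, "drafted"
```

Unlike post-hoc cascading, the handoff happens before the cheap model finishes, so only the low-confidence suffix is paid for twice.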
Best practice: Layer ex-ante pruning (filter out clearly-wrong models before execution), post-hoc checking (catch failures from chosen model), and selective escalation (deferred execution only when necessary).