Overview
Model selection across a portfolio (small+fast vs. large+accurate, or specialized experts) requires deciding when to escalate and to which model. Confidence provides the control signal: low confidence in a cheaper model's answer triggers deferral to a larger one; high confidence allows short-circuiting, avoiding the larger model's expense.
We distinguish three timing patterns: (1) sequential cascading — run a fast model, defer to slower models based on confidence, (2) pre-call routing — decide the target model before calling any, and (3) hybrid systems — combine both ex-ante pruning and post-hoc escalation.
Sequential Cascading
Start with a cheap or fast model. If confidence is low, defer the query to a larger or more specialized model.
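A minimal sketch of this pattern, assuming each model returns an answer together with a self-reported confidence in [0, 1]; the model stubs and the 0.8 threshold are illustrative stand-ins, not any specific system's API:

```python
def call_small(query):
    # Stand-in for a cheap, fast model: returns (answer, confidence).
    return f"small-answer:{query}", 0.4

def call_large(query):
    # Stand-in for a larger, more expensive model.
    return f"large-answer:{query}", 0.95

def cascade(query, threshold=0.8):
    """Run the cheap model; defer to the large one only on low confidence."""
    answer, conf = call_small(query)
    if conf >= threshold:
        return answer, "small"      # short-circuit: cheap model was confident
    answer, _ = call_large(query)   # low confidence -> defer
    return answer, "large"
```

The threshold is the key tuning knob: raising it improves quality at the cost of more escalations.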
Pre-Call Routing
Before calling any model, use predicted suitability confidence to route the query to the best-fit model.
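The same idea as a sketch, where a predictor estimates, before any model call, the probability that the cheap model suffices; the length heuristic below is a hypothetical stand-in for a trained router (e.g. a classifier over query embeddings):

```python
def predict_small_suffices(query):
    # Stand-in suitability predictor: a real router would be trained;
    # here, short queries are assumed "easy" for the cheap model.
    return 0.9 if len(query.split()) < 8 else 0.2

def route(query, threshold=0.5):
    """Pick the target model before calling any, based on predicted suitability."""
    return "small" if predict_small_suffices(query) >= threshold else "large"
```

Because no answer has been observed yet, the routing decision is only as good as the predictor, which is why these methods invest heavily in training it.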
Hybrid Systems
The best production systems combine ex-ante routing (before execution) with post-hoc escalation (based on observed answer confidence).
Summary Table
| Method | Source | Signal | Unit | Role | Access |
|---|---|---|---|---|---|
| **Sequential Cascading** | | | | | |
| LM Cascades | Self | Token uncertainty quantiles | Answer | Defer | Post |
| FrugalGPT | External | Cost-aware confidence | Query/answer | Route, defer | Post, Aux |
| AutoMix | Hybrid | Self-verification | Answer | Defer | Post, Aux |
| Gatekeeper | Self | Tuned thresholds | Answer | Defer | Post, FT |
| Rational Tuning | Self | Copula confidence | Stage | Stop | Post |
| Speculative Cascading | Hybrid | Token/step confidence | Token/stage | Handoff, stop | Mid |
| **Pre-Call Routing** | | | | | |
| OptLLM | External | Suitability uncertainty | Query | Route | Pre, Aux |
| RouteLLM | External | Router prediction | Query | Route | Pre, Aux |
| Hybrid LLM | External | Quality gap prediction | Query | Route | Pre, Aux |
| CARGO | Auxiliary | Score gap + tie-break | Query | Route | Pre, Aux |
| Semantic Entropy Routing | Self | Semantic entropy | Answer | Offload | Post, MS |
| Self-REF | Self | Confidence + token prob | Answer | Route, reject | Post, FT |
| **Hybrid Systems** | | | | | |
| Unified Routing & Cascading | Hybrid | Ex-ante + post-hoc quality | Query/answer | Route, defer | Pre, Post |
| Select-Then-Route | Hybrid | Taxonomy + judge agreement | Query/answer | Shortlist, defer | Pre, Post, Aux |
Discussion
Routing systems must account for timing, signal type, and cost tradeoffs:
- Pre-call routing uses predicted suitability, which is inherently uncertain. Threshold tuning and cost-aware optimization become critical. Methods like OptLLM, RouteLLM, and Hybrid LLM all invest in training or bootstrapping better predictors.
- Post-hoc cascading uses observed answer confidence, which is more direct but comes after the cost of the first model is already incurred. It works best when the cheap model succeeds often, so that escalation is the exception rather than the rule.
- Mid-generation routing (speculative cascading) combines both: run a fast model with speculative execution, and defer mid-stream if confidence drops. This balances latency and cost.
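The mid-generation case can be sketched as follows, assuming the fast model streams (token, confidence) pairs; the draft stream and the 0.6 floor are hypothetical stand-ins for a real drafter and a tuned threshold:

```python
def draft_stream(query):
    # Stand-in for a fast drafter yielding (token, confidence) pairs.
    yield "The", 0.95
    yield "answer", 0.9
    yield "is", 0.3     # confidence collapses mid-stream
    yield "42", 0.8

def generate_with_handoff(query, floor=0.6):
    """Accept draft tokens until confidence drops, then hand off mid-stream."""
    prefix = []
    for token, conf in draft_stream(query):
        if conf < floor:
            # Defer: a larger model would continue from the accepted prefix,
            # so the work already done is not discarded.
            return prefix, "handoff"
        prefix.append(token)
    return prefix, "drafted"
```

Unlike post-hoc cascading, the handoff happens before the cheap model finishes, so only the low-confidence suffix is paid for twice.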
Best practice: Layer ex-ante pruning (filter out clearly-wrong models before execution), post-hoc checking (catch failures from chosen model), and selective escalation (deferred execution only when necessary).