How much does confidence-guided cascading between SAS and MAS improve accuracy?
This reads 'SAS' as single-agent systems and 'MAS' as multi-agent systems, with confidence-guided cascading meaning: run the cheap single-agent path first, and escalate to the expensive multi-agent path only when the model's own confidence is low. The library doesn't contain a note that measures that exact SAS-to-MAS handoff head-to-head, so the precise accuracy delta you're after isn't in the collection. What follows reconstructs the answer from the mechanism cascading shares with confidence-gated routing generally, which is where the collection is unusually rich and where the interesting signal is.
The central finding the corpus keeps returning to is that a model's own confidence is a surprisingly good gate for *when to spend more compute*. "Can simple uncertainty estimates beat complex adaptive retrieval?" is the closest structural analog to your question: calibrated token-probability uncertainty decides when to fire an expensive retrieval call, and it beats more elaborate multi-call adaptive schemes on single-hop tasks and matches them on hard multi-hop tasks, using a fraction of the LM and retriever calls. That's exactly the cascade logic (cheap-by-default, escalate-on-low-confidence), just with retrieval standing in for the multi-agent stage. "Does step-level confidence outperform global averaging for trace filtering?" sharpens it further: local, step-level confidence catches reasoning breakdowns that whole-trace averaging hides, and lets you *stop early*, so the granularity of the confidence signal, not just its presence, determines how much you save.
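The cascade logic and the local-versus-global distinction can be sketched in a few lines. This is a minimal illustration, not the method of any note in the corpus: the function names, the 0.8 threshold, the window size, and the assumption that the cheap path returns per-token log-probabilities are all hypothetical.

```python
import math
from typing import Callable, Sequence, Tuple


def mean_confidence(token_logprobs: Sequence[float]) -> float:
    """Whole-trace confidence: geometric mean of the token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def min_window_confidence(token_logprobs: Sequence[float], window: int = 4) -> float:
    """Local confidence: the lowest geometric-mean probability over any sliding window."""
    n = len(token_logprobs)
    if n <= window:
        return mean_confidence(token_logprobs)
    return min(
        mean_confidence(token_logprobs[i:i + window])
        for i in range(n - window + 1)
    )


def cascade(
    query: str,
    cheap: Callable[[str], Tuple[str, Sequence[float]]],  # hypothetical single-agent path
    expensive: Callable[[str], str],                      # hypothetical multi-agent path
    threshold: float = 0.8,                               # assumed value; tune on held-out data
    confidence: Callable[[Sequence[float]], float] = min_window_confidence,
) -> Tuple[str, str]:
    """Run the cheap path first; escalate only when its confidence is below the threshold."""
    answer, logprobs = cheap(query)
    if confidence(logprobs) >= threshold:
        return answer, "cheap"
    return expensive(query), "expensive"
```

With a trace that is mostly confident but dips sharply for a few tokens, `mean_confidence` can stay above the threshold while `min_window_confidence` falls below it, so the same query is kept on the cheap path under one signal and escalated under the other.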
The deeper question a cascade designer should ask is whether the confidence signal is trustworthy enough to route on. The corpus is split in an instructive way. On the optimistic side, "Can model confidence alone replace external answer verification?" and "Can model confidence work as a reward signal for reasoning?" show that intrinsic probability is strong enough to *replace external verifiers* as a training signal, and "Does model confidence predict robustness to prompt changes?" finds that high confidence genuinely tracks robustness. On the skeptical side, "Can pretraining data statistics detect hallucinations better than model confidence?" is the warning shot: models stay confidently wrong on entity combinations they never saw in training, so a pure-confidence gate will route those straight down the cheap path and miss them. That single note is the strongest argument that confidence-guided cascading has a blind spot a static accuracy number would paper over.
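One way to picture a gate that covers this blind spot is to escalate on either low confidence or an unfamiliar entity pairing. The sketch below is an assumption-laden illustration inspired by the data-side idea, not QuCo-RAG's actual implementation: `seen_pairs` is a hypothetical precomputed set of entity pairs that co-occur in training data, and entity extraction is assumed to happen upstream.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, Set


def should_escalate(
    confidence: float,
    entities: Iterable[str],
    seen_pairs: Set[FrozenSet[str]],  # hypothetical co-occurrence table
    threshold: float = 0.8,           # assumed value, not from the corpus
) -> bool:
    """Escalate when confidence is low OR any entity pair was never seen together."""
    if confidence < threshold:
        return True
    # High confidence is not trusted on unseen combinations, so check every pair.
    for a, b in combinations(sorted(set(entities)), 2):
        if frozenset((a, b)) not in seen_pairs:
            return True
    return False
```

The design choice here is that the data-side check can only add escalations, never remove them, so it trades some extra cost for coverage of the confidently-wrong cases.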
The other thing worth knowing, which bears directly on whether the *MAS* tier earns its cost, is that multi-agent escalation isn't free of failure. "Can agents evaluate AI outputs more reliably than language models?" reports an eight-module agentic evaluator with a 0.27% judge shift versus 31% for LLM-as-a-Judge on complex tasks (roughly a 100x reduction), but its memory module *cascaded errors* through the pipeline, revealing that multi-agent systems need explicit error isolation to keep their gains. So the accuracy you'd recover by escalating to MAS can be partly eaten by the new failure modes MAS introduces, meaning the real comparison isn't 'single vs. multi' but 'how well-isolated is the multi-agent tier you escalate into.'
The useful takeaway the corpus leaves you with: the value of confidence-guided cascading is bounded less by the cleverness of the routing than by two things it rarely controls for — whether confidence is calibrated on the inputs that actually matter (it isn't, for unseen combinations), and whether the expensive tier you escalate into is built to contain its own errors. If you want a real number, those are the two variables to hold fixed first.
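If you do run the measurement yourself, the bookkeeping is simple once each example has a confidence score plus a correctness label for both tiers. The sketch below assumes exactly that record shape (the `Record` type and its field names are hypothetical), and the four records in the usage note are made-up toy values for illustration, not results from any note.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class Record:
    confidence: float   # single-agent confidence on this example
    sas_correct: bool   # was the single-agent answer right?
    mas_correct: bool   # was the multi-agent answer right?


def cascade_stats(records: Sequence[Record], threshold: float) -> Tuple[float, float]:
    """Return (cascade accuracy, escalation rate) for a given confidence threshold."""
    if not records:
        raise ValueError("need at least one record")
    escalated = 0
    correct = 0
    for r in records:
        if r.confidence < threshold:
            escalated += 1
            correct += r.mas_correct   # escalated: the MAS answer is the final one
        else:
            correct += r.sas_correct   # kept: the SAS answer is the final one
    return correct / len(records), escalated / len(records)


# Toy records (fabricated for illustration only).
toy = [
    Record(0.95, True, True),
    Record(0.90, False, True),   # confidently wrong: the gate never escalates it
    Record(0.40, False, True),   # escalation recovers this one
    Record(0.30, True, False),   # escalation breaks a correct answer
]
print(cascade_stats(toy, 0.8))   # (0.5, 0.5)
```

On these toy values the cascade scores no better than the single-agent path alone, because one confidently wrong case is never escalated and one escalation introduces a new error. Those are the two variables named above; sweeping `threshold` from 0 upward traces the accuracy-versus-escalation curve between SAS-only and MAS-only.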
Sources (7 notes)
Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.
Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves comparable accuracy gains to naive majority voting with far fewer generated traces, proving trace quality matters more than quantity.
RLPR and INTUITOR successfully extend reinforcement learning for reasoning to general domains by using the model's own token probabilities and confidence levels as reward signals, eliminating the need for external verifiers or reference answers.
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.
QuCo-RAG uses entity co-occurrence patterns from training data to trigger retrieval, successfully flagging hallucination risk even when models are highly confident. This data-side approach catches the root cause (unseen combinations) rather than the symptom (low confidence).
Eight-module agentic evaluation achieved 0.27% judge shift versus 31% for LLM-as-a-Judge on complex tasks. However, the memory module cascaded errors, revealing that agentic systems need error isolation mechanisms to maintain gains.