Why do external feature triggers outperform uncertainty on complex questions?
This explores why, when deciding whether to retrieve, cheap features read off the question can beat measuring the model's own uncertainty, and why that edge shows up on complex questions. The corpus has a genuine head-to-head here. One line of work shows that a learned predictor using 27 lightweight external question features matches expensive uncertainty-based methods overall while costing far less, and pulls ahead specifically on complex questions (see "Can question features alone predict when to retrieve?"). The opposing line argues the reverse: calibrated token-probability uncertainty is more reliable than external heuristics, beating multi-call adaptive retrieval on single-hop tasks and matching it on multi-hop tasks with a fraction of the calls (see "Can simple uncertainty estimates beat complex adaptive retrieval?"). Read together, the tension resolves: uncertainty wins where the model's self-knowledge is well calibrated (simple, single-hop questions), and external features win where it is not (complex, multi-hop questions).
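To make the two competing triggers concrete, here is a minimal sketch of each. Everything in it is an assumption for illustration: the three features, the weights, the bias, and the 0.8 confidence threshold are hypothetical stand-ins, not the 27 features or the calibration procedure used in the cited work.

```python
import math
import re

# Hypothetical cues for extra clauses; chosen for illustration only.
CLAUSE_CUES = re.compile(r",|\b(?:and|then|who|which|that)\b")
COMPARISON_WORDS = {"compare", "versus", "vs", "than", "both"}

# Hand-set weights standing in for a learned predictor (assumption).
WEIGHTS = {"n_words": 0.1, "n_clauses": 0.8, "is_comparison": 1.0}
BIAS = -3.0


def mean_token_confidence(token_logprobs):
    """Geometric-mean token probability of a draft answer, in (0, 1]."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def uncertainty_trigger(token_logprobs, threshold=0.8):
    """Retrieve when the model's own confidence falls below the threshold."""
    return mean_token_confidence(token_logprobs) < threshold


def question_features(question):
    """Surface features read off the question text alone."""
    text = question.lower()
    words = re.findall(r"[a-z0-9']+", text)
    return {
        "n_words": len(words),
        "n_clauses": 1 + len(CLAUSE_CUES.findall(text)),
        "is_comparison": int(any(w in COMPARISON_WORDS for w in words)),
    }


def feature_trigger(question):
    """Retrieve when a linear score over the question features is positive."""
    feats = question_features(question)
    score = BIAS + sum(WEIGHTS[name] * value for name, value in feats.items())
    return score > 0.0
```

Note the difference in inputs: `uncertainty_trigger` needs the model to draft an answer first so that token log-probabilities exist, while `feature_trigger` runs on the question string before any model call.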
Why does calibration break down on hard questions? Several notes point at the same culprit: a model's confidence is a signal about *itself*, not about the question's difficulty, and that signal degrades exactly when reasoning gets long. High confidence predicts robustness on objective, simple tasks, but outputs swing widely when the model is unsure (see "Does model confidence predict robustness to prompt changes?"). Worse, reasoning models can be confidently wrong: they overthink ill-posed questions and never learn when to disengage (see "Why do reasoning models overthink ill-posed questions?"), and irrelevant text can raise their error rate by 300% without denting their certainty (see "How vulnerable are reasoning models to irrelevant text?"). So on complex questions, the uncertainty signal is measuring a quantity that has quietly stopped tracking truth. The question's own surface features (type, structure, decomposability) don't have that failure mode.
That's the deeper insight the corpus offers: external question features work because *the question's shape predicts what kind of help it needs*, independent of how the model feels about it. Non-factoid questions split into five types, each demanding a different retrieval and aggregation strategy: debate and comparison questions need aspect-specific retrieval, experience questions need decomposition (see "Does question type determine the right retrieval strategy?"). Even whether step-by-step reasoning helps at all depends on question semantics flowing through the prompt, not on task category (see "Why do some questions perform better without step-by-step reasoning?"). Complex questions are precisely the ones with rich enough structure for these features to discriminate; simple questions are nearly featureless, which is why uncertainty (cheap and adequate) wins there instead.
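The type-to-strategy idea can be sketched as a small dispatch table. The five type names and their pairings follow the source note summarized under Sources; the strategy labels, the keyword cues, and the `guess_type` helper are hypothetical, and a real system would presumably use a trained classifier instead of substring matching.

```python
from enum import Enum


class QuestionType(Enum):
    EVIDENCE = "evidence"
    DEBATE = "debate"
    COMPARISON = "comparison"
    EXPERIENCE = "experience"
    REASON = "reason"


# Pairings taken from the source note; the string labels are illustrative.
STRATEGY = {
    QuestionType.EVIDENCE: "standard_rag",
    QuestionType.DEBATE: "aspect_specific_retrieval",
    QuestionType.COMPARISON: "aspect_specific_retrieval",
    QuestionType.EXPERIENCE: "decompose_or_filter",
    QuestionType.REASON: "decompose_or_filter",
}

# Hypothetical keyword cues, checked in order (assumption, not from the source).
CUES = [
    (QuestionType.COMPARISON, ("compare", "versus", "difference between")),
    (QuestionType.DEBATE, ("should ", "is it ethical", "pros and cons")),
    (QuestionType.EXPERIENCE, ("recommend", "in your experience", "has anyone")),
    (QuestionType.REASON, ("why ",)),
]


def guess_type(question):
    """Toy classifier: first matching cue wins, evidence-based is the default."""
    text = question.lower()
    for qtype, cues in CUES:
        if any(cue in text for cue in cues):
            return qtype
    return QuestionType.EVIDENCE


def plan_retrieval(question):
    """Pick a retrieval strategy from the question's shape alone."""
    return STRATEGY[guess_type(question)]
```

The point of the table is that the strategy is chosen before the model produces anything, so it cannot be corrupted by a miscalibrated confidence score.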
The lateral payoff: this isn't really a contest between two retrieval triggers. It's about where the *decision signal* should live. DeepRAG frames the retrieve-or-not choice as a per-step Markov Decision Process the model learns, improving by roughly 22% by switching knowledge sources selectively (see "When should language models retrieve external knowledge versus use internal knowledge?"), and the broader RAG synthesis insists retrieval must adapt dynamically and couple tightly to reasoning rather than follow fixed rules (see "How should systems retrieve and reason with external knowledge?"). Question features and uncertainty are two cheap proxies for a decision that, done fully, wants to be learned and step-wise. The practical takeaway is a routing rule, not a winner: lean on the model's confidence when questions are simple and it's calibrated; lean on the question's external features when complexity has corroded that calibration.
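The routing rule above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `route` function, its argument names, and both thresholds are hypothetical, and how `is_complex`, `model_confidence`, and `feature_score` get computed is left to whatever upstream components a system already has.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingDecision:
    retrieve: bool  # whether to call the retriever
    signal: str     # which proxy made the call


def route(is_complex, model_confidence, feature_score,
          conf_threshold=0.8, feature_threshold=0.5):
    """Pick the decision signal by question complexity, then apply it.

    Simple questions trust the model's (assumed calibrated) confidence;
    complex questions trust a score derived from question features.
    Both thresholds are illustrative defaults, not tuned values.
    """
    if is_complex:
        return RoutingDecision(feature_score >= feature_threshold,
                               "question_features")
    return RoutingDecision(model_confidence < conf_threshold,
                           "model_confidence")
```

For example, `route(False, 0.95, 0.9)` skips retrieval because the question is simple and the model is confident, while `route(True, 0.95, 0.9)` retrieves because the question is complex and the feature score crosses its threshold, regardless of the model's confidence.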
Sources (9 notes)
Learned predictors using 27 lightweight external question features match complex uncertainty-based methods on overall performance while costing far less, and outperform them on complex questions across 6 QA datasets.
Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.
ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.
Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.
Appending semantically unrelated sentences to math problems significantly increases error rates in reasoning models. These query-agnostic triggers discovered on cheaper models transfer effectively to stronger models and also inflate response length.
Research shows non-factoid questions split into five types, each requiring different retrieval and aggregation methods. Evidence-based questions suit standard RAG, while debate and comparison need aspect-specific retrieval, and experience/reason questions need decomposition or filtering strategies.
Saliency analysis reveals that CoT prompting fails when question information doesn't aggregate into the prompt structure before reasoning begins. For simple questions, direct question-to-answer flow outperforms step-by-step reasoning, showing the optimal prompt depends on question type, not just task category.
DeepRAG models each reasoning step as a Markov Decision Process where the model learns when to retrieve versus rely on parametric knowledge. The 21.99% improvement comes from better-targeted retrieval and elimination of noise from unnecessary external knowledge.
Research shows retrieval should adapt dynamically rather than follow fixed patterns, reasoning and retrieval must integrate closely, and embedding-based retrieval has fundamental limits requiring architectural alternatives.