Should RAG systems use model confidence or data rarity to trigger retrieval?
Internal uncertainty and pretraining-data rarity signals catch different failure modes in RAG. This note explores whether one signal suffices or whether both are needed to prevent hallucination across failure types.
Two competing answers exist for "when should a RAG system trigger retrieval?" FLARE answers: when model confidence drops below a threshold during generation. QuCo-RAG answers: when the entities or claims in the query are rare in pretraining data. The papers frame these as alternative mechanisms. They are not alternatives: the signals catch nearly orthogonal failure modes, and the right design combines them.
Internal uncertainty (FLARE-style) catches cases where the model recognizes its own ignorance: low log-probability over generated tokens, high entropy over candidate continuations, semantic drift mid-generation. The model knows it does not know, and the trigger fires. This works well when errors correlate with uncertainty, as when the model paraphrases common knowledge, summarizes seen content, or generates in well-trodden territory. It fails when the model is confidently wrong: pretraining bias produces high-confidence outputs about rare entities that the model has seen too few times to be well calibrated about. Calibration error is precisely the regime where internal uncertainty is silent.
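A minimal sketch of an internal-uncertainty trigger, assuming the decoder exposes per-token log-probabilities and a probability distribution over candidate tokens at each step. The function names and both thresholds are hypothetical, and this illustrates the general idea rather than FLARE's exact procedure.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one candidate-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def uncertainty_trigger(
    token_logprobs: Sequence[float],
    candidate_dists: Sequence[Sequence[float]],
    min_prob: float = 0.4,     # hypothetical threshold on the weakest token
    max_entropy: float = 1.0,  # hypothetical threshold, in nats
) -> bool:
    """Fire when any generated token is improbable or any step is high-entropy."""
    if not token_logprobs:
        return False
    lowest_prob = math.exp(min(token_logprobs))
    highest_entropy = max((token_entropy(d) for d in candidate_dists), default=0.0)
    return lowest_prob < min_prob or highest_entropy > max_entropy
```

Note that a confidently wrong span produces high token probabilities and low entropy, so this function returns `False` on exactly the failure described above.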
External rarity (QuCo-style) catches cases where the model has no business being confident: query entities that co-occurred fewer than k times in pretraining, claims about specific quantities or dates that are easily fabricated, named entities outside the model's training distribution. The signal is computed from the corpus, not from the model's state, so it works precisely where calibration has failed. It fails when the model is uncertain about common knowledge — a stylistic ambiguity, an in-context contradiction, a multi-step inference that compounds error. Pretraining frequency says "you should know this" while the model in fact does not.
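A minimal sketch of an external-rarity trigger, assuming entity pairs can be looked up in a precomputed co-occurrence table. The table, the function name, and the default `k` are hypothetical; a real system would query a corpus index, and this is not presented as QuCo-RAG's implementation.

```python
from itertools import combinations
from typing import FrozenSet, Mapping, Sequence


def rarity_trigger(
    entities: Sequence[str],
    cooccurrence_counts: Mapping[FrozenSet[str], int],
    k: int = 5,  # hypothetical minimum co-occurrence count
) -> bool:
    """Fire when any pair of query entities co-occurred fewer than k times."""
    for a, b in combinations(sorted(set(entities)), 2):
        # Pairs missing from the table are treated as never seen together.
        if cooccurrence_counts.get(frozenset((a, b)), 0) < k:
            return True
    return False
```

The decision depends only on corpus statistics, never on the model's state. A query with a single entity yields no pairs and never fires here; a unigram frequency check would be needed for that case.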
The two signals are nearly orthogonal. FLARE catches known unknowns; QuCo catches unknown unknowns. A retrieval policy using only one will systematically underfire on the failures the other catches. The composite policy is straightforward: trigger if either signal exceeds its threshold, so that the union covers the calibration gap that single-signal policies leave open. The framing also explains why fixed-interval retrieval (e.g., retrieve every n tokens) underperforms both: fixed intervals spend retrieval budget on confident, correct generation and can miss the points where either signal would have fired, because the schedule consults neither.
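The composite policy can be sketched as a logical OR over the two channels. This assumes both signals have already been reduced to scores where higher means more risk; the class name, score scale, and thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriggerPolicy:
    uncertainty_threshold: float  # hypothetical, on a 0-1 risk scale
    rarity_threshold: float       # hypothetical, on a 0-1 risk scale

    def should_retrieve(self, uncertainty: float, rarity: float) -> bool:
        """Retrieve if either channel exceeds its own threshold."""
        return (
            uncertainty > self.uncertainty_threshold
            or rarity > self.rarity_threshold
        )


policy = TriggerPolicy(uncertainty_threshold=0.5, rarity_threshold=0.5)
cases = {
    "confident, common": (0.1, 0.1),
    "uncertain, common": (0.8, 0.1),  # caught only by the uncertainty channel
    "confident, rare": (0.1, 0.9),    # caught only by the rarity channel
    "uncertain, rare": (0.8, 0.9),
}
for name, (u, r) in cases.items():
    print(f"{name}: retrieve={policy.should_retrieve(u, r)}")
```

Only the first case skips retrieval. An uncertainty-only policy would miss the third case, and a rarity-only policy would miss the second.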
The implication for RAG architecture: retrieval triggering is not a single-signal classification problem but a dual-channel calibration problem, and the channels measure different things. Building either channel without the other leaves a known failure surface uncovered.
Inquiring lines that use this note as a source
This note is a source for these synthesized inquiries.
- What makes reranking during retrieval better than catching failures at plan time?
- What makes diverse failure modes more informative than single failure examples?
- How should enterprises choose between graph and vector approaches for RAG?
- How does training frequency distribution shape what models reliably retrieve?
- What makes process-level supervision better than outcome-only rewards for RAG training?
- How do RAG and prompting techniques differ in supporting each granularity level?
- Why do rare cases in medicine and science require models that preserve tail distributions?
- Why do RAG systems fail when demo queries work correctly?
- Should retrieval be triggered by model uncertainty or fixed intervals?
- How does response content compare to model confidence as a retrieval trigger?
- Does statistical rarity actually correlate with originality that law should protect?
- Why does model confidence fail to detect hallucinations on rare entity pairs?
- Why does model confidence fail to detect hallucinations about rare entities?
- What threshold combinations for uncertainty and rarity signals maximize RAG performance?
- What five requirements do enterprise RAG systems need beyond accuracy?
- Can adaptive retrieval triggered by model uncertainty improve RAG reliability?
- How should retrieval triggers use model uncertainty instead of fixed intervals?
- What concrete failures happen when RAG ignores temporal relevance?
Related concepts in this collection
- When should retrieval happen during model generation?
  Explores whether retrieval should occur continuously, at fixed intervals, or only when the model signals uncertainty. Standard RAG retrieves once; long-form generation requires dynamic triggering based on confidence signals.
  the FLARE-style internal-uncertainty channel; necessary but not sufficient
- Can pretraining data statistics detect hallucinations better than model confidence?
  Explores whether checking whether entity combinations appeared in training data is a more reliable hallucination signal than measuring the model's own confidence levels, especially for catching confidently-wrong outputs.
  the QuCo-style external-rarity channel; necessary but not sufficient
- Can RAG systems refuse to answer without reliable evidence?
  Explores whether retrieval-augmented generation can be designed to abstain from answering when sources are corrupted or insufficient, rather than filling gaps with plausible-sounding guesses. This matters for historical text where OCR errors and language drift are common.
  downstream: once retrieval fires, generation must be evidence-conditional; the trigger is upstream of refusal behavior
- Can smaller models handle RAG filtering while larger models focus on synthesis?
  Does splitting RAG pipeline work between cheaper small models and expensive large models improve both cost and quality? The question asks whether different pipeline stages have different optimal model sizes.
  composes with the dual-trigger: tier the retrieval decision (cheap rarity check first, then expensive uncertainty probe)
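The tiering mentioned in the last item can be sketched with short-circuit evaluation, assuming each channel is wrapped in a zero-argument callable so the expensive probe runs only when the cheap check does not fire. The function and parameter names are hypothetical.

```python
from typing import Callable


def tiered_trigger(
    rarity_check: Callable[[], bool],       # cheap: corpus-statistics lookup
    uncertainty_probe: Callable[[], bool],  # expensive: needs a model pass
) -> bool:
    """Run the cheap check first; probe the model only if it did not fire."""
    # `or` short-circuits, so uncertainty_probe is skipped when rarity_check is True.
    return rarity_check() or uncertainty_probe()
```

The result is the same as the either-signal policy; only the cost profile changes, since queries flagged as rare never pay for the uncertainty probe.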
Original note title
retrieval triggers should combine internal-uncertainty signals with external-rarity signals — model confidence misses pretraining-frequency hallucination risk and vice versa