Should RAG systems use model confidence or data rarity to trigger retrieval?

Internal uncertainty and pretraining-data rarity signals catch different failure modes in RAG. This note explores whether one signal suffices or whether both are needed to prevent hallucination across failure types.

Synthesis note · 2026-05-03

Two competing answers exist to the question of when a RAG system should trigger retrieval. FLARE answers: when model confidence drops below a threshold during generation. QuCo-RAG answers: when the entities or claims in the query are rare in pretraining data. The papers frame these as alternative mechanisms. They are not: the signals catch nearly orthogonal failure modes, and the right design combines them.

Internal uncertainty (FLARE-style) catches cases where the model recognizes its own ignorance: low log-probability over generated tokens, high entropy over candidate continuations, semantic drift mid-generation. The model knows it does not know, and the trigger fires. This works well when errors correlate with uncertainty: paraphrasing common knowledge, summarizing seen content, generating in well-trodden territory. It fails when the model is confidently wrong: pretraining bias produces high-confidence outputs about rare entities the model has seen too seldom to be well calibrated on. Calibration error is precisely the regime where internal uncertainty is silent.
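A minimal sketch of an internal-uncertainty trigger in the spirit described above, assuming the generator exposes per-token log-probabilities and next-token distributions. The function names and both thresholds are hypothetical, chosen for illustration; this is not FLARE's published implementation.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def uncertainty_trigger(
    chosen_logprobs: Sequence[float],
    candidate_dists: Sequence[Sequence[float]],
    min_prob: float = 0.4,     # hypothetical threshold on token probability
    max_entropy: float = 1.0,  # hypothetical threshold on entropy, in nats
) -> bool:
    """Fire if any generated token is low-probability or high-entropy."""
    low_confidence = any(math.exp(lp) < min_prob for lp in chosen_logprobs)
    high_entropy = any(token_entropy(d) > max_entropy for d in candidate_dists)
    return low_confidence or high_entropy


# Confident generation: the trigger stays silent, whether or not the text is correct.
confident = uncertainty_trigger(
    [math.log(0.9), math.log(0.95)], [[0.9, 0.1], [0.95, 0.05]]
)
# Diffuse distribution over four candidates: the trigger fires.
uncertain = uncertainty_trigger([math.log(0.2)], [[0.25, 0.25, 0.25, 0.25]])
```

The first call illustrates the blind spot: the check sees only the model's own probabilities, so a confidently wrong continuation looks identical to a confidently right one.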

External rarity (QuCo-style) catches cases where the model has no business being confident: query entities that co-occurred fewer than k times in pretraining, claims about specific quantities or dates that are easily fabricated, named entities outside the model's training distribution. The signal is computed from the corpus, not from the model's state, so it works precisely where calibration has failed. It fails when the model is uncertain about common knowledge: a stylistic ambiguity, an in-context contradiction, a multi-step inference that compounds error. In these cases pretraining frequency says "you should know this" while the model in fact does not.
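The co-occurrence check can be sketched as a lookup against precomputed corpus statistics. This assumes a hypothetical in-memory table mapping unordered entity pairs to pretraining co-occurrence counts; how QuCo-RAG actually extracts entities and queries its corpus index is not specified here, so treat the names and the default k as placeholders.

```python
from itertools import combinations
from typing import Iterable, Mapping


def rarity_trigger(
    entities: Iterable[str],
    cooccurrence_counts: Mapping[frozenset, int],  # hypothetical corpus statistics
    k: int = 5,                                    # hypothetical rarity threshold
) -> bool:
    """Fire if any pair of query entities co-occurred fewer than k times."""
    unique = sorted(set(entities))
    for a, b in combinations(unique, 2):
        # Pairs missing from the table are treated as never seen together.
        if cooccurrence_counts.get(frozenset((a, b)), 0) < k:
            return True
    return False


counts = {frozenset(("Paris", "France")): 10_000}
common = rarity_trigger(["Paris", "France"], counts)  # well-attested pair
rare = rarity_trigger(["Paris", "Zorblax"], counts)   # pair absent from the table
```

Nothing in this check consults the model, which is why it can fire on confidently wrong outputs and also why it stays silent when a frequently seen pair appears in a question the model still cannot answer.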

The two signals are nearly orthogonal. FLARE catches known unknowns; QuCo catches unknown unknowns. A retrieval policy using only one will systematically underfire on the failures the other catches. The composite policy is straightforward: trigger if either signal exceeds its threshold, so that the union covers the calibration gap that single-signal policies leave open. The framing also explains why fixed-interval retrieval (e.g., retrieve every n tokens) underperforms both: a fixed schedule tracks neither signal, so it wastes retrieval budget on confident, correct generation and misses the failures that fall between scheduled retrievals.
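The composite policy reduces to a logical OR over the two channels. The sketch below assumes each channel has already been summarized as a single number (the lowest token probability in the draft, and the lowest co-occurrence count among query entity pairs); the class name, field names, and thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriggerSignals:
    min_token_prob: float   # internal channel: lowest token probability in the draft
    min_cooccurrence: int   # external channel: lowest entity-pair count in the corpus


def should_retrieve(
    signals: TriggerSignals,
    prob_threshold: float = 0.4,  # hypothetical
    k: int = 5,                   # hypothetical
) -> bool:
    """Retrieve if either channel crosses its threshold."""
    uncertain = signals.min_token_prob < prob_threshold
    rare = signals.min_cooccurrence < k
    return uncertain or rare


# The four quadrants of the two-channel policy.
print(should_retrieve(TriggerSignals(0.9, 10_000)))  # False: confident and common
print(should_retrieve(TriggerSignals(0.2, 10_000)))  # True: known unknown
print(should_retrieve(TriggerSignals(0.9, 0)))       # True: unknown unknown
print(should_retrieve(TriggerSignals(0.2, 0)))       # True: both channels fire
```

Only the first quadrant skips retrieval. A single-signal policy would also skip one of the middle two, which is the gap the union is meant to close.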

The implication for RAG architecture: retrieval triggering is not a single-signal classification problem but a dual-channel calibration problem, and the channels measure different things. Building either channel without the other leaves a known failure surface uncovered.

Related concepts in this collection 4

When should retrieval happen during model generation? Explores whether retrieval should occur continuously, at fixed intervals, or only when the model signals uncertainty. Standard RAG retrieves once; long-form generation requires dynamic triggering based on confidence signals.
the FLARE-style internal-uncertainty channel; necessary but not sufficient

Can pretraining data statistics detect hallucinations better than model confidence? Explores whether checking whether entity combinations appeared in training data is a more reliable hallucination signal than measuring the model's own confidence levels, especially for catching confidently-wrong outputs.
the QuCo-style external-rarity channel; necessary but not sufficient

Can RAG systems refuse to answer without reliable evidence? Explores whether retrieval-augmented generation can be designed to abstain from answering when sources are corrupted or insufficient, rather than filling gaps with plausible-sounding guesses. This matters for historical text where OCR errors and language drift are common.
downstream: once retrieval fires, generation must be evidence-conditional; the trigger is upstream of refusal behavior

Can smaller models handle RAG filtering while larger models focus on synthesis? Does splitting RAG pipeline work between cheaper small models and expensive large models improve both cost and quality? The question asks whether different pipeline stages have different optimal model sizes.
composes with the dual-trigger: tier the retrieval decision (cheap rarity check first, then expensive uncertainty probe)
