Does filtering passages before generation improve large model answer quality?

This note explores whether screening retrieved passages (keeping only grounded, relevant ones before the model writes) actually produces better answers, or whether more context is simply better.

The question is read here as being about the gate between retrieval and generation: does deciding what the model is allowed to see (and what it's allowed to say) improve the answer, versus just stuffing everything in? The corpus answers with a fairly strong 'yes, filtering helps', but the more interesting finding is *why* it helps, and it's not the reason you'd guess.

The intuitive case for filtering is noise removal, and the corpus has a sharp example. A multilingual RAG system reading badly OCR'd historical newspapers wins not by retrieving better, but by *constraining generation to only grounded answers* and refusing when the evidence is too degraded (see "Can RAG systems refuse to answer without reliable evidence?"). The filter there is a refusal gate: it trades coverage for integrity, and hallucination drops. A related idea pushes the gate to the other end of the pipeline: a generated answer is only let back into the retrieval corpus if it passes entailment, source-attribution, and novelty checks, so bad passages never accumulate in the first place (see "Can RAG systems safely learn from their own generated answers?").
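A refusal gate of this kind can be sketched in a few lines. The note describes a prompt-based grounded refusal; the sketch below swaps in a simpler score threshold to show the same trade of coverage for integrity. The `Passage` type, its `score` field, the `min_score` cutoff, and the `generate` callback are all hypothetical names chosen for illustration, not part of any cited system.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

REFUSAL = "I cannot answer from the available evidence."

@dataclass
class Passage:
    text: str
    score: float  # assumed grounding/relevance score in [0, 1]

def gate_passages(passages: List[Passage], min_score: float = 0.5,
                  min_kept: int = 1) -> Optional[List[Passage]]:
    """Keep passages scoring at least min_score; None signals refusal."""
    kept = [p for p in passages if p.score >= min_score]
    if len(kept) < min_kept:
        return None  # too little reliable evidence survives the filter
    return kept

def answer_or_refuse(question: str, passages: List[Passage],
                     generate: Callable[[str, List[Passage]], str],
                     min_score: float = 0.5) -> str:
    """Generate only from passages that pass the gate; otherwise refuse."""
    kept = gate_passages(passages, min_score)
    if kept is None:
        return REFUSAL
    return generate(question, kept)
```

Raising `min_score` makes the system refuse more often and answer from cleaner evidence; where to set it is an empirical choice this sketch does not settle.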

But here's the part you might not expect: filtering helps even when the passages aren't noisy, because *more context actively degrades reasoning.* The FLenQA study shows reasoning accuracy falling from 92% to 68% with just 3,000 tokens of padding. That is far below any context limit, and the drop is task-agnostic and persists under chain-of-thought prompting (see "Does reasoning ability actually degrade with longer inputs?"). So irrelevant-but-harmless passages aren't neutral; they cost you. This reframes pre-generation filtering as a reasoning-preservation move, not just a cleanliness move. The same logic appears in agent search, where capping how much the model reasons *per turn* preserves the context budget it needs to actually use new evidence (see "Does limiting reasoning per turn improve multi-turn search quality?").
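One way to act on this is to treat context as a budget and spend it only on passages that clear a relevance bar. The following is a minimal sketch, assuming each passage arrives with a precomputed relevance score and that whitespace-separated words are an acceptable stand-in for tokens. The function name, the `min_relevance` cutoff, and the greedy strategy are illustrative assumptions, not a method taken from the cited notes.

```python
from typing import List, Tuple

def select_within_budget(passages: List[Tuple[str, float]],
                         budget_tokens: int,
                         min_relevance: float = 0.5) -> Tuple[List[str], int]:
    """Greedily keep the most relevant passages that fit the budget.

    passages: (text, relevance) pairs; relevance is an assumed score.
    Token cost is approximated by whitespace word count.
    """
    chosen: List[str] = []
    used = 0
    for text, relevance in sorted(passages, key=lambda p: p[1], reverse=True):
        if relevance < min_relevance:
            break  # everything after this is less relevant still
        cost = len(text.split())
        if used + cost <= budget_tokens:
            chosen.append(text)
            used += cost
    return chosen, used
```

The point of the cutoff is that leftover budget is not filled with weak passages just because they fit: under the finding above, unused context is cheaper than padding.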

There's a limit worth knowing. Long-context models can sometimes absorb the filtering job themselves: on the LOFT benchmark they match RAG on semantic retrieval without explicit training. But they collapse on structured, relational queries that need joins across tables (see "Can long-context LLMs replace retrieval-augmented generation systems?"). So 'just give the big model everything' works for fuzzy lookup and fails exactly where a disciplined retrieval-and-filter step would have helped most.

The sharpest twist: filtering shouldn't be only a one-shot pre-generation step. ITER-RETGEN shows that the model's own partial answer reveals information gaps the original query couldn't express, so you generate a little, use that draft to re-filter and re-retrieve, then continue (see "Can a model's partial response guide what to retrieve next?"). The best 'filter' isn't a static screen before generation; it's a loop where generation tells you what to keep next.
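The loop can be sketched as follows. This is a toy illustration in the spirit of ITER-RETGEN, not its actual implementation: retrieval here is plain word overlap, and `generate` is a caller-supplied stand-in for the model. All function names and parameters are hypothetical.

```python
from typing import Callable, List, Tuple

def overlap(query: str, passage: str) -> int:
    """Toy relevance: number of distinct lowercase words shared."""
    return len(set(query.lower().split()) & set(passage.lower().split()))

def retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Return up to k passages with non-zero overlap, best first."""
    ranked = sorted(corpus, key=lambda p: overlap(query, p), reverse=True)
    return [p for p in ranked[:k] if overlap(query, p) > 0]

def iterative_retrieve_generate(
    question: str,
    corpus: List[str],
    generate: Callable[[str, List[str]], str],
    rounds: int = 2,
    k: int = 2,
) -> Tuple[str, List[str]]:
    """Alternate retrieval and generation, feeding each draft back in."""
    draft = ""
    passages: List[str] = []
    for _ in range(rounds):
        # The previous draft is appended to the question, so gaps it
        # exposes can pull in passages the bare question would miss.
        query = f"{question} {draft}".strip()
        passages = retrieve(query, corpus, k)
        draft = generate(question, passages)
    return draft, passages
```

With a corpus containing "marie curie was born in warsaw" and "warsaw is the capital of poland", the question "marie curie birthplace country" matches only the first passage in round one. Once the draft mentions Warsaw, round two also retrieves the second passage, which is the gap-surfacing behaviour the paragraph describes.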

Sources 6 notes

Can RAG systems refuse to answer without reliable evidence?

A multilingual RAG system for noisy historical newspapers succeeds by aggressively expanding retrieval while constraining generation to only grounded answers. The grounded-refusal prompt prevents hallucination when OCR errors and language drift degrade source quality, trading coverage for integrity.

Can RAG systems safely learn from their own generated answers?

Systems can add generated answers to their retrieval corpus when outputs pass entailment verification, source attribution checks, and novelty detection. This prevents hallucinations from polluting future retrievals while allowing genuine knowledge accumulation.

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.

Does limiting reasoning per turn improve multi-turn search quality?

Unrestricted reasoning within single search turns consumes context needed for subsequent retrieval rounds, degrading the agent's ability to incorporate new evidence. Setting per-turn reasoning budgets, not just overall time limits, prevents this context erosion and maintains search quality across iterations.

Can long-context LLMs replace retrieval-augmented generation systems?

The LOFT benchmark shows long-context language models (LCLMs) match RAG on semantic retrieval without explicit training, but cannot execute relational queries requiring joins across structured tables. Context length alone cannot bridge this gap.

Can a model's partial response guide what to retrieve next?

ITER-RETGEN shows that iteratively using generated responses as retrieval queries substantially improves performance on multi-hop reasoning and fact verification. Generation acts as both answer producer and information-need clarifier, surfacing implicit gaps that the original query missed.
