Could eliminating retrieval entirely work better than shifting the burden?
This explores whether *getting rid of retrieval altogether* — folding what RAG fetches into the model itself — could beat the popular fix of just moving the work from the retriever to a long-context reader.
The clearest version of "shift the burden" is LongRAG, which argues that the old heavy-retriever/light-reader split was a historical artifact of small context windows: feed coarse 4K-token chunks to a long-context reader and let deep reading do the work that precise retrieval used to do (Can long-context models resolve retriever-reader imbalance?). Retrieval doesn't vanish here; it gets cheaper and sloppier, and the reader absorbs the cost.
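To make the coarse-unit idea concrete, here is a minimal sketch of packing short passages into reader-sized units under a token budget. The function name `pack_into_units`, the whitespace-based token counter, and the greedy packing strategy are all assumptions for illustration; the notes here only say that the units are roughly 4K tokens, not how they are built.

```python
from typing import Callable, List, Sequence


def pack_into_units(
    passages: Sequence[str],
    max_tokens: int = 4096,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Greedily merge consecutive passages into units of at most max_tokens.

    Assumption: whitespace splitting stands in for a real tokenizer.
    A single passage larger than the budget becomes its own oversized unit.
    """
    units: List[str] = []
    current: List[str] = []
    used = 0
    for passage in passages:
        n = count_tokens(passage)
        # Close the current unit if adding this passage would exceed the budget.
        if current and used + n > max_tokens:
            units.append("\n\n".join(current))
            current, used = [], 0
        current.append(passage)
        used += n
    if current:
        units.append("\n\n".join(current))
    return units
```

With fewer, larger units, the retriever only has to rank coarsely, and the long-context reader is left to locate the answer inside each unit.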
The genuinely retrieval-free alternative is compressive memory: COMEDY drops the vector database entirely and folds memory generation, compression, and response into a single model pass, tracking event recaps and user portraits instead of fetching them (Can a single model replace retrieval for long-term conversation memory?). But the answer to the literal question is sobering: the same note reports that continuous reprocessing follows an *inverted-U curve*, where performance first improves with reprocessing, then declines, and can fall below a no-memory baseline through misgrouping, context loss, and overfitting. Eliminating retrieval doesn't eliminate the failure mode; it relocates it from "fetched the wrong chunk" to "compressed away the thing you needed."
The more interesting lesson from the corpus is that "shift the burden" and "eliminate retrieval" aren't the only two options, and that the burden may be in the wrong place to begin with. Several notes suggest the real win is the model deciding *whether to retrieve at all*. DeepRAG frames each reasoning step as a choice between external lookup and internal parametric knowledge, and its roughly 22% accuracy gain comes largely from *not* retrieving when external text would only add noise (When should language models retrieve external knowledge versus use internal knowledge?). Uncertainty estimation pushes the same idea further: a calibrated read of the model's own token probabilities decides when to reach outside, beating complex adaptive-retrieval pipelines on single-hop tasks and matching them on multi-hop, with a fraction of the LM and retriever calls (Can simple uncertainty estimates beat complex adaptive retrieval?). And proactive tool retrieval lets the model emit its own structured requests rather than have a passive matcher guess for it (Can models decide better than retrievers which tools to use?).
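The uncertainty-gated idea can be sketched in a few lines. This is a minimal illustration, not the method from any of the cited notes: the function names, the callback signatures (`draft`, `retrieve`, `answer_with_context`), and the `0.7` threshold are all hypothetical, and the confidence score used here (geometric-mean token probability) is only one simple choice. In practice the threshold would need to be calibrated on held-out data.

```python
import math
from typing import Callable, List, Sequence, Tuple


def mean_token_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean token probability of a draft answer, in [0, 1]."""
    if not token_logprobs:
        return 0.0  # no tokens: treat as maximally uncertain
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def answer_with_gate(
    question: str,
    draft: Callable[[str], Tuple[str, List[float]]],
    retrieve: Callable[[str], List[str]],
    answer_with_context: Callable[[str, List[str]], str],
    threshold: float = 0.7,  # hypothetical value, not taken from the notes
) -> Tuple[str, bool]:
    """Return (answer, retrieved_flag).

    The model first drafts an answer from parametric knowledge. Retrieval
    runs only when the draft's token-level confidence is below threshold.
    """
    text, logprobs = draft(question)
    if mean_token_confidence(logprobs) >= threshold:
        return text, False  # confident enough: skip retrieval entirely
    passages = retrieve(question)
    return answer_with_context(question, passages), True
```

The design point is that the gate costs one extra read of probabilities the model already produced, so the retriever is called only for the questions that seem to need it.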
There's also a quieter route: make retrieval better instead of bigger or absent. Fine-tuning the retriever on implicit queries matches augmented pipelines without expanding input length, because the model learns to resolve ambiguity through training rather than at query time (Can fine-tuning replace query augmentation for retrieval?). You can even adapt a retriever to a new domain from a short text description alone, with no target data (Can you adapt retrieval models without accessing target data?). That reframes the whole question: the binary of "keep retrieval and overload the reader" vs. "kill retrieval" assumes retrieval is the bottleneck. The corpus keeps suggesting it's the *control logic* (when, whether, and how confidently to retrieve) that decides the outcome (Does supervising retrieval steps outperform final answer rewards?).
So: eliminating retrieval can work, but only where the knowledge is small and stable enough to compress safely (conversational memory, user models), and even there the inverted-U warns you that more reprocessing eventually hurts. For open, drifting, or noisy knowledge, the better answers in this collection are not to remove or relocate the burden but to *gate* it intelligently, and to let the model abstain when grounding is weak (Can RAG systems refuse to answer without reliable evidence?).
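As one way to picture abstention under weak grounding, here is a minimal sketch of a grounded-refusal prompt. The prompt wording, the `REFUSAL` string, and both function names are hypothetical; the cited note describes constraining generation to grounded answers but does not give its prompt text.

```python
from typing import Sequence

# Hypothetical refusal sentence; the note does not specify one.
REFUSAL = "I cannot answer this from the provided sources."


def build_grounded_prompt(question: str, passages: Sequence[str]) -> str:
    """Build a prompt that restricts the model to the numbered sources."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the numbered sources below.\n"
        f"If the sources do not contain the answer, reply exactly: {REFUSAL}\n\n"
        f"Sources:\n{numbered}\n\n"
        f"Question: {question}\n"
    )


def is_refusal(reply: str) -> bool:
    """Detect the fixed refusal sentence so callers can handle abstention."""
    return reply.strip() == REFUSAL
```

A fixed refusal string makes abstentions easy to count, which matters if the trade of coverage for integrity is to be measured rather than assumed.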
Sources (9 notes)
LongRAG shows that 4K-token units and long-context readers outperform 100-word retrieval on standard benchmarks. The optimal RAG design shifts from precise retrieval to coarse ranking plus deep reading as context windows expand.
COMEDY merges memory generation, compression, and response into one operation, tracking event recaps, user portraits, and relationship dynamics without vector-DB retrieval. However, empirical work shows continuous reprocessing follows an inverted-U curve, degrading below a no-memory baseline due to misgrouping, context loss, and overfitting.
DeepRAG models each reasoning step as a Markov Decision Process where the model learns when to retrieve versus rely on parametric knowledge. The 21.99% improvement comes from better-targeted retrieval and elimination of noise from unnecessary external knowledge.
Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.
MCP-Zero shows that letting models emit structured tool requests iteratively across conversations outperforms single-round semantic matching. The model can refine requirements progressively across domains as reasoning unfolds, bypassing colloquial-to-formal vocabulary mismatch.
Fine-tuned semantic search models trained on implicit queries match the performance of augmented pretrained retrievers without expanding input length. The model learns to resolve ambiguity through training rather than requiring explicit augmentation.
Research demonstrates that a brief textual domain description suffices to generate synthetic training data for retrieval fine-tuning, outperforming baselines in zero-target-access scenarios and enabling adaptation where conventional methods are blocked.
Fine-grained feedback on intermediate retrieval steps significantly boosts agentic RAG performance compared to final-answer-only rewards. DPO trained with both positive and negative step feedback outperforms PPO and single-direction training by directly contrasting good and bad retrieval chains.
A multilingual RAG system for noisy historical newspapers succeeds by aggressively expanding retrieval while constraining generation to only grounded answers. The grounded-refusal prompt prevents hallucination when OCR errors and language drift degrade source quality, trading coverage for integrity.