CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning

Paper · arXiv 2511.18659 · Published November 24, 2025

Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge but still suffers from long contexts and disjoint retrieval–generation optimization. In this work, we propose CLaRa (Continuous Latent Reasoning), a unified framework that performs embedding-based compression and joint optimization in a shared continuous space. To obtain semantically rich and retrievable compressed vectors, we introduce SCP, a key-preserving data synthesis framework using QA and paraphrase supervision. CLaRa then trains the reranker and generator end-to-end via a single language modeling loss, with gradients flowing through both modules using a differentiable top-k estimator. Theoretically, this unified optimization aligns retrieval relevance with answer quality. Experiments across multiple QA benchmarks show that CLaRa achieves state-of-the-art compression and reranking performance, often surpassing text-based fine-tuned baselines.
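The abstract does not spell out the differentiable top-k estimator, so the following is only a minimal sketch of one common choice, a straight-through estimator, and is not taken from the paper. The forward pass selects documents with a hard top-k mask, while the backward pass borrows the Jacobian of a temperature-scaled softmax over the same scores. The function names, the temperature `tau`, and the toy scores and gradients are all assumptions made for illustration.

```python
import math

def softmax(xs, tau=1.0):
    """Numerically stable softmax of xs / tau."""
    m = max(xs)
    exps = [math.exp((x - m) / tau) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def topk_forward(scores, k, tau=0.5):
    """Forward pass: hard k-hot mask, plus soft weights kept only for gradients."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    hard = [1.0 if i in top else 0.0 for i in range(len(scores))]
    soft = softmax(scores, tau)
    return hard, soft

def topk_backward(soft, grad_mask, tau=0.5):
    """Backward pass (straight-through assumption): pretend the hard mask
    has the softmax Jacobian, so
    grad_scores[j] = soft[j] * (grad_mask[j] - sum_i grad_mask[i] * soft[i]) / tau
    """
    dot = sum(g * p for g, p in zip(grad_mask, soft))
    return [p * (g - dot) / tau for p, g in zip(soft, grad_mask)]

# Hypothetical reranker scores for four compressed documents.
scores = [0.9, 0.1, 0.8, 0.3]
hard, soft = topk_forward(scores, k=2)

# Hypothetical gradient of the language modeling loss w.r.t. the mask:
# a negative entry means that selecting the document lowers the loss.
grad_mask = [-1.0, 0.0, 0.5, 0.0]
grad_scores = topk_backward(soft, grad_mask)
```

Under these assumptions, the generator only ever sees the documents picked by `hard`, yet `grad_scores` is nonzero for every document, which is what lets a single loss update the reranker. A gradient-descent step on the scores would raise the score of the first document and lower that of the third.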

Introduction. Retrieval-Augmented Generation (RAG) has become a powerful paradigm for enhancing large language models (LLMs) across diverse NLP tasks (Lewis et al., 2020; Gao et al., 2024; Li et al., 2024b; Wu et al., 2024; Abootorabi et al., 2025). By incorporating external evidence, RAG mitigates key weaknesses of LLMs such as hallucination (Ayala & Bechard, 2024) and knowledge obsolescence (Lau et al., 2025). However, most RAG systems suffer from a fundamental structural issue: retrieval and generation are optimized separately. Retrievers select documents based on surface-level similarity, while generators produce answers without providing feedback about what information is truly needed (Shi et al., 2025). This disjoint design leads to two intertwined challenges. (1) Optimization. Because document selection is discrete, gradients cannot flow from the generator back to the retriever (Sachan et al., 2021; Lin et al., 2024), which hinders joint training and prevents the retriever from aligning with the generator’s task objective. (2) Efficiency.

Discussion / Conclusion. In this paper, we address the challenge of compressing documents into high-quality implicit representations to enhance the performance of retrieval-augmented generation (RAG) systems that rely on document embeddings for question answering. To this end, we design multiple pretraining objectives that leverage LLM prompting to construct diverse supervision signals, including QA pairs (covering both simple and compositional reasoning) and paraphrased documents, encouraging the compressor to retain essential semantic information. We further introduce an efficient end-to-end training framework that unifies document representations across the reranking and generation stages, leading to substantial improvements in retrieval accuracy and answer quality. Extensive experiments on multiple QA benchmarks demonstrate that embedding-based contextual compression not only reduces input length and computation cost but also bridges the gap between retrieval and generation, enabling a more unified and semantically coherent RAG paradigm.

Compressor Generalization. The current compressor is pretrained exclusively on Wikipedia data.