Can retrieval strategies drive both draft refinement and new research question generation?

This explores whether one mechanism — using retrieval as a feedback loop — can serve two jobs at once: tightening an existing draft, and surfacing the new questions a researcher should ask next.

This reads the question as asking whether retrieval is just a fetch-and-fill step or something closer to an engine that both refines what you've written and tells you what to investigate next. The corpus suggests it can be both — and the bridge between the two jobs is the same insight: a partial draft is itself a signal about what's missing.

Start with refinement. One framing treats research writing as diffusion-style denoising: you hold a persistent draft skeleton and repeatedly improve it through targeted retrieval rather than writing top-to-bottom in one pass (see "Can iterative revision cycles match how humans actually write?"). Each retrieval step is aimed at a rough patch in the current draft, which keeps the whole thing globally coherent instead of locally patched. That's retrieval *driving* refinement: the draft's weak spots decide what gets pulled.
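The refinement loop described here can be sketched in a few lines. This is a minimal illustration, not the method of any cited system: the `Section` class, the toy `CORPUS`, and the "least evidence means weakest" heuristic are all assumptions made for the example, and the rewrite step is a stand-in for an LLM call.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    title: str
    text: str = ""
    evidence: list[str] = field(default_factory=list)


# Hypothetical toy corpus keyed by section title; a real system would query a search index.
CORPUS = {
    "methods": ["Method note A", "Method note B"],
    "results": ["Result note A"],
    "intro": ["Background note A", "Background note B", "Background note C"],
}


def retrieve(section: Section, k: int = 1) -> list[str]:
    """Return up to k passages for this section that the draft has not used yet."""
    unused = [p for p in CORPUS.get(section.title, []) if p not in section.evidence]
    return unused[:k]


def weakest(draft: list[Section]) -> Section:
    """Pick the section with the least supporting evidence (ties go to the earliest)."""
    return min(draft, key=lambda s: len(s.evidence))


def refine(draft: list[Section], steps: int) -> list[str]:
    """Run up to `steps` passes; return the order in which sections were patched."""
    patched = []
    for _ in range(steps):
        target = weakest(draft)
        passages = retrieve(target)
        if not passages:
            break  # nothing left to retrieve for the weakest section
        target.evidence.extend(passages)
        target.text = " ".join(target.evidence)  # stand-in for an LLM rewrite
        patched.append(target.title)
    return patched


draft = [Section("intro"), Section("methods"), Section("results")]
order = refine(draft, steps=4)
print(order)
```

The point of the sketch is the control flow: the whole draft persists across passes, and the choice of what to retrieve is made by inspecting the draft rather than by following a fixed outline order.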

The more surprising half is question generation. The key observation is that a model's own partial answer reveals information needs the original query couldn't express (see "Can a model's partial response guide what to retrieve next?"). When you feed a generated response back in as the next retrieval query, you surface implicit gaps, which is functionally the same as generating a new, sharper sub-question. So the loop that refines a draft and the loop that proposes new lines of inquiry are mechanically the same loop, just read in two directions: the gap can be filled (refinement) or pursued (a new question). This is why systems that separate query *planning* from answer *synthesis* outperform flat ones on multi-hop work (see "Do hierarchical retrieval architectures outperform flat ones on complex queries?"): the planning component is precisely where "what should I ask next" lives as a first-class step.
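A toy version of the response-as-query loop makes the mechanism concrete. The sketch below is loosely modeled on the idea attributed to ITER-RETGEN in the notes, but it is not that system's implementation: the four-document corpus, the token-overlap retriever, and the `generate` stub (which merely restates retrieved passages) are assumptions chosen so the example runs without a model.

```python
STOPWORDS = {"the", "of", "in", "is", "what", "was", "by", "on", "whoever"}

# Hypothetical toy corpus; ids and wording are illustrative only.
CORPUS = {
    "d1": "lovelace wrote notes on the analytical engine",
    "d2": "the analytical engine was designed by babbage",
    "d3": "babbage was born in london",
    "d4": "the thames flows through london",
}


def tokens(text: str) -> set[str]:
    return {w for w in text.lower().split() if w not in STOPWORDS}


def retrieve(query: str, k: int = 3) -> list[str]:
    """Rank documents by token overlap with the query; drop zero-overlap documents."""
    q = tokens(query)
    scored = [(len(q & tokens(text)), doc_id) for doc_id, text in CORPUS.items()]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [doc_id for _, doc_id in ranked[:k]]


def generate(doc_ids: list[str]) -> str:
    """Stand-in for an LLM: the partial answer just restates the retrieved passages."""
    return " ".join(CORPUS[d] for d in doc_ids)


def iterate(question: str, rounds: int = 2) -> list[list[str]]:
    """Each round retrieves with question + previous response, then regenerates."""
    history, response = [], ""
    for _ in range(rounds):
        doc_ids = retrieve(f"{question} {response}")
        response = generate(doc_ids)
        history.append(doc_ids)
    return history


question = "what is the hometown of whoever designed the machine in the lovelace notes"
history = iterate(question)
print(history)
```

In the first round the question shares no terms with `d3`, so the document holding the answer is unreachable. Only after the partial response introduces "babbage" does the second round pull it in. That newly surfaced term is the gap: filled, it completes the answer; pursued, it is the sharper sub-question about where Babbage was born.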

But the corpus also names the failure mode you'd worry about. If retrieval can generate new questions, it can also generate confident garbage: deep research agents fabricate examples and evidence to satisfy a demand for depth, which accounted for 39% of failures in an analysis of 1,000 failure reports (see "Why do deep research agents fabricate scholarly content?"). The proposed guardrail is making generation earn its place: letting a system grow its own corpus from its outputs only when those outputs pass entailment, attribution, and novelty checks (see "Can RAG systems safely learn from their own generated answers?"), or refusing to answer at all when evidence is too thin (see "Can RAG systems refuse to answer without reliable evidence?"). Without that gate, a question-generating loop just compounds its own hallucinations.
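The gate can be pictured as three independent predicates that must all pass before a generated answer is written back to the corpus. The sketch below is an assumption-heavy illustration: the `Candidate` type and the check functions are hypothetical, and the entailment check is a crude word-overlap proxy where a real system would presumably call an NLI model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Candidate:
    text: str
    sources: tuple[str, ...]  # ids of the documents the answer claims to rest on


Check = Callable[[Candidate, dict[str, str]], bool]


def is_attributed(c: Candidate, corpus: dict[str, str]) -> bool:
    """At least one source is cited, and every cited id exists in the corpus."""
    return bool(c.sources) and all(s in corpus for s in c.sources)


def is_entailed(c: Candidate, corpus: dict[str, str]) -> bool:
    """Toy proxy for entailment: every word of the answer appears in a cited source."""
    support = set(" ".join(corpus.get(s, "") for s in c.sources).lower().split())
    return set(c.text.lower().split()) <= support


def is_novel(c: Candidate, corpus: dict[str, str]) -> bool:
    """Reject answers that duplicate an existing corpus entry verbatim."""
    return c.text.lower() not in {t.lower() for t in corpus.values()}


CHECKS: tuple[Check, ...] = (is_attributed, is_entailed, is_novel)


def admit(c: Candidate, corpus: dict[str, str], key: str) -> bool:
    """Add the candidate to the corpus only if all checks pass."""
    if all(check(c, corpus) for check in CHECKS):
        corpus[key] = c.text
        return True
    return False


corpus = {"d1": "the survey covers 40 clinics", "d2": "clinics report weekly"}
candidates = [
    Candidate("40 clinics report weekly", ("d1", "d2")),      # supported and new
    Candidate("a 2019 trial confirmed the effect", ("d9",)),  # cites a missing source
    Candidate("clinics report weekly", ("d2",)),              # verbatim duplicate
]
admitted = [admit(c, corpus, f"g{i}") for i, c in enumerate(candidates)]
print(admitted)
```

Keeping the checks as separate functions means a rejection can be traced to the specific check that failed. The first candidate also shows why word overlap is only a placeholder: its words are all present in the sources, yet whether the sources truly entail the combined claim is exactly what a real entailment model would have to decide.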

Two deeper caveats reframe the whole thing. First, not every question wants the same retrieval: question *type* determines strategy, so a comparison or debate question needs aspect-specific retrieval while an evidence-based question suits standard RAG (see "Does question type determine the right retrieval strategy?"). A loop that generates new questions had better also classify them. Second, retrieval works best when it's trained on whether documents actually *helped* the answer, not just whether they looked similar (see "Can retrieval learn what actually helps answer questions?"), which is exactly the signal a draft-refinement loop produces for free. The thing you didn't know you wanted to know: drafting and question-generation aren't two features to build separately. They're the forward and reverse readings of a single retrieval-feedback loop, and the corpus's open problem is governing it so that it generates real questions instead of plausible fictions (see "Where do retrieval systems fail and why?").
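The first caveat, that generated questions need classifying before retrieval, amounts to a routing table. The sketch below uses the five question types and strategy pairings named in the note summary; the strategy strings, the keyword cues, and the `classify` heuristic are hypothetical placeholders for what would in practice be a trained classifier.

```python
from enum import Enum


class QType(Enum):
    EVIDENCE = "evidence-based"
    COMPARISON = "comparison"
    DEBATE = "debate"
    EXPERIENCE = "experience"
    REASON = "reason"


# Pairings follow the note summary; the strategy names themselves are illustrative.
STRATEGY = {
    QType.EVIDENCE: "standard_rag",
    QType.COMPARISON: "aspect_specific_retrieval",
    QType.DEBATE: "aspect_specific_retrieval",
    QType.EXPERIENCE: "decompose_or_filter",
    QType.REASON: "decompose_or_filter",
}

# Hypothetical keyword cues, checked in order; a real classifier would be a trained model.
CUES = [
    (QType.COMPARISON, ("compare", " versus ", " vs ", "difference between")),
    (QType.DEBATE, ("should ", "pros and cons")),
    (QType.EXPERIENCE, ("has anyone", "your experience", "recommend")),
    (QType.REASON, ("why ", "how come")),
]


def classify(question: str) -> QType:
    q = f" {question.lower()} "  # pad so cues with spaces can match at either end
    for qtype, cues in CUES:
        if any(cue in q for cue in cues):
            return qtype
    return QType.EVIDENCE  # default: treat as an evidence-based question


def route(question: str) -> str:
    """Map a (possibly loop-generated) question to a retrieval strategy label."""
    return STRATEGY[classify(question)]


for q in (
    "Why do deep research agents fabricate content?",
    "Compare dense and sparse retrieval",
    "Should RAG systems refuse to answer?",
    "What year was BM25 introduced?",
):
    print(q, "->", route(q))
```

Placing `route` between question generation and retrieval means every question the loop proposes picks up a strategy before any documents are fetched.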

Sources 9 notes

Can iterative revision cycles match how humans actually write?

Research writing follows a draft-and-revise pattern analogous to diffusion sampling, where a persistent draft skeleton is iteratively denoised through targeted retrieval steps. This architecture maintains global coherence better than linear pipelines while mirroring cognitive studies of actual human writing.

Can a model's partial response guide what to retrieve next?

ITER-RETGEN shows that iteratively using generated responses as retrieval queries substantially improves performance on multi-hop reasoning and fact verification. Generation acts as both answer producer and information-need clarifier, surfacing implicit gaps that the original query missed.

Do hierarchical retrieval architectures outperform flat ones on complex queries?

Separating query planning from answer synthesis into distinct components reduces interference and improves multi-hop query performance. This architectural principle mirrors documented benefits of separating planning from execution in agent design.

Why do deep research agents fabricate scholarly content?

Analysis of 1,000 failure reports reveals 39% of agent failures stem from strategic content fabrication—inventing examples, products, and false evidence—to mimic scholarly rigor when actual research depth is demanded.

Can RAG systems safely learn from their own generated answers?

Systems can add generated answers to their retrieval corpus when outputs pass entailment verification, source attribution checks, and novelty detection. This prevents hallucinations from polluting future retrievals while allowing genuine knowledge accumulation.

Can RAG systems refuse to answer without reliable evidence?

A multilingual RAG system for noisy historical newspapers succeeds by aggressively expanding retrieval while constraining generation to only grounded answers. The grounded-refusal prompt prevents hallucination when OCR errors and language drift degrade source quality, trading coverage for integrity.

Does question type determine the right retrieval strategy?

Research shows non-factoid questions split into five types, each requiring different retrieval and aggregation methods. Evidence-based questions suit standard RAG, while debate and comparison need aspect-specific retrieval, and experience/reason questions need decomposition or filtering strategies.

Can retrieval learn what actually helps answer questions?

CLaRa propagates generator loss back through continuous document representations, allowing retrievers to optimize for documents that actually improve answers rather than surface similarity. The gap between relevance and usefulness closes when retrieval receives direct feedback from generation success.

Where do retrieval systems fail and why?

RAG systems fail at three structural levels: adaptive triggering (fixed intervals waste context), semantic-task mismatch (embeddings measure association, not relevance), and mathematical limits (embedding dimension constrains representable document sets). These require fundamentally different retrieval approaches, not tuning.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether retrieval strategies can simultaneously refine drafts AND generate new research questions. The question remains open; the constraints may have shifted.

What a curated library found — and when (dated claims, not current truth):
Library findings span 2024–2025:
• Draft refinement via targeted retrieval works as a diffusion-like denoising loop, keeping global coherence (2024–2025).
• A model's partial response reveals implicit information needs, collapsing the distinction between draft refinement and question generation into one feedback loop (2024–2025).
• Hierarchical systems separating query planning from answer synthesis outperform flat ones on multi-hop retrieval (2024–2025).
• Deep research agents fabricate evidence to satisfy depth demands; fabrication accounted for 39% of 1,000 analyzed failure reports, and entailment/attribution gates are the proposed guardrail (2025).
• Question type (for example evidence-based vs. comparison vs. debate) determines optimal retrieval strategy; unified loops must classify generated questions (2025).
• Joint optimization of retriever and generator via shared continuous representations outperforms similarity-only ranking (2024–2025).

Anchor papers (verify; mind their dates):
• arXiv:2407.01219 (2024-07): Best practices in RAG.
• arXiv:2503.15879 (2025-03): Typed-RAG and question-type awareness.
• arXiv:2512.01948 (2025-12): How far are we from genuinely useful deep research agents?
• arXiv:2511.18659 (2025-11): CLaRa — bridging retrieval and generation via continuous latent reasoning.

Your task:
(1) RE-TEST EACH CONSTRAINT. For each claim above, assess whether post-2025-12 models, training methods (e.g., reinforcement learning for RAG, chain-of-thought scaling), tooling (agentic frameworks, memory/caching orchestration), or evaluation have relaxed or overturned it. Separate the durable question (likely still open) from the perishable limitation. Does the fabrication problem persist? Do question-type classifiers now come standard, or are they still an add-on?
(2) Surface the strongest contradicting or superseding work from the last ~6 months. Has anyone shown that unified retrieval-generation loops *don't* generate new questions reliably, or that draft refinement and question generation require fundamentally different strategies?
(3) Propose 2 research questions that assume the regime may have moved—e.g., what if reinforcement learning on retrieval-aided drafting has made hierarchical planning obsolete? Or: can end-to-end fine-tuned models skip the question-type classification bottleneck?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
