How do archive systems handle knowledge that changes with each generation?
This reads 'each generation' as each round of an AI's own output — asking how a knowledge archive stays trustworthy when the system keeps adding, revising, or building on what it just generated, rather than asking about human generations over time.
This explores what happens to a knowledge store when the AI's own outputs feed back into it, so that the corpus isn't fixed but grows and shifts with every cycle of generation. The collected notes have a surprisingly direct answer, and it centers on a single tension: letting a system learn from itself is how knowledge accumulates, but it's also how errors compound.
The cleanest treatment is bidirectional RAG, where a system writes its own generated answers back into the retrieval corpus, but only through a gate. Outputs have to pass entailment checks, source attribution, and novelty detection before they're allowed to join the archive, precisely so that a hallucination from one generation doesn't quietly poison every future retrieval (see 'Can RAG systems safely learn from their own generated answers?'). The same instinct shows up defensively in noisy archives: when sources degrade (OCR errors, language drift in historical newspapers), the system is built to refuse rather than guess, trading coverage for integrity so each generation doesn't manufacture confident fiction (see 'Can RAG systems refuse to answer without reliable evidence?'). Both notes share a thesis: accumulation is safe only when generation is constrained at the moment of write-back.
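The gate described above can be sketched in a few lines. This is a minimal illustration, not the method from the cited note: the `Corpus` class, `try_write_back`, and both thresholds are hypothetical names chosen here, and the entailment and novelty checks are crude token-overlap stand-ins for what would, in practice, likely be an NLI model and an embedding-similarity search.

```python
from dataclasses import dataclass, field


def tokens(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for real text normalization."""
    stripped = (w.strip(".,;:!?").lower() for w in text.split())
    return {w for w in stripped if w}


def jaccard(a: set[str], b: set[str]) -> float:
    """Set overlap in [0, 1]; used here as a toy similarity measure."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


@dataclass
class Corpus:
    docs: dict[str, str] = field(default_factory=dict)

    def try_write_back(
        self,
        doc_id: str,
        answer: str,
        cited_ids: list[str],
        support_min: float = 0.8,   # assumed threshold, not from the source
        novelty_max: float = 0.9,   # assumed threshold, not from the source
    ) -> tuple[bool, str]:
        """Admit a generated answer only if it passes all three gates."""
        # Gate 1, attribution: every citation must point at a stored document.
        if not cited_ids or any(c not in self.docs for c in cited_ids):
            return False, "attribution"

        # Gate 2, entailment (toy proxy): most answer tokens must appear
        # in the cited evidence. A real system would use an NLI model.
        evidence = set().union(*(tokens(self.docs[c]) for c in cited_ids))
        answer_tokens = tokens(answer)
        if not answer_tokens:
            return False, "entailment"
        if len(answer_tokens & evidence) / len(answer_tokens) < support_min:
            return False, "entailment"

        # Gate 3, novelty: reject near-duplicates of anything already stored.
        for existing in self.docs.values():
            if jaccard(answer_tokens, tokens(existing)) > novelty_max:
                return False, "novelty"

        self.docs[doc_id] = answer
        return True, "accepted"
```

The ordering is a design choice under these assumptions: attribution is checked first because the entailment proxy needs the cited documents to exist, and nothing is written until all three gates pass, so a rejected answer leaves the corpus unchanged.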
A second, lateral framing treats the changing knowledge not as a corpus but as a living document. The ACE framework handles evolving context as an incrementally edited 'playbook,' applying small curated updates through generation-reflection-curation loops instead of rewriting the whole thing each pass. That protects against the quieter failure mode where each regeneration compresses away detail until the knowledge collapses into uselessness (see 'Can context playbooks prevent knowledge loss during iteration?'). Here the enemy isn't false additions but erosion: knowledge that changes by shrinking.
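To make the contrast with full rewrites concrete, here is a minimal sketch of incremental playbook editing. It is not ACE's actual interface: the `Delta` type, the three operation names, and `apply_deltas` are hypothetical, and the reflection and curation steps that would decide which deltas to emit are assumed to happen elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Delta:
    """One small curated edit to a single playbook entry (hypothetical type)."""
    op: str        # "add", "revise", or "retire"
    key: str       # identifier of the entry being touched
    text: str = ""


def apply_deltas(playbook: dict[str, str], deltas: list[Delta]) -> dict[str, str]:
    """Return a new playbook with only the targeted entries changed.

    Entries that no delta mentions are copied through verbatim, which is
    the property that guards against erosion by repeated summarization.
    """
    updated = dict(playbook)
    for delta in deltas:
        if delta.op == "add":
            # Never overwrite an existing entry by accident.
            updated.setdefault(delta.key, delta.text)
        elif delta.op == "revise":
            # Revising an unknown entry is ignored rather than invented.
            if delta.key in updated:
                updated[delta.key] = delta.text
        elif delta.op == "retire":
            updated.pop(delta.key, None)
        else:
            raise ValueError(f"unknown delta op: {delta.op!r}")
    return updated
```

Because the function only touches keys named in a delta, the cost of an update scales with the size of the change rather than the size of the playbook, and detail in untouched entries cannot be lost in that pass.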
What's striking is that the corpus disagrees with itself about whether to accumulate at all. Some systems keep a persistent memory workspace across retrieval cycles specifically to detect and resolve contradictions as new evidence arrives (see 'Can reasoning systems maintain memory across retrieval cycles?'), and others use each partial answer to reveal what to retrieve next, so generation itself drives what enters the working store (see 'Can a model's partial response guide what to retrieve next?'). But there's a contrarian voice: memoryless, Markov-style reasoning argues that carrying accumulated history is baggage, and that contracting each step to depend only on the current state preserves coherence without the bloat. On that view, the safest way to handle knowledge that changes each generation is to deliberately not accumulate it (see 'Can reasoning systems forget history without losing coherence?').
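The memoryless position is easier to see on a toy problem. The sketch below is an illustration of the idea only, not the Atom of Thoughts implementation: it assumes a problem is a small DAG of arithmetic nodes, and the names `contract` and `solve` are hypothetical. Each step produces a smaller, self-contained problem with the same answer and keeps nothing else from earlier steps.

```python
import operator
from typing import Union

# A node is either a solved value or an unsolved (op, left, right) triple.
Node = Union[int, tuple[str, str, str]]
OPS = {"add": operator.add, "mul": operator.mul}


def contract(problem: dict[str, Node], target: str) -> dict[str, Node]:
    """One memoryless step: return a smaller problem with the same answer."""
    solved = {k: v for k, v in problem.items() if isinstance(v, int)}
    nxt: dict[str, Node] = {}
    for name, node in problem.items():
        if isinstance(node, int):
            continue
        op, left, right = node
        if left in solved and right in solved:
            # Both dependencies are known, so fold this node into a value.
            nxt[name] = OPS[op](solved[left], solved[right])
        else:
            nxt[name] = node
    # Keep only the old values that some unsolved node (or the target)
    # still needs; everything else is dropped rather than archived.
    needed = {target}
    for node in nxt.values():
        if not isinstance(node, int):
            needed.update(node[1:])
    nxt.update({k: v for k, v in solved.items() if k in needed})
    return nxt


def solve(problem: dict[str, Node], target: str) -> int:
    """Contract repeatedly; the only state carried forward is the problem."""
    state = dict(problem)
    while not isinstance(state[target], int):
        nxt = contract(state, target)
        if nxt == state:
            raise ValueError("no progress: cycle or missing dependency")
        state = nxt
    return state[target]
```

For example, with `{"a": 2, "b": 3, "c": 4, "s": ("add", "a", "b"), "t": ("mul", "s", "c")}` and target `"t"`, the first contraction solves `s` and discards `a` and `b`, since nothing remaining depends on them. The contrast with the memory-workspace systems is that there is no log of past states to consult, and so none to become inconsistent.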
The thing worth taking away: 'archiving' generated knowledge isn't one problem but a fork between two failure modes — contamination (bad outputs entering the record) and erosion (good detail compressing away) — and the field hasn't agreed on whether the cure is a gated, verified memory that grows or a disciplined forgetting that never lets the archive drift at all.
Sources (6 notes)
Systems can add generated answers to their retrieval corpus when outputs pass entailment verification, source attribution checks, and novelty detection. This prevents hallucinations from polluting future retrievals while allowing genuine knowledge accumulation.
A multilingual RAG system for noisy historical newspapers succeeds by aggressively expanding retrieval while constraining generation to only grounded answers. The grounded-refusal prompt prevents hallucination when OCR errors and language drift degrade source quality, trading coverage for integrity.
The ACE framework treats contexts as evolving playbooks using generation-reflection-curation loops rather than full rewrites. This prevents knowledge loss from compression and detail erosion, achieving +10.6% on agentic tasks and +8.6% on finance without labeled supervision.
ComoRAG demonstrates that iterative evidence acquisition with a persistent memory workspace outperforms stateless multi-step retrieval by detecting and resolving contradictions through deeper exploration, achieving up to 11% gains on complex queries.
ITER-RETGEN shows that iteratively using generated responses as retrieval queries substantially improves performance on multi-hop reasoning and fact verification. Generation acts as both answer producer and information-need clarifier, surfacing implicit gaps that the original query missed.
Atom of Thoughts decomposes problems into DAGs and contracts them iteratively, ensuring each state depends only on the current problem—not prior steps. This memoryless approach eliminates historical baggage that bloats reasoning while maintaining answer equivalence.