Can the serving loop itself become the primary training data source?
This explores the 'data flywheel' idea — whether a model's own production traffic (the questions it answers, the outputs it generates while serving users) can replace external datasets as the main thing it trains on, and what makes that loop virtuous instead of degenerate.
The corpus says yes in principle, but only when the loop has a verification gate. The cleanest demonstration is bidirectional RAG, where generated answers are written back into the retrieval corpus so the system grows its own knowledge base during use, but only after each output passes entailment checks, source attribution, and novelty detection (see 'Can RAG systems safely learn from their own generated answers?'). Strip the gate and you don't get learning, you get pollution: hallucinations feeding future retrievals. So the real question isn't 'can the serving loop be the data source' but 'what filter keeps the loop from poisoning itself.'
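A write-back gate of this kind can be sketched in a few lines. This is a minimal illustration, not the method from the cited note: the `WriteBackGate` and `Candidate` names are hypothetical, `toy_entails` is a token-overlap stand-in for a real entailment model, and novelty is reduced to an exact-match check where a real system would presumably use similarity search.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Candidate:
    answer: str
    sources: List[str]  # passages the answer claims to rest on


def toy_entails(premise: str, hypothesis: str) -> bool:
    # Hypothetical stand-in for an NLI model: the hypothesis counts as
    # "entailed" if every one of its tokens appears in the premise.
    return set(hypothesis.lower().split()) <= set(premise.lower().split())


@dataclass
class WriteBackGate:
    entails: Callable[[str, str], bool]
    corpus: List[str] = field(default_factory=list)

    def is_attributed(self, cand: Candidate) -> bool:
        # The answer must cite sources, and each one must already be in the corpus.
        return bool(cand.sources) and all(s in self.corpus for s in cand.sources)

    def is_entailed(self, cand: Candidate) -> bool:
        # At least one cited source must support the answer.
        return any(self.entails(s, cand.answer) for s in cand.sources)

    def is_novel(self, cand: Candidate) -> bool:
        # Exact-match novelty only (an assumption made to keep the sketch small).
        return cand.answer not in self.corpus

    def maybe_write_back(self, cand: Candidate) -> bool:
        # Write to the corpus only if all three checks pass.
        if self.is_attributed(cand) and self.is_entailed(cand) and self.is_novel(cand):
            self.corpus.append(cand.answer)
            return True
        return False
```

The design choice worth noting is that the gate fails closed: an answer with no sources, an unsupported claim, or a duplicate never reaches the corpus, so a bad generation cannot be retrieved later as if it were evidence.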
Several notes attack that filter problem from different directions. Self-play frameworks generate their own curriculum with no external data at all: a proposer invents calibrated problems and a solver learns by majority-vote agreement, both improving through RL alone (see 'Can language models improve themselves without any external training data?'). Others build the verifier into the model itself. Post-completion learning trains a model to score its own outputs using the unused sequence space after its answer, internalizing the reward function rather than calling out to a separate judge (see 'Can models learn to evaluate their own work during training?'), while self-supervised process rewards reach o3-mini-level results from dynamically weighted pseudo-labels with zero human annotation (see 'Can self-supervised process rewards replace human annotation?'). You can even simulate the *inputs*: LLMs can stand in for a search engine during agent training, generating documents from internal knowledge well enough to match or exceed real search engines (see 'Can LLMs replace search engines during agent training?'). Each is a way to close the loop without a human in it.
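The majority-vote signal in the self-play setup can be illustrated with a small sketch. The solver side follows the description above: sampled answers that match the most common answer are rewarded. The proposer rule is an assumption added for illustration; the `low` and `high` thresholds are hypothetical and only encode the idea that a calibrated problem is neither trivially easy nor hopelessly ambiguous.

```python
from collections import Counter
from typing import List, Tuple


def majority_vote_rewards(answers: List[str]) -> Tuple[str, List[float], float]:
    """Score sampled solver answers against their own majority, with no ground truth."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    majority, votes = Counter(answers).most_common(1)[0]
    # Each sample is rewarded for agreeing with the majority answer.
    rewards = [1.0 if a == majority else 0.0 for a in answers]
    agreement = votes / len(answers)
    return majority, rewards, agreement


def proposer_reward(agreement: float, low: float = 0.3, high: float = 0.9) -> float:
    # Hypothetical calibration rule: reward the proposer only when solver
    # agreement is in a middle band, so the problem is neither solved
    # unanimously nor met with near-random answers.
    return 1.0 if low <= agreement <= high else 0.0
```

For example, four samples `["4", "4", "5", "4"]` give an agreement of 0.75: the three matching samples are rewarded and, under the assumed thresholds, so is the proposer. Note that majority agreement is a proxy for correctness, which is exactly why the failure modes discussed elsewhere in this note matter.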
But the corpus is unusually loud about the failure mode, and this is the part worth knowing. A serving loop that trains on its own outputs is a feedback loop, and uncorrected feedback loops tend to converge on degenerate equilibria that amplify the system's own past decisions. YouTube's ranker needs an explicit selection-bias correction, a shallow position tower, precisely because training on logged serving data otherwise teaches the model to reinforce whatever it already showed (see 'Why do ranking systems need to model selection bias explicitly?'). The same gravity shows up in RL post-training, which amplifies a single dominant output format within the first epoch and collapses the alternatives, with the 'winner' depending on model scale and not necessarily on performance (see 'Does RL training collapse format diversity in pretrained models?'). And binary correctness rewards, the obvious signal a serving loop would harvest, damage calibration by rewarding confident guessing, because a confident wrong answer costs nothing extra; adding a proper scoring rule such as the Brier score as a second reward term restores the incentive to be calibrated (see 'Does binary reward training hurt model calibration?').
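The calibration point can be checked numerically. This is a minimal sketch, assuming an answer that is correct with probability `p` and a model that reports confidence `q`; the function names and the grid search are illustrative and not taken from the cited note. With the binary reward alone, the expected reward is `p` regardless of `q`, so nothing penalizes overconfidence. Subtracting the Brier penalty makes the expected reward peak at `q = p`.

```python
def expected_reward(p: float, q: float, use_brier: bool) -> float:
    """Expected reward for an answer correct with probability p, reported with confidence q."""
    correctness = p  # expectation of the 0/1 correctness reward
    if not use_brier:
        return correctness  # does not depend on q at all
    # Expected Brier penalty: (q - 1)^2 when correct, (q - 0)^2 when wrong.
    brier = p * (1.0 - q) ** 2 + (1.0 - p) * q ** 2
    return correctness - brier


def best_confidence(p: float, use_brier: bool, grid: int = 101) -> float:
    # Grid search over reported confidences in [0, 1].
    candidates = [i / (grid - 1) for i in range(grid)]
    return max(candidates, key=lambda q: expected_reward(p, q, use_brier))
```

With `p = 0.7`, the combined reward is maximized by reporting a confidence of 0.7, while claiming certainty (`q = 1.0`) scores strictly lower. Under the binary reward alone, every `q` ties, which is the sense in which confident guessing goes unpunished.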
There's also a compatibility catch that complicates the naive 'just feed outputs back' picture: higher-quality self-generated data isn't automatically better. Teacher-refined data degrades a student when it exceeds the student's learning frontier, so a serving loop needs to filter its own harvest against what the current model can actually absorb (see 'Does teacher-refined data always improve student model performance?'). And once you're training on serving traffic, that traffic becomes an attack surface. Corpus poisoning and write-back channels need retrieval-time defenses like partitioned retrieval and similarity-collapse detection, since a self-feeding loop will faithfully ingest whatever an adversary plants (see 'Can we defend RAG systems from corpus poisoning without retraining?').
The synthesis worth taking away: the serving loop *can* be the primary data source, and the field is clearly building toward it from both the generation side (write-back, self-play, simulated inputs) and the supervision side (internalized self-evaluation, self-supervised rewards). What separates a flywheel from a death spiral is a verification gate plus an explicit bias correction. Without those, a model trained on its own serving traffic doesn't learn, it narrows: it gets more confident, less diverse, and more vulnerable to anything it accidentally swallowed.
Sources (10 notes)
Systems can add generated answers to their retrieval corpus when outputs pass entailment verification, source attribution checks, and novelty detection. This prevents hallucinations from polluting future retrievals while allowing genuine knowledge accumulation.
SQLM uses a proposer-solver framework where the proposer generates calibrated problems and the solver learns via majority-vote verification. Both agents improve through RL alone, creating an automatic curriculum that scales without human labels or ground-truth answers.
Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.
MetaStone-S1's SPRM achieves o3-mini-level results using dynamic weighting of pseudo-labels instead of human-annotated steps. This eliminates the annotation bottleneck for process supervision, though generalization to fuzzy-outcome domains remains unproven.
ZeroSearch and SSRL demonstrate that LLMs can generate relevant documents and search results from internal knowledge, with 14B simulators matching or exceeding real search engines. Curriculum degradation and test-time scaling optimize this approach for training without API costs.
YouTube's multi-objective ranker uses MMoE for conflicting objectives and a shallow position tower to remove selection bias from training data. Without both mechanisms, models converge on degenerate equilibria that amplify their own past decisions.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.
RAGPart and RAGMask provide lightweight, retraining-free defenses that operate at the retrieval layer. RAGPart bounds poisoned-document influence via partitioned retriever learning; RAGMask flags suspicious documents through abnormal similarity collapse under token masking.