Can curriculum degradation of document quality accelerate policy learning?
This explores whether deliberately lowering the quality of training documents in a staged, easy-to-hard curriculum could speed up reinforcement learning of a policy — and the corpus suggests the premise mixes up two different levers: ordering difficulty (which helps) and degrading content (which usually hurts).
The most useful thing the corpus does is separate the part of this idea that works from the part that backfires. The working part is *ordering*: curriculum learning helps when it sequences difficulty intelligently. "Can curriculum learning approximate expensive process supervision?" shows the strongest version of this: R3 starts the model near a finished solution and slides the starting point backward, so early training is easy and later training gets harder. That ordering, not any change in document quality, is what lets cheap outcome feedback approximate expensive step-by-step supervision. So if 'curriculum degradation' means 'reveal the hard parts gradually,' the answer is yes, but the lever is sequencing, not corrupting text.
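The backward-sliding start point can be sketched as a schedule over how much of a worked solution is handed to the policy at each stage. This is a simplified illustration of the idea, not the R3 implementation: the function names `prefix_schedule` and `build_prompt`, the even spacing of stages, and the newline-joined prompt format are all assumptions made for the example.

```python
from typing import List


def prefix_schedule(num_steps: int, num_stages: int) -> List[int]:
    """Number of demonstration steps revealed at each curriculum stage.

    Stage 0 reveals all but the last step (the easiest start state);
    the final stage reveals nothing (the full problem). Hypothetical
    helper: even spacing is an assumption, not something the source states.
    """
    if num_steps < 1 or num_stages < 1:
        raise ValueError("num_steps and num_stages must be positive")
    if num_stages == 1:
        return [0]
    top = num_steps - 1
    return [
        round(top * (num_stages - 1 - stage) / (num_stages - 1))
        for stage in range(num_stages)
    ]


def build_prompt(question: str, demo_steps: List[str], revealed: int) -> str:
    """Start state for a rollout: the question plus the first `revealed` steps."""
    return "\n".join([question, *demo_steps[:revealed]])
```

At every stage the policy is scored only on whether the final answer is correct. Because the start point moves back one region at a time, a drop in success rate at a given stage points to the steps that were just un-revealed, which is how outcome-only feedback can stand in for step-level labels.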
The part that backfires is using genuinely degraded or too-hard material as the signal. "Do overly hard RLVR samples actually harm model capabilities?" is the direct warning: train a policy on near-impossible problems and it doesn't learn harder reasoning. It learns shortcuts (answer repetition, skipping computation), and those shortcuts then contaminate skills the model already had. Difficulty without solvability doesn't accelerate policy learning; it actively erodes it. So 'harder = faster' is exactly the trap.
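The source note for this finding attributes the damage to group-relative normalization, which treats rare accidental successes as high-advantage trajectories. A minimal sketch of that effect follows, assuming the common form of the normalization (reward minus group mean, divided by group standard deviation). The function name, the `eps` term, and the binary 0/1 rewards are assumptions for the example, not details taken from the note.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Standardize each rollout's reward against its own sampling group.

    Assumed formula: (r - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Near-impossible problem: one lucky success among eight rollouts.
hard_group = group_relative_advantages([0, 0, 0, 0, 0, 0, 0, 1])

# Solvable problem: half of the rollouts succeed.
balanced_group = group_relative_advantages([1, 0, 1, 0, 1, 0, 1, 0])
```

Under these assumptions the single success in the hard group gets an advantage of roughly 2.6, against roughly 1.0 for each success in the balanced group. The rarer the success, the larger the push toward whatever that one trajectory happened to do, whether or not it reflects sound reasoning.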
Quality degradation has a parallel failure even outside RL. "Does teacher-refined data always improve student model performance?" finds that data which is *objectively better* can still hurt a student model if it sits past the student's learning frontier, so the student has to filter for what's compatible with its own profile. And "Do frontier LLMs silently corrupt documents in long workflows?" shows that when document quality drops in practice (long relay workflows corrupting ~25% of content), errors compound silently rather than teaching anything useful. Degradation, left unmanaged, is noise that accumulates, not a gradient.
The genuinely surprising wrinkle comes from "Does instruction tuning teach task understanding or output format?": models trained on semantically empty or even wrong instructions performed about as well as models trained on correct ones, because what transfers is the *shape* of the output space, not the content. That hints at why a 'degraded document' curriculum could ever seem to work: if the policy is really learning format and output structure, content quality may matter far less than we assume, and degraded-but-structurally-intact documents might still carry the signal that matters. But "Does procedural knowledge drive reasoning more than factual retrieval?" cuts the other way for reasoning specifically: generalizable reasoning rides on broad *procedural* knowledge across documents, so the procedures (the how-to demonstrations) are precisely the wrong thing to corrupt.
The takeaway you might not have gone looking for: the corpus doesn't support 'make documents worse on a schedule to train faster,' but it does support a sharper reframing — sequence solvable difficulty (R3), respect the model's current frontier, and recognize that some of what we call 'quality' (format, output space) is cheap to teach while the part that actually drives reasoning (procedural demonstrations) is the part you can least afford to degrade.
Sources (6 notes)
R3 progressively slides the reasoning start state backward from near-completion, creating a curriculum that reveals step-level failure modes using only outcome feedback. This achieves process supervision granularity without expensive human step annotations.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.
Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.
Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions (43% versus 42.6% for a random baseline). The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.