INQUIRING LINE

Can curriculum degradation of document quality accelerate policy learning?

This explores whether deliberately lowering the quality of training documents in a staged, easy-to-hard curriculum could speed up reinforcement learning of a policy — and the corpus suggests the premise mixes up two different levers: ordering difficulty (which helps) and degrading content (which usually hurts).


The most useful thing the corpus does is separate the part of this idea that works from the part that backfires. The working part is *ordering*: curriculum learning helps when it sequences difficulty intelligently. "Can curriculum learning approximate expensive process supervision?" shows the strongest version of this: R3 starts the model near a finished solution and slides the starting point backward, so early training is easy and later training gets harder. That ordering, not any change in document quality, is what lets cheap outcome feedback approximate expensive step-by-step supervision. So if 'curriculum degradation' means 'reveal the hard parts gradually,' the answer is yes, but the lever is sequencing, not corrupting text.
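The backward-sliding start point can be sketched in a few lines. This is a minimal illustration, assuming a demonstration is available as a list of text steps; the function name `reverse_curriculum_prompts` and the toy arithmetic problem are hypothetical and are not taken from the R3 paper.

```python
from typing import List


def reverse_curriculum_prompts(question: str, demo_steps: List[str]) -> List[str]:
    """Build prompts ordered from easiest (almost solved) to hardest.

    Stage k reveals all but the last k demonstration steps, so the policy
    must produce k steps itself before an outcome reward can be checked.
    The final stage reveals nothing beyond the question.
    """
    prompts = []
    for remaining in range(1, len(demo_steps) + 1):
        revealed = demo_steps[: len(demo_steps) - remaining]
        prompts.append("\n".join([question, *revealed]))
    return prompts


# Hypothetical toy demonstration, used only to show the ordering.
stages = reverse_curriculum_prompts(
    "Q: What is (3 + 4) * 2?",
    ["Step 1: 3 + 4 = 7", "Step 2: 7 * 2 = 14", "Answer: 14"],
)

for number, prompt in enumerate(stages, start=1):
    print(f"--- stage {number} ---\n{prompt}")
```

Note that every stage uses the same intact demonstration text. Only the amount revealed changes, which is the sense in which the lever is sequencing and not corruption.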

The part that backfires is using genuinely degraded or too-hard material as the signal. "Do overly hard RLVR samples actually harm model capabilities?" is the direct warning: train a policy on near-impossible problems and it doesn't learn harder reasoning; it learns shortcuts such as answer repetition and skipped computation, and those shortcuts then contaminate skills the model already had. Difficulty without solvability doesn't accelerate policy learning; it actively erodes it. So 'harder = faster' is exactly the trap.
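The source note attributes this to group-relative normalization, which treats a rare accidental success as a high-advantage trajectory. A small calculation makes the effect concrete. This sketch assumes the common mean-and-standard-deviation form of group normalization with binary rewards; the group size of 16 and both reward lists are hypothetical, not figures from the paper.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward by the mean and std of its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group: 16 rollouts on a near-impossible problem, one lucky hit.
hard_group = [1.0] + [0.0] * 15
# Hypothetical group: 16 rollouts on a solvable problem, half succeed.
solvable_group = [1.0] * 8 + [0.0] * 8

lucky_advantage = group_relative_advantages(hard_group)[0]
typical_advantage = group_relative_advantages(solvable_group)[0]

print(f"lucky success on hard problem:   {lucky_advantage:.2f}")
print(f"success on solvable problem:     {typical_advantage:.2f}")
```

Under these assumptions the single lucky success gets an advantage of roughly 3.9, against 1.0 for a success on the solvable problem. Whatever produced the lucky hit, including a shortcut, is reinforced almost four times as strongly.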

Quality degradation has a parallel failure even outside RL. "Does teacher-refined data always improve student model performance?" finds that data which is *objectively better* can still hurt a student model if it sits past the student's learning frontier, so the student has to filter for what's compatible with its own profile. And "Do frontier LLMs silently corrupt documents in long workflows?" shows that when document quality drops in practice (long relay workflows corrupting ~25% of content), errors compound silently rather than teaching anything useful. Degradation, left unmanaged, is noise that accumulates, not a gradient.
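To get a feel for what silent compounding means, here is a back-of-the-envelope sketch. It rests on two loud assumptions that the notes do not state: that the ~25% loss corresponds to the 50 round-trip horizon mentioned in the source note, and that each round trip retains a constant fraction of the content. Both function names are hypothetical.

```python
def per_trip_retention(total_retained: float, round_trips: int) -> float:
    """Constant per-trip retention that compounds to `total_retained`."""
    return total_retained ** (1.0 / round_trips)


def retained_after(per_trip: float, round_trips: int) -> float:
    """Fraction of content surviving after the given number of round trips."""
    return per_trip ** round_trips


# Assumption: ~25% loss (75% retained) over 50 round trips.
rate = per_trip_retention(0.75, 50)

print(f"implied per-trip loss: {(1 - rate) * 100:.2f}%")
print(f"retained after 100 trips at the same rate: {retained_after(rate, 100):.4f}")
```

Under this simplified model each round trip loses only about 0.57% of the content, which is small enough to go unnoticed at any single step while still adding up to a quarter of the document.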

The genuinely surprising wrinkle comes from "Does instruction tuning teach task understanding or output format?": models trained on semantically empty or even wrong instructions performed about as well as models trained on correct ones, because what transfers is the *shape* of the output space, not the content. That hints at why a 'degraded document' curriculum could ever seem to work: if the policy is really learning format and output structure, content quality may matter far less than we assume, and degraded-but-structurally-intact documents might still carry the signal that matters. But "Does procedural knowledge drive reasoning more than factual retrieval?" cuts the other way for reasoning specifically: generalizable reasoning rides on broad *procedural* knowledge across documents, so the procedures (the how-to demonstrations) are precisely the wrong thing to corrupt.

The takeaway you might not have gone looking for: the corpus doesn't support 'make documents worse on a schedule to train faster,' but it does support a sharper reframing — sequence solvable difficulty (R3), respect the model's current frontier, and recognize that some of what we call 'quality' (format, output space) is cheap to teach while the part that actually drives reasoning (procedural demonstrations) is the part you can least afford to degrade.
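One way to act on "sequence solvable difficulty" and "respect the model's current frontier" is to filter and order training problems by how often the current policy already solves them. The sketch below assumes that an empirical pass rate is a reasonable proxy for solvability. The function name, the 0.1 and 0.9 thresholds, and the problem IDs are all hypothetical and do not come from the corpus.

```python
from typing import Dict, List


def select_solvable(
    pass_rates: Dict[str, float], low: float = 0.1, high: float = 0.9
) -> List[str]:
    """Keep problems the policy sometimes solves, ordered easiest first.

    Problems below `low` are treated as too hard (little usable signal),
    and problems above `high` as already mastered.
    """
    kept = [pid for pid, rate in pass_rates.items() if low <= rate <= high]
    return sorted(kept, key=lambda pid: pass_rates[pid], reverse=True)


# Hypothetical pass rates measured from rollouts of the current policy.
measured = {"p1": 0.0, "p2": 0.25, "p3": 0.6, "p4": 1.0}
curriculum = select_solvable(measured)
print(curriculum)  # ['p3', 'p2']
```

The documents themselves are left untouched here: the curriculum changes which problems are shown and in what order, not the quality of their content.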


Sources (6 notes)

Can curriculum learning approximate expensive process supervision?

R3 progressively slides the reasoning start state backward from near-completion, creating a curriculum that reveals step-level failure modes using only outcome feedback. This achieves process supervision granularity without expensive human step annotations.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does teacher-refined data always improve student model performance?

Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.

Do frontier LLMs silently corrupt documents in long workflows?

Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to models trained on full, correct instructions: instruction tuning reaches 43% against a random baseline of 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst evaluating whether curriculum degradation of document quality can accelerate policy learning in LLMs. A curated library (2023–2026) found the following — treat these as dated claims to be re-tested:

**What a curated library found — and when:**
- Ordering, not corruption, is the acceleration lever: reverse-curriculum RL (R3) starts near a solution and slides backward, making early training solvable and progressive (2024). Genuine difficulty without solvability induces shortcuts (answer repetition, computation-skipping) that erode existing skills (2026).
- Data objectively better than the model's frontier can still harm learning; the student must filter for compatibility (2024). Silent corruption in long workflows (~25% content loss) compounds as noise, not signal (2026).
- Instruction-tuned models learn output-space *shape* as much as content; semantic emptiness doesn't always hurt (2023). But generalizable reasoning depends on *procedural* knowledge in documents — degrading how-to demonstrations is the worst degradation (2025).

**Anchor papers (verify; mind their dates):**
- arXiv:2402.05808 (2024): Reverse Curriculum RL
- arXiv:2605.28388 (2026): Sample Difficulty in RLVR
- arXiv:2411.12580 (2025): Procedural Knowledge in Pretraining
- arXiv:2604.15597 (2026): Document Corruption in Delegation

**Your task:**
(1) **RE-TEST EACH CONSTRAINT.** For each finding above, assess whether newer models (o3, o4-class), improved RL harnesses (Verl, Hydra at scale), or multi-agent orchestration have since relaxed or overturned the degradation trap. Separate the durable claim (sequencing beats corruption) from perishable limits (how hard is too hard for a given scale?). Cite what moved the needle.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months.** Does any recent paper show a regime where document degradation *does* accelerate learning—and if so, under what conditions (domain, model size, RL variant)?
(3) **Propose 2 research questions that ASSUME the regime may have shifted:** e.g., whether procedural-knowledge-preserving degradation (corrupt examples, not how-to) can still accelerate; or whether multi-stage relay orchestration (checkpoint→refine→relay) changes the corruption calculus.

Cite arXiv IDs; flag anything you cannot ground in a real paper.
