Why do overtrained domains show different RL training outcomes than novel tasks?
This explores why reinforcement learning behaves differently on domains a model already saw heavily during pretraining versus tasks that are genuinely new to it — and the corpus suggests the dividing line is whether RL is surfacing capabilities that already exist or building ones that don't.
This reads the question as: when a model is already saturated with a domain (it was overtrained on it), RL training plays a different role than when the task is unfamiliar. The clearest answer in the corpus is that RL is usually an *activation* mechanism, not a *teaching* one. On familiar, well-pretrained territory, RL doesn't add new reasoning; it surfaces and amplifies strategies already latent in the pretrained prior, and its updates are structurally sparse and bounded by what pretraining laid down (see "How does RL training reshape reasoning and what gets lost?"). So an overtrained domain shows fast, modest, ceiling-bound gains: there's little new to create, only existing behavior to sharpen.
The flip side is domain-conditional. One note draws the line explicitly: for standard reasoning tasks RL just activates abilities already present in the base model, but for complex planning that needs multi-step coordination, RL can generate genuinely novel strategies the base model can't reach even with heavy sampling (see "Does reinforcement learning create new reasoning abilities or activate existing ones?"). That's why 'novel tasks' diverge: they leave room for RL to act as an emergence engine, producing sophisticated reasoning from simple accuracy rewards without distilling any teacher's chain-of-thought (see "Can simple rewards alone teach complex domain reasoning?"). Overtrained domains have already spent that headroom.
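One way to make "can't reach even with heavy sampling" concrete is the pass@k metric: the probability that at least one of k sampled attempts solves a problem. The sketch below uses the standard unbiased estimator computed from n samples of which c are correct. That the cited note measured reachability this way is an assumption here, not something the note states; the function name is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of them correct.

    Assumption: this is a common way to quantify "reachable by sampling";
    the source note does not say which metric it used.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples are wrong, every size-k subset contains a success.
    if n - c < k:
        return 1.0
    # 1 minus the probability that a random size-k subset is all failures.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Under this reading, a strategy is "inaccessible" to the base model when c stays at 0 however large n grows, so pass@k is 0 for every k, while the RL-trained model achieves a nonzero value on the same problems.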
There's also a mechanical story about what RL *does* to an overtrained model: it collapses diversity. Controlled experiments show RL converges on a single dominant pretraining format within the first epoch and suppresses the alternatives, and which format wins depends on model scale, not necessarily on which one performs best (see "Does RL training collapse format diversity in pretrained models?"). In a domain the model already over-represents, this collapse is the main visible effect, which looks very different from a novel task where RL is still expanding the strategy space rather than pruning it.
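Collapse of this kind can be tracked with a simple diversity measure. The sketch below computes the Shannon entropy of the format labels assigned to a batch of sampled outputs; a value falling toward zero across training steps would indicate convergence on one format. How outputs are labeled by format is left open, and the function name is hypothetical rather than taken from the cited experiments.

```python
from collections import Counter
from math import log2


def format_entropy(labels):
    """Shannon entropy (in bits) of a list of format labels.

    Assumption: each sampled output has already been tagged with a format
    label by some classifier; the labels themselves are illustrative.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(-(n / total) * log2(n / total) for n in counts.values())
```

A batch split evenly across four formats scores 2 bits, and a batch that uses a single format scores 0, so comparing the value before and after the first epoch gives a rough picture of how much diversity was pruned.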
The outcome also depends on how the domain interacts with reward structure. RL gains track how verifiable the reward is: crisp binary signals unlock dramatic jumps while fuzzy judgment-based ones yield only modest improvement (see "Why does RL succeed more on some tasks than others?"). Entropy also moves in opposite directions across domain types: structured domains shrink output entropy while creative ones expand it, so training order itself reshapes outcomes (see "Does training order reshape how models handle different task types?"). Push difficulty too far and it backfires: on near-impossible samples models learn degenerate shortcuts that contaminate capabilities they already had (see "Do overly hard RLVR samples actually harm model capabilities?").
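The source note on overly hard samples attributes the shortcut problem to group-relative normalization, which turns a rare accidental success into a high-advantage trajectory. The sketch below shows the arithmetic, assuming the common form of that normalization (subtract the group mean, divide by the group standard deviation, as in GRPO-style training). The exact normalization used in the cited work is an assumption, and the names are illustrative.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt.

    Assumption: advantage = (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One lucky success among 16 rollouts on a near-impossible problem.
hard_group = [1.0] + [0.0] * 15
# Eight successes among 16 rollouts on a moderately hard problem.
medium_group = [1.0] * 8 + [0.0] * 8

lucky_advantage = group_relative_advantages(hard_group)[0]
typical_advantage = group_relative_advantages(medium_group)[0]
```

With binary rewards and success rate p, a successful rollout's advantage works out to sqrt((1 - p) / p), so the single success in the hard group gets about 3.87 (the square root of 15) against 1.0 in the balanced group. Whatever that one rollout happened to do, including repeating an answer or skipping computation, is reinforced nearly four times as strongly.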
The thread tying this together is that 'domain training has domain-conditional sweet spots, and the visible wins hide costs': gains in one place come paired with quiet degradation in reasoning faithfulness, transfer, and format flexibility (see "How do domain training techniques actually reshape model behavior?"). So the reason overtrained domains and novel tasks diverge isn't one effect but a family: activation vs. creation, format collapse vs. format expansion, and reward-shape sensitivity all land differently depending on how much the model already knows. The thing you didn't know you wanted to know: RL's biggest effect on a domain you've already mastered may be making the model *narrower and more confident*, not better. That is also why binary rewards quietly wreck calibration unless you add a calibration scoring term, such as the Brier score, to counteract them (see "Does binary reward training hurt model calibration?").
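The calibration point can be checked with a few lines of arithmetic. The sketch below compares a plain binary reward with one that subtracts a Brier penalty, the squared gap between stated confidence and the 0/1 outcome. The specific combination used here (correctness minus squared error) is an assumption for illustration; the source note says only that the Brier score is added as a second reward term.

```python
def binary_reward(correct: bool) -> float:
    """Plain verifiable reward: confidence never enters, so it is never penalized."""
    return 1.0 if correct else 0.0


def calibrated_reward(correct: bool, confidence: float) -> float:
    """Correctness minus a Brier penalty (assumed form, not the note's exact one)."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


def expected_calibrated_reward(p_correct: float, confidence: float) -> float:
    """Expected reward when the answer is right with probability p_correct."""
    return (p_correct * calibrated_reward(True, confidence)
            + (1.0 - p_correct) * calibrated_reward(False, confidence))


# Scan stated confidence for a model that is right 30% of the time.
grid = [i / 100 for i in range(101)]
best_confidence = max(grid, key=lambda q: expected_calibrated_reward(0.3, q))
```

Expanding the expectation gives p - p(1 - q)^2 - (1 - p)q^2, whose derivative in q is 2(p - q), so the reward peaks when stated confidence q equals the true success rate p. The binary reward has the same expected value at every confidence level, so it exerts no pressure against confident wrong answers.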
Sources (9 notes)
Research shows that verifiable rewards act as catalysts that surface existing capabilities from pretraining, not teachers that build new reasoning. RL updates are structurally sparse and bounded by the pretrained prior rather than by algorithmic sophistication.
For standard reasoning tasks, RL activates latent abilities already present in base models. For complex planning requiring multi-step coordination, RL generates genuinely novel strategies inaccessible to base models even with extensive sampling.
Medical AI systems and o3 demonstrate that sophisticated domain reasoning emerges naturally from RL training on difficult problems with only basic accuracy signals, without requiring explicit chain-of-thought distillation from teacher models.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Binary verifiable rewards enable dramatic RL gains (0.15% to 73.98%), while judgment-based evaluation yields modest improvements (55% reduction). Clear reward signals unlock suppressed capabilities; fuzzy signals barely move the needle.
Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. Scheduling guided by backward transfer (BWT), which trains structured tasks first, yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
Research shows every adaptation method—from parameter-efficient tuning to knowledge graph curricula—has optimal conditions tied to specific domains. The key finding: visible benefits like performance gains often come with hidden degradation in reasoning faithfulness, capability transfer, and format flexibility.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.