Does task ordering affect multi-task reinforcement learning outcomes?

This explores whether the *order* in which you train a model on different tasks changes the final result in multi-task reinforcement learning — and the corpus says yes, decisively, in more than one way.

The corpus is unusually direct here: ordering isn't a minor knob, it's a mechanism. The clearest evidence comes from Omni-Thinker, which shows that different domains pull a model's output entropy in opposite directions: structured tasks (math, code) drive entropy *down* toward sharp, single answers, while creative tasks drive it *up* toward open-ended variety. Train them in the wrong order and the entropy collapse from structured tasks quietly damages the model's open-ended capabilities. Schedule structured tasks first, guided by backward-transfer (BWT) measurements, and you recover a 6.2% gain over joint training on everything at once (see "Does training order reshape how models handle different task types?"). So order matters because tasks leave a *residue* on the model that helps or hurts whatever comes next.
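To make "backward-transfer measurements" concrete, here is a minimal sketch of the standard BWT metric from the continual-learning literature: the average change in score on earlier tasks once all training is done. The score matrices are invented for illustration, and how Omni-Thinker actually turns BWT into a schedule is not specified in the notes, so treat the comparison at the end as an assumption about how one might use it.

```python
from typing import List


def backward_transfer(acc: List[List[float]]) -> float:
    """Average backward transfer over a task sequence.

    acc[i][j] is the score on task j measured right after training on
    task i, with tasks indexed in training order. Negative values mean
    later tasks eroded earlier ones (forgetting).
    """
    n = len(acc)
    if n < 2:
        raise ValueError("need at least two tasks to measure backward transfer")
    # Final score on each earlier task minus its score right after it was learned.
    deltas = [acc[n - 1][j] - acc[j][j] for j in range(n - 1)]
    return sum(deltas) / (n - 1)


# Hypothetical scores, NOT from the source. Columns follow training order.
structured_first = [
    [0.80, 0.40],  # after structured: structured=0.80, creative=0.40
    [0.78, 0.70],  # after creative:   structured=0.78, creative=0.70
]
creative_first = [
    [0.70, 0.30],  # after creative:   creative=0.70, structured=0.30
    [0.55, 0.80],  # after structured: creative=0.55, structured=0.80
]

bwt_structured_first = backward_transfer(structured_first)
bwt_creative_first = backward_transfer(creative_first)

# Under these made-up numbers, the ordering with the less negative BWT
# is the one that preserves more of what was learned earlier.
preferred = (
    "structured_first"
    if bwt_structured_first >= bwt_creative_first
    else "creative_first"
)
```

With these illustrative numbers, training creative tasks first loses far more of the earlier skill, which is the kind of signal a BWT-guided schedule would act on.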

The same lesson shows up one level down, *within* a single task's training. RL doesn't learn uniformly. Across eight models it moves through a predictable two-phase sequence, first nailing execution correctness (getting steps right), then shifting the bottleneck to strategic planning (deciding what to do). Concentrating optimization on planning tokens only pays off once execution has stabilized, so the right intervention depends entirely on which phase you're in (see "Does RL training follow a predictable two-phase learning sequence?"). Ordering, in other words, is fractal: it governs both the sequence of tasks and the sequence of skills inside a task.
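One way to tell which phase a run is in is to track entropy separately for planning and execution tokens. The sketch below assumes each generated token already carries a role label and a next-token distribution; the labels, the distributions, and the function names are all hypothetical, and the notes do not say how the original work tags tokens.

```python
import math
from typing import Dict, List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy_by_role(
    dists: List[Sequence[float]], roles: List[str]
) -> Dict[str, float]:
    """Average per-token entropy for each role label (e.g. planning, execution)."""
    totals: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for probs, role in zip(dists, roles):
        totals[role] = totals.get(role, 0.0) + token_entropy(probs)
        counts[role] = counts.get(role, 0) + 1
    return {role: totals[role] / counts[role] for role in totals}


# Hypothetical distributions over a 4-token vocabulary, NOT measured data.
dists = [
    [0.25, 0.25, 0.25, 0.25],  # planning token: model is undecided
    [0.97, 0.01, 0.01, 0.01],  # execution token: model is nearly certain
    [0.50, 0.50, 0.00, 0.00],  # planning token
    [1.00, 0.00, 0.00, 0.00],  # execution token
]
roles = ["planning", "execution", "planning", "execution"]

by_role = mean_entropy_by_role(dists, roles)
```

In this toy sample, low execution entropy alongside high planning entropy is the pattern the note associates with the second phase, when shifting optimization toward planning tokens starts to pay off.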

The most striking ordering result is about pairing methods rather than topics. Running a supervised imitation stage (the source note calls it Supervised RL) *first* to build reasoning foundations, then verifiable-reward RL (RLVR) to sharpen them, beats either method used alone, and the order is the whole point. The imitation phase creates plausible attempts for the RL phase to refine; without it, the outcome rewards have nothing informative to grab onto (see "Does sequencing imitation then exploration training improve reasoning?"). This is a recurring theme: RL only works when the reward signal can actually discriminate, which is why dramatic gains show up on tasks with clean binary rewards and barely move on fuzzy judgment-based ones (see "Why does RL succeed more on some tasks than others?"). Curriculum ordering is partly a trick for making later rewards *legible*: you arrange training so each stage hands the next a signal it can learn from.
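The "nothing informative to grab onto" point can be shown with a few lines of arithmetic. The sketch below uses group-normalized advantages in the style of GRPO, which is an assumption on my part: the notes do not say which RL algorithm the cited work uses. The reward lists are invented.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of rollouts for the same prompt.

    Each rollout's advantage is its reward minus the group mean, divided by
    the group standard deviation (eps avoids division by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical binary outcome rewards for 8 rollouts of one prompt.
cold_start = [0.0] * 8  # no rollout succeeds
after_imitation = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]  # a few succeed

adv_cold = group_advantages(cold_start)
adv_warm = group_advantages(after_imitation)

# With identical rewards every advantage is zero, so the policy gradient
# carries no information about which rollouts to prefer.
has_signal_cold = any(a != 0.0 for a in adv_cold)
has_signal_warm = any(a != 0.0 for a in adv_warm)
```

When every rollout fails, all advantages are zero and the update has no direction. Once an imitation stage lifts even a few rollouts to success, the same binary reward separates good attempts from bad ones.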

There's a quieter ordering question lurking in how you treat individual episodes, too. SkillRL shows that successes and failures shouldn't be consolidated the same way: keep wins as concrete demonstrations and distill losses into abstract lessons. That suggests the *processing order and asymmetry* of experiences matters as much as the task schedule (see "Should successful and failed episodes be processed differently?"). And if you want to look further upstream, there's work pushing reasoning earlier than the RL stage entirely, planting chain-of-thought during pretraining so the model arrives at RL already primed (see "Can chain-of-thought reasoning be learned during pretraining itself?").
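The asymmetry described for SkillRL can be pictured as a memory with two differently shaped stores. This is only a sketch of the idea, not SkillRL's implementation: the `Episode` and `SkillMemory` types and the `failure_note` field are hypothetical, and in a real system the distilled lesson would presumably come from a model rather than a pre-written string.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Episode:
    task: str
    steps: List[str]
    success: bool
    # Hypothetical stand-in for a distilled diagnosis of why the episode failed.
    failure_note: str = ""


@dataclass
class SkillMemory:
    demonstrations: List[List[str]] = field(default_factory=list)  # full traces
    lessons: List[str] = field(default_factory=list)               # short abstractions

    def consolidate(self, ep: Episode) -> None:
        if ep.success:
            # Successes are kept concrete: store the whole step sequence.
            self.demonstrations.append(list(ep.steps))
        elif ep.failure_note and ep.failure_note not in self.lessons:
            # Failures are kept abstract: store one deduplicated lesson,
            # not the failed trace itself.
            self.lessons.append(ep.failure_note)


memory = SkillMemory()
memory.consolidate(Episode("book flight", ["search", "select", "pay"], True))
memory.consolidate(
    Episode("book hotel", ["search", "pay"], False, "confirm dates before paying")
)
memory.consolidate(
    Episode("book car", ["pay"], False, "confirm dates before paying")
)
```

After these three episodes the memory holds one full demonstration and a single lesson, which hints at why the asymmetric approach can use less context than storing every trace uniformly.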

The thing you might not have expected to learn: the reason ordering matters isn't really about "warming up" the model. It's that each training stage changes what the *next reward signal can teach*. Structured tasks collapse the entropy that creative tasks need; imitation manufactures the rollouts that verifiable rewards depend on. Ordering is how you keep the learning signal informative all the way through — get the sequence wrong and you don't just lose efficiency, you erase capabilities the model already had.

Sources (6 notes)

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Does sequencing imitation then exploration training improve reasoning?

Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.

Why does RL succeed more on some tasks than others?

Binary verifiable rewards enable dramatic RL gains (0.15% to 73.98%), while judgment-based evaluation yields modest improvements (55% reduction). Clear reward signals unlock suppressed capabilities; fuzzy signals barely move the needle.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Can chain-of-thought reasoning be learned during pretraining itself?

RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.
