Can backward transfer measurements reliably predict optimal multi-task training order?
This explores whether backward transfer (BWT), a measure of how training on a later task helps or hurts performance on tasks learned earlier, can serve as a planning signal for choosing the order in which a model is trained on several tasks.
The real test is whether backward transfer scores can plan an ordering in advance rather than just describe what happened after the fact. The strongest case for "yes" in the corpus comes from Omni-Thinker (see "Does training order reshape how models handle different task types?"), where BWT-guided scheduling, which trains structured, verifiable tasks before open-ended creative ones, beats joint training by 6.2%. But notice *why* it works: the win comes not from anything special about BWT but from the deeper mechanical cause it happens to track. Structured domains shrink a model's output entropy while creative ones expand it, so training them in the wrong order lets entropy collapse permanently damage open-ended ability. BWT is reliable here precisely because it is a proxy for that one-directional entropy dynamic.
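To make the ordering decision above concrete, here is a minimal sketch of how a BWT score could be computed and used to compare two candidate orders. It assumes the common continual-learning definition of BWT (final accuracy on each earlier task minus accuracy right after that task was trained, averaged over earlier tasks). The accuracy matrices, the order names, and the `backward_transfer` helper are hypothetical and are not taken from Omni-Thinker.

```python
from typing import Dict, List


def backward_transfer(acc: List[List[float]]) -> float:
    """Average change in accuracy on earlier tasks once all training is done.

    acc[i][j] is the accuracy on the j-th task (in training order), measured
    right after finishing training stage i. Negative values mean forgetting.
    """
    n = len(acc)
    if n < 2:
        raise ValueError("BWT needs at least two tasks")
    final = acc[n - 1]
    # Compare final accuracy on each earlier task with its just-trained accuracy.
    return sum(final[j] - acc[j][j] for j in range(n - 1)) / (n - 1)


# Hypothetical accuracy matrices, invented purely for illustration.
# Column order follows the training order named in the key.
candidate_orders: Dict[str, List[List[float]]] = {
    "structured_then_creative": [[0.80, 0.30], [0.78, 0.70]],
    "creative_then_structured": [[0.70, 0.20], [0.55, 0.80]],
}

scores = {name: backward_transfer(m) for name, m in candidate_orders.items()}
best_order = max(scores, key=scores.get)  # least forgetting wins
print(scores, best_order)
```

Note that the score only exists after each candidate order has been trained and evaluated at least once, which is why using it as a planning signal depends on the measurement carrying over to later runs.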
That caveat matters, because a parallel finding suggests the underlying damage BWT measures can be fast and hard to reverse. RL training tends to lock onto a single dominant output format within the first epoch and suppress the alternatives (see "Does RL training collapse format diversity in pretrained models?"), and the format that wins depends on model scale, not necessarily on which format performs best. If the destructive collapse happens that early, a training-order plan built on BWT is really a bet on getting the *first* task right, not on fine adjustments to a long sequence. Order matters most at the front.
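One way to watch for this kind of early collapse is to track the Shannon entropy of the output-format distribution across sampled generations. The sketch below is an illustration only: it assumes each sampled output has already been tagged with a format label by some hypothetical classifier, and the source does not say that collapse was measured this way.

```python
import math
from collections import Counter
from typing import Sequence


def format_entropy(format_labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of the empirical format distribution."""
    if not format_labels:
        raise ValueError("need at least one sampled output")
    counts = Counter(format_labels)
    total = len(format_labels)
    # Sum p * log2(1/p) over the formats that actually occur.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical format labels for 100 sampled outputs (assumed, not measured).
before_rl = ["code", "prose", "latex", "bullets"] * 25  # four formats, evenly used
after_epoch_1 = ["code"] * 100                          # one format dominates

print(format_entropy(before_rl), format_entropy(after_epoch_1))
```

A sharp drop between the pretrained checkpoint and the end of the first epoch would be the signature of the lock-in described above.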
There's also a quieter warning about whether BWT measured on one model transfers to another. Teacher-refined data that is objectively higher quality still *degrades* a student when it exceeds the student's learning frontier (see "Does teacher-refined data always improve student model performance?"), which means transfer effects are model-specific, not properties of the tasks alone. A backward-transfer score is similarly a property of *this* model's current state, so an "optimal order" derived from it may not generalize across scales or checkpoints.
The corpus also hints that you don't always need BWT to order training well: other, cheaper signals do similar work. Representation sparsity can order few-shot demonstrations from hard to easy with no difficulty labels at all (see "Can representation sparsity order few-shot demonstrations effectively?"), and reverse-curriculum scheduling orders reasoning by sliding the start state backward from near-completion (see "Can curriculum learning approximate expensive process supervision?"). These suggest "optimal order" is often recoverable from intrinsic difficulty or geometry signals, with BWT being one tool among several rather than the privileged predictor.
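The sparsity-based ordering can be sketched in a few lines. This is an illustration under stated assumptions: sparsity is taken to be the fraction of near-zero entries in a last-layer activation vector, the `eps` cutoff and the activation values are invented, and the actual measure used in the cited work may differ.

```python
from typing import List, Sequence, Tuple


def activation_sparsity(activations: Sequence[float], eps: float = 1e-6) -> float:
    """Fraction of activation entries whose magnitude is at most eps (assumed metric)."""
    if not activations:
        raise ValueError("empty activation vector")
    return sum(1 for a in activations if abs(a) <= eps) / len(activations)


def order_sparse_to_dense(demos: List[Tuple[str, Sequence[float]]]) -> List[str]:
    """Order demonstrations from sparsest (treated as harder) to densest (easier)."""
    ranked = sorted(demos, key=lambda d: activation_sparsity(d[1]), reverse=True)
    return [text for text, _ in ranked]


# Hypothetical demonstrations paired with made-up last-layer activations.
demos = [
    ("demo_a", [0.0, 0.0, 0.0, 1.2]),  # 3 of 4 entries are zero
    ("demo_b", [0.4, 0.9, 0.1, 0.7]),  # no zero entries
    ("demo_c", [0.0, 0.5, 0.0, 0.3]),  # 2 of 4 entries are zero
]

ordered = order_sparse_to_dense(demos)
print(ordered)
```

The point of the sketch is that the ordering needs only a forward pass per demonstration, with no training runs and no difficulty labels, which is what makes it cheaper than measuring BWT.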
The surprising twist: there's evidence that some learning outcomes are predictable *before* any gradient step, from pre-learning probabilities alone (see "Can we predict keyword priming before learning happens?"). And length-generalization ability transfers between related tasks because they share and reuse the same attention heads (see "Can length generalization transfer between different related tasks?"), implying that transfer is structural and sometimes foreseeable from architecture rather than something you must discover empirically for each ordering. So the honest answer is: BWT can reliably guide order *when* it tracks a dominant directional mechanism like entropy collapse. But it is a symptom reading of model-specific dynamics, not a universal scheduler, and cheaper or even a priori signals can often substitute.
Sources (7 notes)
Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.
Sparsity-Guided Curriculum In-Context Learning uses last-layer activation sparsity to order demonstrations from sparse (harder) to dense (easier), yielding considerable performance improvements. This approach requires no external difficulty labels and works across diverse in-context learning tasks.
R3 progressively slides the reasoning start state backward from near-completion, creating a curriculum that reveals step-level failure modes using only outcome feedback. This achieves process supervision granularity without expensive human step annotations.
Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.
Models trained jointly on related tasks reuse the same attention heads to handle length generalization, allowing shorter tasks to extrapolate beyond their training length. Pretrained models already contain this reusable computational scaffolding.