Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models

Paper · arXiv 2305.14705 · Published May 24, 2023
Training and Fine-Tuning

Sparse Mixture-of-Experts (MoE) is a neural architecture design that adds learnable parameters to Large Language Models (LLMs) without increasing inference cost. Instruction tuning is a technique for training LLMs to follow instructions. We advocate combining these two approaches, as we find that MoE models benefit more from instruction tuning than dense models do. In particular, we conduct empirical studies across three experimental setups: (i) direct finetuning on individual downstream tasks without instruction tuning; (ii) instruction tuning followed by in-context few-shot or zero-shot generalization on downstream tasks; and (iii) instruction tuning followed by further finetuning on individual downstream tasks. In the first setup, MoE models generally underperform dense models of identical computational capacity. This picture changes dramatically once instruction tuning is introduced (the second and third setups), whether used on its own or together with task-specific finetuning. Our most powerful model, FLAN-MOE-32B, surpasses FLAN-PALM-62B on four benchmark tasks while using only a third of the FLOPs.
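To make the claim about parameters versus inference cost concrete, here is a minimal sketch of a sparse MoE feed-forward layer with top-k routing, written in plain Python. It is an illustration under stated assumptions, not the FLAN-MOE implementation: the helper names (`make_expert`, `make_router`, `moe_forward`, `count_params`), the ReLU experts, the softmax over the selected logits, and the toy dimensions are all hypothetical choices made for this example.

```python
import math
import random


def make_expert(d_model, d_hidden, rng):
    """One feed-forward expert: an input and an output weight matrix."""
    w_in = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]
    w_out = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    return w_in, w_out


def make_router(d_model, num_experts, rng):
    """Router weights: one column of logit weights per expert."""
    return [[rng.gauss(0, 0.1) for _ in range(num_experts)] for _ in range(d_model)]


def matvec(x, w):
    """Compute x @ w for a vector x and a matrix w with len(x) rows."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def expert_forward(x, expert):
    """Apply one expert: linear, ReLU, linear."""
    w_in, w_out = expert
    hidden = [max(0.0, h) for h in matvec(x, w_in)]
    return matvec(hidden, w_out)


def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, experts, router, top_k=2):
    """Route one token to its top_k experts and mix their outputs."""
    logits = matvec(x, router)  # one logit per expert
    order = sorted(range(len(experts)), key=lambda e: logits[e], reverse=True)
    chosen = order[:top_k]
    weights = softmax([logits[e] for e in chosen])  # assumed gating choice
    out = [0.0] * len(x)
    for weight, e in zip(weights, chosen):
        y = expert_forward(x, experts[e])  # only the chosen experts run
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, chosen


def count_params(experts):
    """Total number of weights stored across all experts."""
    return sum(len(w) * len(w[0]) for expert in experts for w in expert)


if __name__ == "__main__":
    rng = random.Random(0)
    d_model, d_hidden = 8, 16
    token = [rng.gauss(0, 1) for _ in range(d_model)]
    for num_experts in (4, 16):
        experts = [make_expert(d_model, d_hidden, rng) for _ in range(num_experts)]
        router = make_router(d_model, num_experts, rng)
        _, chosen = moe_forward(token, experts, router, top_k=2)
        print(num_experts, count_params(experts), len(chosen))
```

With these toy sizes, going from 4 to 16 experts raises the stored expert weights from 1,024 to 4,096, yet each token still passes through only two experts. The router's own cost does grow with the number of experts, so "no added inference cost" is best read as approximately constant expert compute per token.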

Introduction. Recent years have witnessed remarkable advances in natural language processing (NLP), driven by the development of increasingly large and sophisticated deep learning models. Among these, transformer-based language models [49] have emerged as the de facto standard for a wide range of NLP tasks, owing to their ability to capture complex linguistic patterns and generalize across diverse contexts. One particularly successful paradigm for training such models is instruction tuning [44, 52, 4, 28, 34, 38], which enhances their performance on specific tasks by adapting their pre-trained representations to follow natural language instructions. While the benefits of Large Language Models (LLMs) are indisputable, their rapidly growing size and computational requirements pose significant challenges for training efficiency, memory footprint, and deployment cost. Consequently, there is a pressing need for scalable techniques that can harness the power of these models without incurring prohibitive computational overhead.

Discussion / Conclusion. In this work, we have introduced FLAN-MOE, a method for improving the scalability of instruction-tuned language models by employing the sparse Mixture-of-Experts (MoE) technique. Our strategy combines the strengths of instruction finetuning, which bolsters task-specific performance, and MoE, which provides computational efficiency coupled with diminished memory requirements. We have substantiated the effectiveness of FLAN-MOE through comprehensive experiments across a wide spectrum of natural language processing (NLP) tasks, including natural language understanding, question answering, and reasoning. Our results consistently show that FLAN-MOE outperforms current state-of-the-art methods, with substantial gains in both accuracy and efficiency.