Continual Instruction Tuning for Large Multimodal Models
Instruction tuning is now a widely adopted approach to aligning large multimodal models (LMMs) with human intent. It unifies the data format of vision-language tasks, enabling multi-task joint training. However, new vision-language tasks are constantly being created in practice. Instead of re-training LMMs every time new tasks arrive, continual learning gives models the flexibility to exploit the evolving data continually and efficiently. This work explores two questions: 1) Do LMMs still suffer from catastrophic forgetting in continual instruction tuning? 2) Are the three existing classes of continual learning methods still applicable to the continual instruction tuning of LMMs? We conduct an extensive study to address these questions. First, we establish the first benchmark in this setting and reveal that catastrophic forgetting is still observed when continually instruction-tuning LMMs. However, multi-task joint instruction tuning can improve the model’s continual learning ability and mitigate forgetting. Second, we integrate and adapt classic continual learning methods to our context, demonstrating the efficacy of data replay and model expansion strategies across diverse scenarios.
Introduction. Inspired by the success of GPT-4, an array of works on large multimodal models (LMMs) has emerged recently [5, 18, 19, 45]. These LMMs typically undergo a two-stage training process: first pretraining for text-image alignment, then finetuning for downstream tasks. In the second stage, instruction tuning stands out as a widely adopted scheme for aligning LMMs with human intent. This approach enables multi-task training with a unified image-instruction-output data format and helps the trained models generalize to unseen tasks [14]. While LMMs exhibit impressive zero-shot performance on unseen instructions, expanding the training datasets to incorporate new task data can substantially enhance their capabilities on the new task [18]. However, since vision-language tasks can be constantly created, it is costly to merge the incoming data and retrain the LMMs every time. Hence, we seek an approach that makes the model flexible enough to exploit the ever-emerging data continually and efficiently.
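To make the setting concrete, the unified image-instruction-output format and the idea of learning tasks one after another can be sketched in a few lines of Python. This is only an illustrative sketch under our own assumptions: the `Sample` record, the helper names `update_buffer` and `build_stage_data`, the file paths, and the buffer sizes are all hypothetical and are not taken from the benchmark or the methods studied in this paper. It shows one simple form of data replay, in which a few stored samples from earlier tasks are mixed into the training data of the next task.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One record in a unified image-instruction-output format (hypothetical schema)."""
    image_path: str
    instruction: str
    output: str
    task: str


def update_buffer(replay_buffer, task_name, samples, keep, seed=0):
    """Store up to `keep` randomly chosen samples of a finished task."""
    rng = random.Random(seed)
    replay_buffer[task_name] = rng.sample(samples, min(keep, len(samples)))


def build_stage_data(new_task, replay_buffer, replay_per_task, seed=0):
    """Mix the new task's samples with a few stored samples from each earlier task."""
    rng = random.Random(seed)
    mixed = list(new_task)
    for stored in replay_buffer.values():
        mixed.extend(rng.sample(stored, min(replay_per_task, len(stored))))
    rng.shuffle(mixed)
    return mixed


# Two toy tasks expressed in the same format (contents are placeholders).
vqa = [Sample(f"img_{i}.jpg", "What color is the car?", "red", "vqa") for i in range(10)]
cap = [Sample(f"img_{i}.jpg", "Describe the image.", "a car on a road", "captioning")
       for i in range(10)]

buffer = {}
update_buffer(buffer, "vqa", vqa, keep=3)      # after finishing the first task
stage2 = build_stage_data(cap, buffer, replay_per_task=2)  # data for the second task
```

Here `stage2` contains all captioning samples plus two replayed VQA samples. Without the replayed samples, the second stage would train on captioning data alone, which is the sequential setting in which forgetting of the first task is a concern.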
Discussion / Conclusion. In this paper, we conducted a comprehensive study on continual instruction tuning for large multimodal models. First, we established the first benchmarks in this setup and found that sequential instruction tuning on these benchmarks still leads to catastrophic forgetting. Second, by integrating or adapting existing continual learning methods, we consistently observed favorable results with replay-based and model expansion methods. Regularization-based methods, however, are effective only when the model has first been jointly instruction-tuned on multiple tasks. Third, observing that task similarity greatly affects the model’s anti-forgetting and transfer ability, we introduced it into the regularization-based and model expansion methods to enhance their performance and utility. We hope that this work will provide some guidance to the community and contribute to the development of new continual instruction tuning methods for LMMs.