Do Models Really Learn to Follow Instructions? An Empirical Study of Instruction Tuning
Recent work on instruction tuning (IT) has achieved strong performance with zero-shot generalizability to unseen tasks. When additional context (e.g., task definitions, examples) is provided to models during fine-tuning, they achieve much higher performance than untuned models. Despite these impressive gains, what models learn from IT remains understudied. In this work, we analyze how models utilize instructions during IT by comparing models trained with altered vs. original instructions. Specifically, we create simplified task definitions by removing all semantic components and leaving only the output space information, and delusive examples that contain incorrect input-output mappings. Our experiments show that models trained on simplified task definitions or delusive examples can achieve performance comparable to models trained on the original instructions and examples. Furthermore, we introduce a random baseline for zero-shot classification tasks and find that it achieves similar performance (42.6% exact-match) to IT (43% exact-match) in the low-resource setting, while both methods significantly outperform naive T5 (30% exact-match).
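The alterations and the random baseline described above can be illustrated with a minimal Python sketch. All function names, the "Output space:" template, and the case-insensitive matching rule are hypothetical choices made for illustration only; they are assumptions about one plausible reading of the setup, not the implementation used in this study.

```python
import random


def simplify_definition(labels):
    """Build a task definition that keeps only the output-space information.

    Assumption: the simplified definition is just a list of allowed labels;
    the template string is hypothetical.
    """
    return "Output space: " + ", ".join(labels)


def make_delusive(examples, labels, rng):
    """Return copies of (input, output) examples with an incorrect output.

    Each gold label is swapped for a different label drawn from the same
    output space, so the input-output mapping is wrong but the output
    space is preserved.
    """
    delusive = []
    for text, gold in examples:
        wrong_labels = [label for label in labels if label != gold]
        delusive.append((text, rng.choice(wrong_labels)))
    return delusive


def random_baseline(labels, n, rng):
    """Predict by sampling uniformly from the output space (assumed variant)."""
    return [rng.choice(labels) for _ in range(n)]


def exact_match(predictions, references):
    """Percentage of predictions equal to the reference after normalization.

    Assumption: normalization is whitespace stripping plus lowercasing.
    """
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return 100.0 * hits / len(references)


if __name__ == "__main__":
    rng = random.Random(0)
    labels = ["positive", "negative"]
    examples = [("great movie", "positive"), ("dull plot", "negative")]
    print(simplify_definition(labels))
    print(make_delusive(examples, labels, rng))
    guesses = random_baseline(labels, len(examples), rng)
    print(exact_match(guesses, [gold for _, gold in examples]))
```

The sketch only shows why output-space information alone can carry a classification task some distance: a predictor that knows nothing but the label set already scores at chance level under exact-match.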
Introduction. Recently, instruction tuning (IT) has drawn much attention in the NLP community, with the rapid growth of new models (Sanh et al., 2021; Wei et al., 2021; Ouyang et al., 2022) and datasets (Wang et al., 2022; Gupta et al., 2022; Finlayson et al., 2022; Mishra et al., 2021; Ye et al., 2021; Bach et al., 2022). Models trained with task instructions demonstrate impressive zero-shot cross-task generalization ability. Despite these remarkable results, how models utilize instructions during training and inference remains an open question. Prior work has raised the question of whether models really learn to follow instructions or merely capture spurious correlations. Jang et al. (2022) and Webson and Pavlick (2021) showed that current large language models (LLMs) can achieve similar performance with misleading instructions (prompts) in in-context learning (ICL) and few-shot learning scenarios. Min et al. (2022) analyzed how models utilize examples in ICL. They observed that (1) the input-output mapping in examples is not important and (2) output space information is crucial.
Discussion / Conclusion. Does Alpaca follow instructions better on the NatInst-V2 dataset? After our submission, new instruction-tuned models such as Alpaca and Vicuna were released; they are trained on data distilled from ChatGPT and exhibit behavior closer to it. To investigate how they utilize instructions, we conduct the “Altered Task Definition” experiment on the LLaMA-7B (Touvron et al., 2023) and Alpaca-7B models using the NatInst-V2 test set. As shown in Table 2, training the LLaMA model on the NatInst-V2 dataset with the Original task definition yields substantial improvements over zero-shot performance. However, the Simplified task definition achieves comparable performance, with a minimal decrease of 3 points in (EM/Rouge-L) scores. This finding is consistent with our previous observations on the TK-Instruct and T0 models. Even without tuning on NatInst-V2, the Alpaca model demonstrates strong performance on the NatInst-V2 test set.