Towards Optimal Learning of Language Models

Paper · arXiv 2402.17759 · Published February 27, 2024
LLM Architecture

This work studies the general principles of improving the learning of language models (LMs), with the aim of reducing the number of training steps needed to reach superior performance. Specifically, we present a theory for the optimal learning of LMs. We first propose an objective that optimizes LM learning by maximizing the data compression ratio in an “LM-training-as-lossless-compression” view. Then, we derive a theorem, named Learning Law, to reveal the properties of the dynamics in the optimal learning process under our objective. The theorem is then validated by experiments on a linear classification task and a real-world language modeling task. Finally, we empirically verify that the optimal learning of LMs essentially stems from the improvement of the coefficients in the scaling law of LMs, indicating great promise and significance for designing practical learning acceleration methods. Our code can be found at https://aka.ms/LearningLaw.

Introduction. As language models (LMs; HZD+21, BHA+21) thrive, there is an increasing focus on improving the learning [PR12, WSSC19] of LMs, that is, accelerating learning so that a given model performance is reached with as few training steps as possible [WWL+23]. This focus helps humans explore the limits of LMs given the rapid growth of their computational demand [HBM+22], and promotes democratization [CMS+23] of large language models (LLMs; BMR+20, Ope22, Ope23, CND+23, ADF+23), which is valuable for both research communities and industry sectors [TLI+23, TMS+23, JSM+23]. In this paper, we present a theory for optimal learning of LMs. Unlike prior works exploring practical acceleration methods at the model level [XYH+20, ZH20], optimizer level [YLR+20, LLH+24], or data level [TSAM23, ATS+23, XPD+23], our work demonstrates the principles of optimizing the LM learning speed, including the optimization objective, the property of optimal learning dynamics, and the essential improvement that learning acceleration brings.

Discussion / Conclusion. Summary. In this work, we establish a theory for the optimal learning of LMs. We propose an objective that maximizes the compression ratio in an LM-training-as-lossless-compression view. Then we derive a theorem, named Learning Law, suggesting that all examples should contribute equally to the LM in the optimal learning process, which is then validated by experiments in linear classification and real-world language modeling tasks. Finally, we empirically show that the optimal learning process essentially improves the scaling law coefficients of LMs, which sheds light on future works that design practical learning acceleration approaches. Limitations. One limitation of our work is that the experiments are conducted at relatively small scales. This is because our method to find the near-optimal learning policy corresponds to training a neural network with L × T layers, where L is the number of layers in the LM and T is the LM’s total number of training steps (see Appendix C for details). This leads to a high computational overhead when L and T scale up. However, since the theoretical derivation is generally applicable, we believe that our theory can be applied to LLMs.
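As a rough illustration of the lossless-compression view mentioned in the summary, the sketch below computes a compression ratio from a training loss curve. It rests on assumptions not stated in this text: each batch is encoded with the model as it stands before the update (a prequential, arithmetic-coding-style scheme), so the compressed size is the sum of the per-step losses converted to bits. The function name `compression_ratio`, its parameters, and the two loss curves are hypothetical and chosen only for illustration; this is not the paper's released code.

```python
import math


def compression_ratio(losses_nats, tokens_per_step, bits_per_raw_token):
    """Raw data size divided by an assumed prequential code length.

    losses_nats: average per-token loss (in nats) at each training step,
        measured on the batch *before* the model is updated on it.
    tokens_per_step: number of tokens in each batch (assumed constant).
    bits_per_raw_token: size of one token in the uncompressed data.
    """
    if not losses_nats:
        raise ValueError("need at least one training step")
    # Under the coding assumption, a per-token loss of `loss` nats
    # costs loss / ln(2) bits per token.
    compressed_bits = sum(
        loss / math.log(2) * tokens_per_step for loss in losses_nats
    )
    raw_bits = len(losses_nats) * tokens_per_step * bits_per_raw_token
    return raw_bits / compressed_bits


# Two made-up loss curves that reach the same final loss.
slow_curve = [4.0, 3.0, 2.0, 1.0]
fast_curve = [4.0, 2.0, 1.0, 1.0]

ratio_slow = compression_ratio(slow_curve, tokens_per_step=1024, bits_per_raw_token=16)
ratio_fast = compression_ratio(fast_curve, tokens_per_step=1024, bits_per_raw_token=16)
print(ratio_fast > ratio_slow)
```

Under these assumptions, the curve whose loss drops sooner has the smaller summed loss and therefore the higher compression ratio, which is one way to see why maximizing the compression ratio rewards faster learning and not only a lower final loss.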