Consistency Models Made Easy

Paper · arXiv 2406.14548 · Published June 20, 2024

Consistency models (CMs) offer faster sampling than traditional diffusion models, but their training is resource-intensive. For example, as of 2024, training a state-of-the-art CM on CIFAR-10 takes one week on 8 GPUs. In this work, we propose an effective scheme for training CMs that substantially improves the efficiency of building such models. Specifically, by expressing CM trajectories via a particular differential equation, we argue that diffusion models can be viewed as a special case of CMs. We can thus fine-tune a consistency model starting from a pretrained diffusion model and enforce the full consistency condition progressively more strictly over the course of training. Our resulting method, which we term Easy Consistency Tuning (ECT), achieves vastly reduced training times while improving upon the quality of previous methods: for example, ECT achieves a 2-step FID of 2.73 on CIFAR-10 within 1 hour on a single A100 GPU, matching Consistency Distillation trained for hundreds of GPU hours. Owing to this computational efficiency, we investigate the scaling laws of CMs under ECT, showing that they obey classic power-law scaling, which hints at their ability to improve efficiency and performance at larger scales.
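
To make the claim that diffusion models are a special case of CMs more concrete, here is a minimal scalar sketch; it is not the authors' implementation. It assumes an EDM-style noising x_t = x_0 + t * eps, a squared-error metric, a hypothetical one-parameter `toy_net`, and a hypothetical schedule `r_from_t` that shrinks the gap between two noise levels t and r in stages. Under these assumptions, setting r = 0 makes the target the clean sample itself, so the consistency loss coincides with an ordinary denoising loss. Later stages move r toward t and enforce the consistency condition more tightly.

```python
import math
import random

SIGMA_DATA = 0.5  # assumed data scale, for illustration only


def toy_net(x, t, w):
    # Hypothetical stand-in for a neural network: a single scalar weight.
    return w * x


def f(x, t, w):
    """Toy consistency function built so that f(x, 0) = x (boundary condition)."""
    c_skip = SIGMA_DATA**2 / (t**2 + SIGMA_DATA**2)
    c_out = t * SIGMA_DATA / math.sqrt(t**2 + SIGMA_DATA**2)
    return c_skip * x + c_out * toy_net(x, t, w)


def consistency_loss(x0, eps, t, r, w):
    # Both noisy points share the same clean sample and the same noise draw.
    x_t = x0 + t * eps
    x_r = x0 + r * eps
    # In an autodiff framework the target would be wrapped in a stop-gradient.
    target = f(x_r, r, w)
    return (f(x_t, t, w) - target) ** 2


def denoising_loss(x0, eps, t, w):
    # Standard diffusion-style objective: predict the clean sample from x_t.
    return (f(x0 + t * eps, t, w) - x0) ** 2


def r_from_t(t, stage):
    """Hypothetical schedule: r/t = 1 - 1/2**stage, so stage 0 gives r = 0."""
    return t * (1.0 - 0.5**stage)


rng = random.Random(0)
x0, eps, w, t = rng.gauss(0, 1), rng.gauss(0, 1), 0.3, 1.5

loss_stage0 = consistency_loss(x0, eps, t, r_from_t(t, 0), w)  # r = 0
loss_diffusion = denoising_loss(x0, eps, t, w)
loss_stage3 = consistency_loss(x0, eps, t, r_from_t(t, 3), w)  # r close to t
```

At stage 0 the two losses are the same quantity, which is why a pretrained diffusion model is a natural starting point for tuning in this toy setting. The halving schedule is only one possible way to tighten the condition.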

Introduction. Diffusion Models (DMs) (Ho et al., 2020; Song et al., 2021a), or Score-based Generative Models (SGMs) (Song et al., 2020; 2021b), have vastly changed the landscape of visual content generation with applications in images (Rombach et al., 2021; Saharia et al., 2022; Ho et al., 2022a; Dhariwal and Nichol, 2021; Hatamizadeh et al., 2023; Ramesh et al., 2021), videos (Brooks et al., 2024; Blattmann et al., 2023; Bar-Tal et al., 2024; Ho et al., 2022b; Gupta et al., 2023), and 3D objects (Poole et al., 2022; Wang et al., 2024a; Lee et al., 2024; Chen et al., 2024; Babu et al., 2023). DMs progressively transform a data distribution to a known prior distribution (e.g. Gaussian noise) according to a stochastic differential equation (SDE) (Song et al., 2021b) and train a model to denoise noisy observations. Samples can be generated via a reverse-time SDE that starts from noise and uses the trained model to progressively denoise it.
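
As a rough illustration of sampling with a reverse-time SDE, the sketch below uses a one-dimensional toy problem; it is not taken from the paper. It assumes Gaussian data N(MU, S^2) and a variance-exploding forward process with noise level sigma(t) = t, so the noised marginal is N(MU, S^2 + t^2) and its score is known in closed form. That exact score stands in for a trained denoiser. The constants `MU`, `S`, and `T_MAX` and the Euler-Maruyama step count are arbitrary choices.

```python
import math
import random

MU, S = 2.0, 0.5   # assumed toy data distribution N(MU, S^2)
T_MAX = 10.0       # assumed largest noise level


def score(x, t):
    """Exact score of p_t = N(MU, S^2 + t^2); stands in for a trained model."""
    return -(x - MU) / (S**2 + t**2)


def sample(rng, n_steps=400):
    """Euler-Maruyama integration of the reverse-time SDE from t = T_MAX to 0."""
    dt = T_MAX / n_steps
    # Start from the Gaussian prior (approximating N(MU, S^2 + T_MAX^2)).
    x = rng.gauss(0.0, T_MAX)
    for k in range(n_steps, 0, -1):
        t = k * dt
        g2 = 2.0 * t  # squared diffusion coefficient g(t)^2 when sigma(t) = t
        # One step backward in time: score-driven drift plus fresh noise.
        x = x + g2 * score(x, t) * dt + math.sqrt(g2 * dt) * rng.gauss(0.0, 1.0)
    return x


rng = random.Random(0)
samples = [sample(rng) for _ in range(1000)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((v - mean) ** 2 for v in samples) / len(samples))
```

The sample mean and standard deviation should land near `MU` and `S`. Each sample costs hundreds of score evaluations in this sketch, which is the sampling cost that few-step models such as CMs aim to reduce.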

Discussion / Conclusion. We propose Easy Consistency Tuning (ECT), a simple yet efficient scheme for training consistency models. The resulting models, ECMs, unlock state-of-the-art few-step generative capabilities at a minimal tuning cost and benefit from scaling. We have made our code available to ease future prototyping, study, and deployment of consistency models within the community. One of the major limitations of ECT is that it requires a dataset to tune DMs into CMs. Recent works have developed data-free approaches (Luo et al., 2024; Gu et al., 2023; Yin et al., 2023; Zhou et al., 2024a) for diffusion distillation. The distinction between ECT and data-free methods is that ECT learns the consistency condition on a given dataset through a self-teacher, while data-free methods transfer knowledge from a frozen diffusion teacher. This dependence on data can be a limitation, since the training data of bespoke models are unavailable to the public. However, we hold an optimistic view on tuning CMs using datasets different from those used for pretraining.