Diffusion models are a new class of state-of-the-art generative models that produce diverse, high-resolution images. Several diffusion models are already well known, including OpenAI's DALL-E 2 and GLIDE, Google's Imagen, and Stability AI's Stable Diffusion. In this blog post, we will work our way up from the basic principles behind the most prominent formulation, Denoising Diffusion Probabilistic Models (DDPM), as introduced by Sohl-Dickstein et al. in 2015 and then improved by Ho et al. in 2020.
Images produced by DALL-E 2
The basic idea behind diffusion models is rather simple. We take an input image and gradually add Gaussian noise to it through a series of time steps; we will call this the forward process. A network is then trained to recover the original image by reversing the noising process. Once the model can reverse the noising, we can start from random noise and denoise it step-by-step to generate new data.
Forward Diffusion Process
Consider an image $\mathbf{x}_0 \sim q(\mathbf{x}_0)$ sampled from the real data distribution (or the training set). The subscript denotes the time step. The forward process, denoted by $q$, is modeled as a Markov chain, where the distribution at a particular time step depends only on the sample from the previous step. The distribution of the corrupted samples can therefore be written as,

$$q(\mathbf{x}_{1:T} \mid \mathbf{x}_0) = \prod_{t=1}^{T} q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$$
At each step of the Markov chain, we add Gaussian noise to $\mathbf{x}_{t-1}$, producing a new latent variable $\mathbf{x}_t$. The transition distribution forms a unimodal diagonal Gaussian,

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\big(\mathbf{x}_t;\, \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\, \beta_t \mathbf{I}\big)$$
where $\beta_t$ is the variance of the Gaussian at time step $t$. It is a hyperparameter that follows a fixed schedule such that it increases with time and lies in the range $(0, 1)$.
Ho et al. set a linear schedule for the variance, starting from $\beta_1 = 10^{-4}$ and increasing to $\beta_T = 0.02$, with $T = 1000$.
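As a concrete illustration, the schedule can be precomputed once before training. The sketch below (plain Python; the helper name `linear_beta_schedule` is our own, not from the paper) builds the linear schedule together with the cumulative products $\bar{\alpha}_t$ used later for closed-form sampling:

```python
# A sketch of the linear variance schedule used by Ho et al.;
# the helper name is hypothetical.
def linear_beta_schedule(T=1000, beta_1=1e-4, beta_T=0.02):
    # beta_t increases linearly from beta_1 to beta_T
    return [beta_1 + (beta_T - beta_1) * t / (T - 1) for t in range(T)]

betas = linear_beta_schedule()
alphas = [1.0 - b for b in betas]

# Precompute the cumulative products alpha_bar_t = prod_{s<=t} alpha_s;
# since the schedule is fixed, this is done once.
alpha_bars = []
prod = 1.0
for a in alphas:
    prod *= a
    alpha_bars.append(prod)
```

Note how small `alpha_bars` becomes at the last step: almost all of the original signal has been replaced by noise.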
A latent variable $\mathbf{x}_t$ can be sampled from this distribution by using the reparameterization trick as,

$$\mathbf{x}_t = \sqrt{1-\beta_t}\,\mathbf{x}_{t-1} + \sqrt{\beta_t}\,\boldsymbol{\epsilon}_{t-1}$$

where $\boldsymbol{\epsilon}_{t-1} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
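On a single scalar "pixel", one forward step can be sketched as follows (a toy illustration; real implementations apply this to whole image tensors):

```python
import math, random

# One forward diffusion step for a scalar value:
# x_t = sqrt(1 - beta_t) * x_prev + sqrt(beta_t) * eps,  eps ~ N(0, 1).
def forward_step(x_prev, beta_t, eps=None):
    if eps is None:
        eps = random.gauss(0.0, 1.0)
    return math.sqrt(1.0 - beta_t) * x_prev + math.sqrt(beta_t) * eps
```

With `beta_t = 0` the sample passes through unchanged; with `beta_t = 1` the signal is replaced entirely by noise.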
The expression above shows that we need to compute all of the previous samples in order to obtain $\mathbf{x}_t$, making it expensive. To solve this problem, we define,

$$\alpha_t = 1 - \beta_t, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$$

and rewrite the sampling step in a recursive manner,

$$\mathbf{x}_t = \sqrt{\alpha_t}\,\mathbf{x}_{t-1} + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}_{t-1} = \cdots = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$$
The closed-form sampling at any arbitrary timestep $t$ can be carried out using the following distribution,

$$q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_t;\, \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\, (1-\bar{\alpha}_t)\mathbf{I}\big)$$
Since $\beta_t$ is a hyperparameter that is fixed beforehand, we can precompute $\alpha_t$ and $\bar{\alpha}_t$ for all timesteps and use the closed-form expression above to sample the latent variable $\mathbf{x}_t$ in one go.
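A minimal sketch of this one-shot sampling, again on a scalar "pixel" and assuming the linear schedule from the paper:

```python
import math, random

# Precompute the fixed schedule and cumulative products once.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
alpha_bars, prod = [], 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bars.append(prod)

# Closed-form sampling of x_t directly from x_0:
# x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps.
def q_sample(x0, t, eps=None):
    if eps is None:
        eps = random.gauss(0.0, 1.0)
    ab = alpha_bars[t]
    return math.sqrt(ab) * x0 + math.sqrt(1.0 - ab) * eps
```

By the last timestep the signal coefficient $\sqrt{\bar{\alpha}_t}$ is nearly zero, so the sample is almost pure noise.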
Reverse Diffusion Process
As $T \to \infty$, $\mathbf{x}_T$ approaches $\mathcal{N}(\mathbf{0}, \mathbf{I})$ (also called an isotropic Gaussian distribution), losing all information about the original sample. Therefore, if we manage to learn the reverse distribution $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$, we can sample $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and run the denoising process step-wise to generate a new sample.
With a small enough step size ($\beta_t \ll 1$), the reverse process has the same functional form as the forward process, so the reverse distribution can also be modeled as a unimodal diagonal Gaussian. Unfortunately, it is not straightforward to estimate $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$: it is intractable, since it requires knowing the distribution of all possible images in order to calculate this conditional probability.
Hence, we use a network to learn this Gaussian by parameterizing its mean and variance,

$$p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\big(\mathbf{x}_{t-1};\, \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\, \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)\big)$$
Apart from the latent sample $\mathbf{x}_t$, the model also takes the time step $t$ as input. Different time steps are associated with different noise levels, and the model learns to undo them individually.
Like the forward process, the reverse process can also be set up as a Markov chain. We can write the joint probability of the sequence of samples as,

$$p_\theta(\mathbf{x}_{0:T}) = p(\mathbf{x}_T) \prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$$

Here, $p(\mathbf{x}_T) = \mathcal{N}(\mathbf{0}, \mathbf{I})$, as we start the generation with a sample from the pure noise distribution.
Training Objective (Loss function)
The forward process is fixed; it is the reverse process that we solely focus on learning. Diffusion models can be seen as latent variable models, similar to variational autoencoders (VAEs), where $\mathbf{x}_0$ is an observed variable and $\mathbf{x}_1, \ldots, \mathbf{x}_T$ are latent variables.
Maximizing the variational lower bound (also called the evidence lower bound, ELBO) on the marginal log-likelihood forms the objective in VAEs. For an observed variable $\mathbf{x}$ and a latent variable $\mathbf{z}$, this lower bound can be written as,

$$\log p_\theta(\mathbf{x}) \geq \mathbb{E}_{q(\mathbf{z} \mid \mathbf{x})}\left[\log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q(\mathbf{z} \mid \mathbf{x})}\right]$$

Rewriting it in the diffusion model framework, we get,

$$\log p_\theta(\mathbf{x}_0) \geq \mathbb{E}_{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)}\left[\log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)}\right]$$

The objective of maximizing this lower bound is equivalent to minimizing a loss function that is its negation. After some rearranging, the loss decomposes into a sum of KL-divergence terms,

$$L = \mathbb{E}_q\Big[\underbrace{D_{\mathrm{KL}}\big(q(\mathbf{x}_T \mid \mathbf{x}_0) \,\|\, p(\mathbf{x}_T)\big)}_{L_T} + \sum_{t=2}^{T} \underbrace{D_{\mathrm{KL}}\big(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\big)}_{L_{t-1}} \underbrace{-\, \log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)}_{L_0}\Big]$$
The term $L_T$ has no trainable parameters, so it is ignored during training. Furthermore, since we have assumed a large enough $T$ such that the final distribution is Gaussian, this term effectively becomes zero.
The term $L_0$ can be interpreted as a reconstruction term (similar to the one in VAEs).
The term $L_{t-1}$ formulates the difference between the predicted denoising step $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ and the reverse diffusion step $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$, which is given as a target to the model. The target is explicitly conditioned on the original sample $\mathbf{x}_0$ in the loss function so that the distribution takes the form of a Gaussian.
But why do we need it to be Gaussian?
Since the model output $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ is already parameterized as a Gaussian, every KL term compares two Gaussian distributions, and therefore each can be computed in closed form. This makes the loss function tractable.
Intuitively, a painter (our generative model) needs a reference image ($\mathbf{x}_0$) to slowly draw (reverse diffusion step $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$) an image. Thus, we can take a small step backward from noise toward an image if and only if we have $\mathbf{x}_0$ as a reference.
Using Bayes' rule, we can write,

$$q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_{t-1};\, \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0),\, \tilde{\beta}_t \mathbf{I}\big)$$

such that the variance is,

$$\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t$$

and the mean is,

$$\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) = \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,\mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,\mathbf{x}_0$$
Recall that the closed-form forward sample is $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}_t$, which can be rewritten as,

$$\mathbf{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\big(\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}_t\big)$$

Substituting this into the expression for the mean gives,

$$\tilde{\boldsymbol{\mu}}_t = \frac{1}{\sqrt{\alpha_t}}\Big(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}_t\Big)$$
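As a numerical sanity check (a sketch of our own, not from the paper), the two forms of the posterior mean, one written in terms of $(\mathbf{x}_t, \mathbf{x}_0)$ and one in terms of $(\mathbf{x}_t, \boldsymbol{\epsilon}_t)$, agree:

```python
import math, random

random.seed(0)
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
alphas = [1.0 - b for b in betas]
alpha_bars, p = [], 1.0
for a in alphas:
    p *= a
    alpha_bars.append(p)

t = 500
x0 = 0.7
eps = random.gauss(0.0, 1.0)
ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
beta_t, alpha_t = betas[t], alphas[t]
x_t = math.sqrt(ab_t) * x0 + math.sqrt(1.0 - ab_t) * eps

# Posterior mean written in terms of x_t and x_0
mu_x0 = (math.sqrt(alpha_t) * (1.0 - ab_prev) / (1.0 - ab_t) * x_t
         + math.sqrt(ab_prev) * beta_t / (1.0 - ab_t) * x0)
# Posterior mean written in terms of x_t and the noise eps
mu_eps = (x_t - beta_t / math.sqrt(1.0 - ab_t) * eps) / math.sqrt(alpha_t)

assert abs(mu_x0 - mu_eps) < 1e-9
```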
Therefore, the term $L_{t-1}$ estimates the KL-divergence between the target $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$ and the prediction $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$.
Recall that we learn a neural network that predicts the mean and diagonal variance of the Gaussian distribution of the reverse process. Ho et al. decided to keep the predicted variances fixed to time-dependent constants, because they found that learning them leads to unstable training and poorer sample quality. They set $\boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t) = \sigma_t^2 \mathbf{I}$, where $\sigma_t^2 = \beta_t$ or $\sigma_t^2 = \tilde{\beta}_t$ (both gave similar results).
Because $\mathbf{x}_t$ is available as input at training time, instead of predicting the mean directly, we can make the network predict the noise term $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$. We can then write the predicted mean as,

$$\boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}}\Big(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\Big)$$
and the predicted denoised sample can be written using the reparameterization trick as,

$$\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\Big(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\Big) + \sigma_t \mathbf{z}$$

where $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ at each time step.
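One reverse step can then be sketched as follows (scalar toy version), where `eps_pred` is a stand-in for the network output $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ and we take $\sigma_t^2 = \beta_t$:

```python
import math, random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
alpha_bars, prod = [], 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bars.append(prod)

# One reverse (denoising) step: compute the predicted mean, then add
# sigma_t * z, except at the last step where no noise is added.
def p_sample_step(x_t, t, eps_pred):
    alpha_t = 1.0 - betas[t]
    mean = (x_t - betas[t] / math.sqrt(1.0 - alpha_bars[t]) * eps_pred) / math.sqrt(alpha_t)
    if t == 0:
        return mean
    return mean + math.sqrt(betas[t]) * random.gauss(0.0, 1.0)
```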
Thus, the network predicts only the noise term $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ at each time step $t$. Let's simplify the term $L_{t-1}$, given that the variances of the two Gaussians are equal. The KL-divergence between two Gaussians $\mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)$ and $\mathcal{N}(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)$ is given as,

$$D_{\mathrm{KL}} = \frac{1}{2}\left[\log \frac{|\boldsymbol{\Sigma}_2|}{|\boldsymbol{\Sigma}_1|} - k + \operatorname{tr}\big(\boldsymbol{\Sigma}_2^{-1}\boldsymbol{\Sigma}_1\big) + (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^\top \boldsymbol{\Sigma}_2^{-1} (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)\right]$$

where $k$ is the number of dimensions.
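For the univariate case ($k = 1$), this closed form is easy to write down and check; with equal variances it collapses to a scaled squared difference of the means:

```python
import math

# KL divergence between two univariate Gaussians N(mu1, var1) and N(mu2, var2),
# a direct instance of the closed-form expression above with k = 1.
def kl_gauss(mu1, var1, mu2, var2):
    return 0.5 * (math.log(var2 / var1) - 1.0
                  + (var1 + (mu1 - mu2) ** 2) / var2)
```

When `var1 == var2 == var`, this reduces to `(mu1 - mu2) ** 2 / (2 * var)`, which is why the loss becomes an L2 distance between the means.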
The objective then reduces to a weighted L2-loss between the two noise terms, and the second term of the loss function becomes,

$$L_{t-1} = \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}}\left[\frac{\beta_t^2}{2\sigma_t^2\,\alpha_t\,(1-\bar{\alpha}_t)}\,\big\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\big(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},\, t\big)\big\|^2\right]$$
Empirically, Ho et al. found that training the diffusion model works better with a simplified objective that ignores the weighting term in $L_{t-1}$. They also got rid of the $L_0$ term by altering the sampling method, such that at the end of sampling ($t = 1$) no noise is added and we directly obtain $\mathbf{x}_0$.
The simplified loss function for DDPM is given as,

$$L_{\text{simple}} = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}}\Big[\big\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\big(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},\, t\big)\big\|^2\Big]$$

where $t$ is sampled uniformly from $\{1, \ldots, T\}$.
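One training step of this simplified objective can be sketched as follows (scalar toy version; `model` is a placeholder standing in for the U-Net noise predictor):

```python
import math, random

random.seed(0)
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
alpha_bars, prod = [], 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bars.append(prod)

def model(x_t, t):
    return 0.0  # placeholder for eps_theta(x_t, t)

x0 = 0.3
t = random.randrange(T)           # 1. sample a timestep uniformly
eps = random.gauss(0.0, 1.0)      # 2. sample Gaussian noise
ab = alpha_bars[t]
x_t = math.sqrt(ab) * x0 + math.sqrt(1.0 - ab) * eps  # 3. corrupt x0 in one go
loss = (eps - model(x_t, t)) ** 2                     # 4. L_simple for this sample
```

In a real implementation, `loss` would be averaged over a batch and backpropagated through the network.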
The training and sampling algorithms in the DDPM paper (Ho et al.)
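The sampling algorithm can be sketched in the same toy scalar setting (the noise predictor is again a placeholder; a trained network would take its place):

```python
import math, random

random.seed(0)
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
alpha_bars, prod = [], 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bars.append(prod)

# The full sampling loop: start from pure noise, denoise step by step,
# and skip the added noise at the final step so the loop ends at x_0.
def sample(model):
    x = random.gauss(0.0, 1.0)  # x_T ~ N(0, 1)
    for t in reversed(range(T)):
        alpha_t = 1.0 - betas[t]
        x = (x - betas[t] / math.sqrt(1.0 - alpha_bars[t]) * model(x, t)) / math.sqrt(alpha_t)
        if t > 0:
            x += math.sqrt(betas[t]) * random.gauss(0.0, 1.0)
    return x

x0 = sample(lambda x_t, t: 0.0)
```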
Credits: While writing this blog, I have referred to Lilian Weng's blog post, AI Summer's post, and Luo et al. (2022).