Generative Modeling with Variational Autoencoders
Till now, I have talked about supervised learning, where our model is trained to learn a mapping function from data $x$ to labels $y$.
The other side of machine learning is called unsupervised learning, where we just have data $x$ without any labels and the goal is to learn some underlying hidden structure of the data.
Autoencoders
An autoencoder is a bottleneck-structured network that learns to extract features from unlabeled data. It consists of two parts:
- Encoder: network that compresses the input into a latent-space representation (of a lower dimension)
- Decoder: network that reconstructs the output from this representation
Since we want the autoencoder to learn to reconstruct the input data, the loss objective is a simple L2-loss between the input and the reconstructed output
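Written out, with $\hat{x}$ denoting the reconstruction of the input $x$:

$$\mathcal{L}_{\text{AE}}(x, \hat{x}) = \| x - \hat{x} \|_2^2$$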
This approach is used for learning a lower-dimensional feature representation from the training data that captures meaningful factors of variation in it. After training, we get rid of the decoder and use the encoder to obtain features that contain useful information about the input that can be used for downstream tasks.
Autoencoders are generally used for dimensionality reduction (or data compression) and to initialize a supervised model. Since the model is not probabilistic, we cannot sample new data from it, and hence new images cannot be generated.
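To make the structure concrete, here is a minimal autoencoder sketch in PyTorch. The fully connected layers and the 784 → 128 → 32 sizes are illustrative assumptions, not something prescribed above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: compresses the input into a lower-dimensional latent code
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # Decoder: reconstructs the input from the latent code
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)      # latent features, usable for downstream tasks
        return self.decoder(z)   # reconstruction of the input

model = Autoencoder()
x = torch.randn(16, 784)         # a dummy batch of flattened images
x_hat = model(x)
loss = F.mse_loss(x_hat, x)      # L2 reconstruction loss
loss.backward()
```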
Variational Autoencoders (VAEs)
VAEs, introduced in the Auto-Encoding Variational Bayes paper, are a probabilistic version of the autoencoder that lets you sample from the model to generate new data. Instead of mapping the input into a fixed vector, we want to map it into a distribution $p_\theta$, parameterized by $\theta$. The relevant distributions are:
- Prior $p_\theta(z)$: probability distribution over the latent variables (fixed and often assumed to be a simple Gaussian)
- Likelihood $p_\theta(x \mid z)$: probability distribution over images conditioned on the latent variables
- Posterior $p_\theta(z \mid x)$: probability distribution over latent variables given the input data
Assuming that we know the real parameter $\theta^*$ for this distribution, we can generate new data as follows (a code sketch follows the list):

- First, sample a latent variable $z^{(i)}$ from the true prior distribution $p_{\theta^*}(z)$.
- Feed it into the decoder model to obtain the likelihood $p_{\theta^*}(x \mid z = z^{(i)})$.
- Then sample a new data point $x^{(i)}$ from this likelihood.
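Assuming we had access to such a decoder with the true parameters, the three steps above could be sketched as follows; the `decoder` module, the latent size of 32, and the unit-variance Gaussian output are all placeholder assumptions.

```python
import torch

# Placeholder for a trained decoder that maps a latent vector to the mean of p(x|z);
# a real decoder would be a deeper network
decoder = torch.nn.Linear(32, 784)

# 1. Sample a latent variable z from the prior N(0, I)
z = torch.randn(1, 32)

# 2. Decode it to get the mean of the Gaussian p(x|z)
x_mean = decoder(z)

# 3. Sample a new data point from p(x|z), here assuming unit variance per pixel
x_new = x_mean + torch.randn_like(x_mean)
```

In practice, the mean `x_mean` itself is usually taken as the generated image rather than a noisy sample from it.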
As you might be wondering, a neural network cannot output a probability distribution directly; instead, the decoder predicts the mean $\mu_{x \mid z}$ and diagonal covariance $\Sigma_{x \mid z}$ of a Gaussian over the image pixels, $p_\theta(x \mid z) = \mathcal{N}(\mu_{x \mid z}, \Sigma_{x \mid z})$.
There is an inherent assumption here that the pixels of the generated image are conditionally independent given the latent variable and this is the reason why VAEs tend to generate blurry images.
In order to estimate the true parameter $\theta^*$, we train the model by maximizing the likelihood of the training data, $p_\theta(x)$.
Before moving further, let's see how we can obtain the distribution of the data $p_\theta(x)$:
- Marginalize out the latent variable:
$$p_\theta(x) = \int p_\theta(x \mid z)\, p_\theta(z)\, dz \tag{1}$$
The terms in the integral, the prior (fixed) and the likelihood (estimated by the decoder), are straightforward to compute. However, it is intractable to integrate over all $z$, as it is too expensive to enumerate all possible values of $z$ and sum them up.
- Another idea is to use Bayes' rule:
$$p_\theta(x) = \frac{p_\theta(x \mid z)\, p_\theta(z)}{p_\theta(z \mid x)} \tag{2}$$
The numerator has the same terms as above. The denominator has the posterior term $p_\theta(z \mid x)$, which is intractable but can be approximated by another neural network (the encoder) as $q_\phi(z \mid x) \approx p_\theta(z \mid x)$. Again, we assume that it follows a Gaussian distribution, $q_\phi(z \mid x) = \mathcal{N}(\mu_{z \mid x}, \Sigma_{z \mid x})$, and the encoder predicts the mean $\mu_{z \mid x}$ and diagonal covariance $\Sigma_{z \mid x}$.

The encoder, therefore, outputs a distribution over the latent variables given the input data. The full forward pass of a VAE now looks like:
- Input a real data point $x^{(i)}$ to the encoder to obtain the approximate posterior $q_\phi(z \mid x^{(i)})$.
- Sample a latent variable $z^{(i)}$ from this distribution.
- Feed it into the decoder to obtain the likelihood $p_\theta(x \mid z = z^{(i)})$.
- Then sample a reconstructed data point $\hat{x}^{(i)}$ from this likelihood.
Sampling is a stochastic process, and therefore we cannot backpropagate the gradient through it. To make it trainable, the reparameterization trick is introduced:
$$z = \mu_{z \mid x} + \sigma_{z \mid x} \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$
where the randomness is moved into the auxiliary noise $\epsilon$, so gradients can flow through $\mu_{z \mid x}$ and $\sigma_{z \mid x}$.
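A minimal sketch of an encoder that predicts the mean and log-variance of $q_\phi(z \mid x)$, together with the reparameterization trick; the layer sizes and names such as `fc_mu` and `fc_logvar` are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VAEEncoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        # The encoder predicts the mean and the log of the diagonal variance of q_phi(z|x)
        self.fc_mu = nn.Linear(256, latent_dim)
        self.fc_logvar = nn.Linear(256, latent_dim)

    def forward(self, x):
        h = self.backbone(x)
        return self.fc_mu(h), self.fc_logvar(h)

def reparameterize(mu, logvar):
    # z = mu + sigma * eps with eps ~ N(0, I): the randomness lives in eps,
    # so gradients can flow through mu and sigma back into the encoder
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps

encoder = VAEEncoder()
x = torch.randn(16, 784)
mu, logvar = encoder(x)
z = reparameterize(mu, logvar)   # differentiable sample from q_phi(z|x)
```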
Deriving ELBO Loss
In order to obtain the loss objective of a VAE, we further simplify Equation 2: take the log, multiply and divide the right-hand side by the variational distribution $q_\phi(z \mid x)$, and take the expectation over $z \sim q_\phi(z \mid x)$. Dropping the (always non-negative) KL term between $q_\phi(z \mid x)$ and the true posterior gives the Evidence Lower Bound (ELBO):
$$\log p_\theta(x) \geq \mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z)\big)$$
The first term is the data reconstruction term that measures the reconstruction likelihood of the decoder from our variational distribution. It ensures that the learned distribution is modeling effective latents that the original data can be regenerated from.
The second term measures how similar the learned variational distribution is to the prior belief held over the latent variables. Since both distributions are Gaussians, the KL-divergence can be computed in closed form.
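For a diagonal Gaussian $q_\phi(z \mid x) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$ and a standard Gaussian prior $p_\theta(z) = \mathcal{N}(0, I)$ over a $d$-dimensional latent space, the closed form is:

$$D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z)\big) = \frac{1}{2} \sum_{j=1}^{d} \left( \sigma_j^2 + \mu_j^2 - 1 - \log \sigma_j^2 \right)$$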
This gives us a lower bound on the log-likelihood of the data and we jointly train both the encoder and decoder to maximize it.
The loss function (that we minimize) is the negative ELBO:
$$\mathcal{L}(\theta, \phi) = -\,\mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] + D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z)\big)$$
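As a rough sketch of how this loss is often implemented, assuming a Gaussian decoder with fixed variance (so the reconstruction term reduces, up to a constant, to an L2 loss) and a sum-then-batch-average reduction:

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon_mean, mu, logvar):
    # Reconstruction term: -E_q[log p(x|z)]; with a fixed-variance Gaussian decoder
    # this is, up to a constant, the L2 loss between x and the decoder mean
    recon = F.mse_loss(x_recon_mean, x, reduction="sum")

    # KL(q_phi(z|x) || p(z)) in closed form, with p(z) = N(0, I):
    # 0.5 * sum(exp(logvar) + mu^2 - 1 - logvar)
    kl = 0.5 * torch.sum(torch.exp(logvar) + mu ** 2 - 1.0 - logvar)

    return (recon + kl) / x.shape[0]   # average over the batch

# usage with dummy tensors standing in for the encoder/decoder outputs
x = torch.rand(16, 784)
x_recon_mean = torch.rand(16, 784, requires_grad=True)
mu = torch.zeros(16, 32, requires_grad=True)
logvar = torch.zeros(16, 32, requires_grad=True)
vae_loss(x, x_recon_mean, mu, logvar).backward()
```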