Generative Adversarial Networks: A Two-Player Game
Introduced in 2014 by Goodfellow et al., Generative Adversarial Networks (GANs) revolutionized the field of generative modeling. They proposed a new framework that generates highly realistic synthetic data by training through a minimax two-player game.
With GANs, we don't explicitly learn the distribution of the data $p_{\text{data}}$, but we can still sample from it. Like VAEs, GANs also have two networks: a Generator and a Discriminator that are trained simultaneously.
A latent variable is sampled from a prior $\mathbf{z} \sim p(\mathbf{z})$ and passed through the Generator to obtain a fake sample $\mathbf{x} = G(\mathbf{z})$. The Discriminator then performs a binary classification task: it takes in both real and generated samples and classifies each as real ($1$) or fake ($0$). The two networks are trained together and, by competing with each other, make each other stronger.
To summarize,
A Discriminator $D$ estimates the probability of a given sample coming from the real dataset. It acts as a critic and is optimized to tell fake samples apart from real ones.
A Generator $G$ outputs synthetic samples given a noise variable input $\mathbf{z}$. It is trained to capture the real data distribution (we want $p_G \approx p_{\text{data}}$) so that its generated samples look as real as possible; in other words, so that they trick the Discriminator into outputting high probabilities. A minimal sketch of both networks is shown below.
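To make this concrete, here is a minimal PyTorch sketch of the two networks. The MLP architecture, layer widths, and the sizes `latent_dim` and `data_dim` are illustrative assumptions, not the paper's prescription:

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 100, 784  # illustrative sizes (e.g. flattened 28x28 images)

# Generator: maps a latent noise vector z to a synthetic sample x = G(z)
G = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, data_dim), nn.Tanh(),  # outputs scaled to (-1, 1)
)

# Discriminator: maps a sample x to the probability that it is real
D = nn.Sequential(
    nn.Linear(data_dim, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),  # D(x) in (0, 1)
)

z = torch.randn(64, latent_dim)  # z ~ p(z), here a standard Gaussian prior
x_fake = G(z)                    # fake samples x = G(z)
p_real = D(x_fake)               # D's estimated probability that each sample is real
```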
Training Objective
We want to make sure that the Discriminator's decision over the real data is accurate, i.e. $D(\mathbf{x}) = 1$. This can be achieved by maximizing the log-likelihood over the data. \begin{equation*} \max_{\color{blue}{D}} \; \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}} \left[ \log {\color{blue}{D}}(\mathbf{x}) \right] \tag{1} \end{equation*}
The Discriminator should also assign low probabilities to the fake samples generated by the Generator, i.e. $D(G(\mathbf{z})) = 0$. In other words, minimize the log-likelihood of the fake data. \begin{align*} \min_{\color{blue}{D}} \; \mathbb{E}_{{\color{green}{\mathbf{z}}} \sim p({\color{green}{\mathbf{z}}})} \left[ \log {\color{blue}{D}}({\color{orange}{G}}({\color{green}{\mathbf{z}}})) \right] \tag{2.1} \end{align*} which can be re-written as, \begin{align*} & \max_{\color{blue}{D}} \; \mathbb{E}_{{\color{green}{\mathbf{z}}} \sim p({\color{green}{\mathbf{z}}})} \left[ \log(1 - {\color{blue}{D}}({\color{orange}{G}}({\color{green}{\mathbf{z}}}))) \right] \tag{2.2} \end{align*}
On the other hand, the Generator wants to fool the Discriminator into classifying fake samples as real, i.e. $D(G(\mathbf{z})) = 1$. This can be obtained by maximizing the log-likelihood over the fake data. \begin{align*} \max_{\color{orange}{G}} \; \mathbb{E}_{{\color{green}{\mathbf{z}}} \sim p({\color{green}{\mathbf{z}}})} \left[ \log {\color{blue}{D}}({\color{orange}{G}}({\color{green}{\mathbf{z}}})) \right] \tag{3.1} \end{align*} which can be re-written as, \begin{align*} \min_{\color{orange}{G}} \; \mathbb{E}_{{\color{green}{\mathbf{z}}} \sim p({\color{green}{\mathbf{z}}})} \left[ \log(1 - {\color{blue}{D}}({\color{orange}{G}}({\color{green}{\mathbf{z}}}))) \right] \tag{3.2} \end{align*}
The full objective is given by,
\begin{align*} \min_{\color{orange}{G}} \, \max_{\color{blue}{D}} \; \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}} \left[ \log {\color{blue}{D}}(\mathbf{x}) \right] \; + \; \mathbb{E}_{{\color{green}{\mathbf{z}}} \sim p({\color{green}{\mathbf{z}}})} \left[ \log(1 - {\color{blue}{D}}({\color{orange}{G}}({\color{green}{\mathbf{z}}}))) \right] \tag{4} \end{align*}
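In practice, the two expectations in Equation 4 are estimated with minibatches. Below is a minimal sketch of that estimate, reusing `G`, `D`, `latent_dim`, and `data_dim` from the sketch above; the `real` batch here is a random stand-in for an actual batch of training data:

```python
import torch

# Stand-in minibatch (illustrative): `real` plays the role of x ~ p_data
real = torch.rand(64, data_dim) * 2 - 1  # hypothetical data batch in (-1, 1)
fake = G(torch.randn(64, latent_dim))    # x = G(z), with z ~ p(z)

# Monte-Carlo estimate of the value function in Equation 4
eps = 1e-7  # numerical safety inside the logs
V = torch.log(D(real) + eps).mean() + torch.log(1 - D(fake) + eps).mean()

# The Discriminator does gradient ascent on V; the Generator, gradient descent.
```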
The objective of training GANs is therefore a minimax game: the Generator $G$ tries hard to trick the Discriminator, while the critic $D$ tries hard not to be fooled. In this process, both networks push each other to improve. They are trained together with alternating gradient updates (gradient ascent for the Discriminator, gradient descent for the Generator).
However, at the start of training the Generator is very weak, so $D(G(\mathbf{z}))$ is close to zero. This creates a vanishing-gradient problem for the Generator, since $\log(1 - D(G(\mathbf{z})))$ is nearly flat in that region. Therefore, instead of $\min_{\color{orange}{G}} \; \mathbb{E}_{{\color{green}{\mathbf{z}}} \sim p({\color{green}{\mathbf{z}}})} \left[ \log(1 - {\color{blue}{D}}({\color{orange}{G}}({\color{green}{\mathbf{z}}}))) \right]$ (Equation 3.2), we use the non-saturating objective $\min_{\color{orange}{G}} \; \mathbb{E}_{{\color{green}{\mathbf{z}}} \sim p({\color{green}{\mathbf{z}}})} \left[ -\log {\color{blue}{D}}({\color{orange}{G}}({\color{green}{\mathbf{z}}})) \right]$, or equivalently $\max_{\color{orange}{G}} \; \mathbb{E}_{{\color{green}{\mathbf{z}}} \sim p({\color{green}{\mathbf{z}}})} \left[ \log {\color{blue}{D}}({\color{orange}{G}}({\color{green}{\mathbf{z}}})) \right]$ (Equation 3.1), for the Generator.
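A quick autograd check illustrates the difference between the two Generator losses; the probe values for $D(G(\mathbf{z}))$ are arbitrary:

```python
import torch

# Compare gradients of the two Generator losses wrt D(G(z)), especially
# when the Discriminator confidently rejects fakes, i.e. D(G(z)) near 0.
for p in (0.001, 0.5, 0.999):
    d_out = torch.tensor(p, requires_grad=True)
    g_sat = torch.autograd.grad(torch.log(1 - d_out), d_out)[0]  # saturating: log(1 - D(G(z)))
    d_out = torch.tensor(p, requires_grad=True)
    g_ns = torch.autograd.grad(-torch.log(d_out), d_out)[0]      # non-saturating: -log D(G(z))
    print(f"D(G(z))={p}: saturating grad={g_sat.item():.1f}, "
          f"non-saturating grad={g_ns.item():.1f}")

# Near D(G(z)) = 0 the saturating loss has slope ~ -1, while the
# non-saturating loss has slope ~ -1/D(G(z)), a far stronger signal.
```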
Training Algorithm
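Here is a minimal sketch of the alternating updates, reusing `G`, `D`, and `latent_dim` from above. The `BCELoss` formulation is equivalent to the objectives derived earlier; the Adam optimizer, learning rate, and `dataloader` are common practical choices assumed here, not the paper's exact settings:

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()  # bce(p, 1) = -log p,  bce(p, 0) = -log(1 - p)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)

for real in dataloader:  # each `real` is a minibatch from p_data
    b = real.size(0)
    ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)

    # Discriminator step: ascend Equation 4 (descend its negation)
    fake = G(torch.randn(b, latent_dim)).detach()  # stop gradients into G
    d_loss = bce(D(real), ones) + bce(D(fake), zeros)
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator step: non-saturating objective, max log D(G(z))
    fake = G(torch.randn(b, latent_dim))
    g_loss = bce(D(fake), ones)  # = -log D(G(z))
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
```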
Optimality
Let's rewrite Equation 4 with a change of variables (ignoring the non-saturating Generator objective for now), \begin{align*} & \min_{\color{orange}{G}} \, \max_{\color{blue}{D}} \; \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}} \left[ \log {\color{blue}{D}}(\mathbf{x}) \right] \; + \; \mathbb{E}_{{\color{green}{\mathbf{z}}} \sim p({\color{green}{\mathbf{z}}})} \left[ \log(1 - {\color{blue}{D}}({\color{orange}{G}}({\color{green}{\mathbf{z}}}))) \right] \\ &= \min_{\color{orange}{G}} \, \max_{\color{blue}{D}} \; \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}} \left[ \log {\color{blue}{D}}(\mathbf{x}) \right] \; + \; \mathbb{E}_{\mathbf{x} \sim p_{\color{orange}{G}}} \left[ \log(1 - {\color{blue}{D}}(\mathbf{x})) \right] \\ &= \min_{\color{orange}{G}} \, \int_{\mathbf{x}} \max_{\color{blue}{D}} \left( p_{\text{data}}(\mathbf{x}) \, \log {\color{blue}{D}}(\mathbf{x}) \; + \; p_{\color{orange}{G}}(\mathbf{x}) \, \log(1 - {\color{blue}{D}}(\mathbf{x})) \right) d\mathbf{x} \end{align*} where the max moves inside the integral because the integrand can be maximized pointwise for each $\mathbf{x}$. For the inner max, we take the derivative wrt ${\color{blue}{D}}(\mathbf{x})$ and set it to zero, \begin{align*} \frac{p_{\text{data}}(\mathbf{x})}{{\color{blue}{D}}(\mathbf{x})} - \frac{p_{\color{orange}{G}}(\mathbf{x})}{1 - {\color{blue}{D}}(\mathbf{x})} = 0 \; \Rightarrow \; {\color{blue}{D}}^*(\mathbf{x}) = \frac{p_{\text{data}}(\mathbf{x})}{p_{\text{data}}(\mathbf{x}) + p_{\color{orange}{G}}(\mathbf{x})} \quad \text{[Optimal Discriminator]} \end{align*} Substituting ${\color{blue}{D}}^*$ back, \begin{align*} &= \min_{\color{orange}{G}} \, \int_{\mathbf{x}} \left( p_{\text{data}}(\mathbf{x}) \, \log \frac{p_{\text{data}}(\mathbf{x})}{p_{\text{data}}(\mathbf{x}) + p_{\color{orange}{G}}(\mathbf{x})} \; + \; p_{\color{orange}{G}}(\mathbf{x}) \, \log \frac{p_{\color{orange}{G}}(\mathbf{x})}{p_{\text{data}}(\mathbf{x}) + p_{\color{orange}{G}}(\mathbf{x})} \right) d\mathbf{x} \\ &= \min_{\color{orange}{G}} \, \left( \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}} \left[ \log \frac{p_{\text{data}}(\mathbf{x})}{p_{\text{data}}(\mathbf{x}) + p_{\color{orange}{G}}(\mathbf{x})} \right] \; + \; \mathbb{E}_{\mathbf{x} \sim p_{\color{orange}{G}}} \left[ \log \frac{p_{\color{orange}{G}}(\mathbf{x})}{p_{\text{data}}(\mathbf{x}) + p_{\color{orange}{G}}(\mathbf{x})} \right] \right) \end{align*} Multiplying and dividing each ratio by $2$ (which subtracts $\log 4$ overall), \begin{align*} &= \min_{\color{orange}{G}} \, \left( \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}} \left[ \log \frac{2 \, p_{\text{data}}(\mathbf{x})}{p_{\text{data}}(\mathbf{x}) + p_{\color{orange}{G}}(\mathbf{x})} \right] \; + \; \mathbb{E}_{\mathbf{x} \sim p_{\color{orange}{G}}} \left[ \log \frac{2 \, p_{\color{orange}{G}}(\mathbf{x})}{p_{\text{data}}(\mathbf{x}) + p_{\color{orange}{G}}(\mathbf{x})} \right] - \log 4 \right) \end{align*} From the definition of the KL divergence, $\mathbf{D}_{KL}(p \,||\, q) = \mathbb{E}_{x \sim p} \left[ \log \frac{p(x)}{q(x)} \right]$, \begin{align*} &= \min_{\color{orange}{G}} \, \left( \mathbf{D}_{KL} \left[ p_{\text{data}} \,||\, \frac{p_{\text{data}} + p_{\color{orange}{G}}}{2} \right] + \mathbf{D}_{KL} \left[ p_{\color{orange}{G}} \,||\, \frac{p_{\text{data}} + p_{\color{orange}{G}}}{2} \right] - \log 4 \right) \end{align*} and from the definition of the Jensen-Shannon divergence, $\mathbf{D}_{JS}(p \,||\, q) = \frac{1}{2} \mathbf{D}_{KL}(p \,||\, \frac{p + q}{2}) + \frac{1}{2} \mathbf{D}_{KL}(q \,||\, \frac{p + q}{2})$, \begin{align*} &= \min_{\color{orange}{G}} \, \left( 2 \, \mathbf{D}_{JS}(p_{\text{data}} \,||\, p_{\color{orange}{G}}) - \log 4 \right) \end{align*}
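As a sanity check of this result, the following snippet evaluates the inner objective at the optimal discriminator $D^*$ for two toy discrete distributions (the probability values are arbitrary) and confirms it equals $2 \, \mathbf{D}_{JS}(p_{\text{data}} \,||\, p_{G}) - \log 4$:

```python
import math

# Two toy discrete distributions over three outcomes (arbitrary values)
p_data = [0.5, 0.3, 0.2]
p_g    = [0.2, 0.3, 0.5]

# Inner objective evaluated at the optimal discriminator
# D*(x) = p_data(x) / (p_data(x) + p_g(x))
value = sum(
    pd * math.log(pd / (pd + pg)) + pg * math.log(pg / (pd + pg))
    for pd, pg in zip(p_data, p_g)
)

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# 2 * D_JS(p_data || p_g) - log 4, from the definitions above
m = [(pd + pg) / 2 for pd, pg in zip(p_data, p_g)]
jsd = 0.5 * kl(p_data, m) + 0.5 * kl(p_g, m)
print(value, 2 * jsd - math.log(4))  # both print the same number

# When p_g == p_data, D* = 1/2 everywhere and the value is exactly -log 4.
```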
The JS divergence is always non-negative and is zero only when the two distributions are equal. Hence, training a GAN with this objective minimizes the JS divergence between $p_{\text{data}}$ and $p_{G}$, and the global minimum, with value $-\log 4$, is attained when $p_{\text{data}} = p_{G}$.
The global minimum of the minimax game happens when, \begin{align*} \text{Optimal Generator: } & p_{\text{data}} = p_{G} \\ \text{Optimal Discriminator: } & D_{G}^*(\mathbf{x}) = \frac{p_{\text{data}}(\mathbf{x})}{p_{\text{data}}(\mathbf{x}) + p_G(\mathbf{x})} = \frac{1}{2} \end{align*}
Some prominent follow-up papers:
- Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks (DCGAN): Architectures of Generator and Discriminator created with CNNs.
- Wasserstein GAN (WGAN): Improves GAN training stability by replacing the JS divergence with the Wasserstein distance.
- Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks (CycleGAN): For image-to-image translation between unpaired domains.