+Generative Adversarial Networks
+Generative adversarial networks (GANs) have shown great results in
+many generative tasks that replicate rich real-world content such as
+images, human language, and music. The approach is inspired by game theory: two
+models, a generator and a discriminator, compete with each other while
+making each other stronger at the same time. However, it is rather
+challenging to train a GAN model; training often suffers from
+instability or failure to converge.
+
+
+Generative adversarial networks consist of two models (in their simplest form, two opposing feed-forward neural networks):
+
+- A discriminator \( D \) estimates the probability of a given sample coming from the real dataset. It works as a critic and is optimized to tell the fake samples from the real ones.
+- A generator \( G \) outputs synthetic samples given a noise variable input \( z \) (\( z \) brings in potential output diversity). It is trained to capture the real data distribution in order to generate samples that are as realistic as possible, or in other words, that trick the discriminator into outputting a high probability.
+
+
+
+At the end of the training, the generator can be used to generate, for
+example, new images. In this sense we have trained a model which can
+produce new samples. We say that we have implicitly defined a
+probability distribution.
+
+
+These two models compete against each other during the training
+process: the generator \( G \) is trying hard to trick the discriminator,
+while the critic model \( D \) is trying hard not to be fooled. This
+zero-sum game between the two models motivates both to improve
+their performance.
+
+
+
+
+
+
+
+
+On one hand, we want to make sure the discriminator \( D \)'s decisions
+over real data are accurate by maximizing \( \mathbb{E}_{x \sim
+p_{r}(x)} [\log D(x)] \). Meanwhile, given a fake sample \( G(z), z \sim
+p_z(z) \), the discriminator is expected to output a probability,
+\( D(G(z)) \), close to zero by maximizing \( \mathbb{E}_{z \sim p_{z}(z)}
+[\log (1 - D(G(z)))] \).
+
+
+On the other hand, the generator is trained to increase the chances of
+\( D \) producing a high probability for a fake example, thus to minimize
+\( \mathbb{E}_{z \sim p_{z}(z)} [\log (1 - D(G(z)))] \).
+
+
+When combining both aspects together, \( D \) and \( G \) are playing a \textit{minimax game} in which we should optimize the following loss function:
+
+
+$$
+\begin{aligned}
+\min_G \max_D L(D, G)
+& = \mathbb{E}_{x \sim p_{r}(x)} [\log D(x)] + \mathbb{E}_{z \sim p_z(z)} [\log(1 - D(G(z)))] \\
+& = \mathbb{E}_{x \sim p_{r}(x)} [\log D(x)] + \mathbb{E}_{x \sim p_g(x)} [\log(1 - D(x))]
+\end{aligned}
+$$
+
+
+
+where \( \mathbb{E}_{x \sim p_{r}(x)} [\log D(x)] \) has no impact on \( G \) during gradient descent updates.
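+
+As a concrete illustration of this objective, here is a minimal sketch of
+one alternating update, assuming PyTorch and toy one-dimensional data. The
+tiny networks, learning rate, and batch size are hypothetical placeholders,
+and the generator step uses the common non-saturating surrogate (maximize
+\( \log D(G(z)) \)) rather than the literal minimax term.
+
+import torch
+import torch.nn as nn
+
+G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
+D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
+opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
+opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
+bce = nn.BCELoss()  # -[y log D + (1-y) log(1-D)] reproduces the two terms above
+
+real = 2.0 + 0.5 * torch.randn(64, 1)  # toy samples from p_r
+z = torch.randn(64, 8)                 # noise z ~ p_z
+
+# Discriminator step: ascend E[log D(x)] + E[log(1 - D(G(z)))]
+fake = G(z).detach()                   # detach so this step does not update G
+loss_D = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
+opt_D.zero_grad(); loss_D.backward(); opt_D.step()
+
+# Generator step: non-saturating surrogate for minimizing E[log(1 - D(G(z)))]
+loss_G = bce(D(G(z)), torch.ones(64, 1))
+opt_G.zero_grad(); loss_G.backward(); opt_G.step()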
+
+
+
More references
diff --git a/doc/pub/week15/html/week15-solarized.html b/doc/pub/week15/html/week15-solarized.html
index 8c5fd574..37e3193f 100644
--- a/doc/pub/week15/html/week15-solarized.html
+++ b/doc/pub/week15/html/week15-solarized.html
@@ -104,6 +104,14 @@
None,
'explicit-expression-for-the-derivative'),
('Final expression', 2, None, 'final-expression'),
+ ('Kullback-Leibler divergence',
+ 2,
+ None,
+ 'kullback-leibler-divergence'),
+ ('Jensen-Shannon divergence',
+ 2,
+ None,
+ 'jensen-shannon-divergence'),
('Generative model, basic overview (Borrowed from Rashcka et '
'al)',
2,
@@ -163,6 +171,15 @@
2,
None,
'steps-in-building-a-gan-borrowed-from-rashcka-et-al'),
+ ('Generative Adversarial Networks',
+ 2,
+ None,
+ 'generative-adversarial-networks'),
+ ('Optimal value for $D$', 2, None, 'optimal-value-for-d'),
+ ('What does the Loss Function Represent?',
+ 2,
+ None,
+ 'what-does-the-loss-function-represent'),
('More references', 2, None, 'more-references'),
('Writing Our First Generative Adversarial Network',
2,
@@ -245,7 +262,7 @@ Plans for the week of April 2
- Summary of Variational Autoencoders
- Generative Adversarial Networks (GANs), see https://lilianweng.github.io/posts/2017-08-20-gan/ for nice overview
-- Start discussion of diffusion models
+- Start discussion of diffusion models, motivation
- Video of lecture
- Whiteboard notes
@@ -468,6 +485,48 @@ Final expression
sampling as the standard sampling rule.
+
+Kullback-Leibler divergence
+
+
+Before we continue, we need to remind ourselves about the
+Kullback-Leibler divergence introduced earlier. This will also allow
+us to introduce another measure used in connection with the training
+of Generative Adversarial Networks, the so-called Jensen-Shannon divergence.
+These metrics are useful for quantifying the similarity between two probability distributions.
+
+
+The Kullback–Leibler (KL) divergence, labeled \( D_{KL} \), measures how one probability distribution \( p \) diverges from a second expected probability distribution \( q \),
+that is
+
+$$
+D_{KL}(p \| q) = \int_x p(x) \log \frac{p(x)}{q(x)} dx.
+$$
+
+The KL divergence \( D_{KL} \) achieves its minimum of zero when \( p(x) = q(x) \) everywhere.
+
+Note that the KL divergence is asymmetric. In cases where \( p(x) \) is
+close to zero but \( q(x) \) is significantly non-zero, \( q \)'s effect
+is disregarded. This can give misleading results when we just want to
+measure the similarity between two equally important distributions.
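+
+As a quick numerical illustration of this asymmetry, the following sketch
+(plain NumPy, with two hypothetical two-state distributions) evaluates
+\( D_{KL} \) in both directions:
+
+import numpy as np
+
+def kl(p, q):
+    # Discrete KL divergence in nats; assumes strictly positive entries
+    return np.sum(p * np.log(p / q))
+
+p = np.array([0.9, 0.1])
+q = np.array([0.5, 0.5])
+print(kl(p, q))  # roughly 0.368
+print(kl(q, p))  # roughly 0.511: D_KL(p||q) != D_KL(q||p)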
+
+
+
+Jensen-Shannon divergence
+
+The Jensen–Shannon (JS) divergence is another measure of similarity between
+two probability distributions, bounded by \( [0, \log 2] \) (or \( [0, 1] \)
+if base-2 logarithms are used). The JS divergence is
+symmetric and smoother than the KL divergence.
+It is defined as
+
+$$
+D_{JS}(p \| q) = \frac{1}{2} D_{KL}(p \| \frac{p + q}{2}) + \frac{1}{2} D_{KL}(q \| \frac{p + q}{2})
+$$
+
+Many practitioners believe that one reason behind GANs' big success is
+switching the loss function from the asymmetric KL divergence of the
+traditional maximum-likelihood approach to the symmetric JS divergence.
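+
+Building on the same toy setup, the JS divergence can be assembled from two
+KL terms against the mixture \( (p+q)/2 \); the sketch below (NumPy,
+hypothetical distributions) exhibits the symmetry directly:
+
+import numpy as np
+
+def kl(p, q):
+    return np.sum(p * np.log(p / q))
+
+def js(p, q):
+    m = 0.5 * (p + q)          # mixture distribution
+    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
+
+p = np.array([0.9, 0.1])
+q = np.array([0.5, 0.5])
+print(js(p, q), js(q, p))      # identical values: JS is symmetric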
+
+
Generative model, basic overview (Borrowed from Rashcka et al)
@@ -560,14 +619,14 @@ The derivation from last week
$$
\begin{align*}
-\log p(\boldsymbol{x}) & = \log p(\boldsymbol{x}) \int q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})dz && \text{(Multiply by $1 = \int q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})d\boldsymbol{h}$)}\\
- & = \int q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})(\log p(\boldsymbol{x}))dz && \text{(Bring evidence into integral)}\\
+\log p(\boldsymbol{x}) & = \log p(\boldsymbol{x}) \int q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})dh && \text{(Multiply by $1 = \int q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})d\boldsymbol{h}$)}\\
+ & = \int q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})(\log p(\boldsymbol{x}))dh && \text{(Bring evidence into integral)}\\
& = \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log p(\boldsymbol{x})\right] && \text{(Definition of Expectation)}\\
& = \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log\frac{p(\boldsymbol{x}, \boldsymbol{h})}{p(\boldsymbol{h}|\boldsymbol{x})}\right]&& \\
& = \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log\frac{p(\boldsymbol{x}, \boldsymbol{h})q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}{p(\boldsymbol{h}|\boldsymbol{x})q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\right]&& \text{(Multiply by $1 = \frac{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}$)}\\
& = \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log\frac{p(\boldsymbol{x}, \boldsymbol{h})}{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\right] + \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log\frac{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}{p(\boldsymbol{h}|\boldsymbol{x})}\right] && \text{(Split the Expectation)}\\
& = \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log\frac{p(\boldsymbol{x}, \boldsymbol{h})}{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\right] +
- KL(q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})\vert\vert p(\boldsymbol{h}|\boldsymbol{x})) && \text{(Definition of KL Divergence)}\\
+ D_{KL}(q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})\vert\vert p(\boldsymbol{h}|\boldsymbol{x})) && \text{(Definition of KL Divergence)}\\
& \geq \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log\frac{p(\boldsymbol{x}, \boldsymbol{h})}{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\right] && \text{(KL Divergence always $\geq 0$)}
\end{align*}
$$
@@ -608,7 +667,7 @@ Dissecting the equations
{\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log\frac{p(\boldsymbol{x}, \boldsymbol{h})}{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\right]}
&= {\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log\frac{p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{h})p(\boldsymbol{h})}{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\right]} && {\text{(Chain Rule of Probability)}}\\
&= {\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{h})\right] + \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log\frac{p(\boldsymbol{h})}{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\right]} && {\text{(Split the Expectation)}}\\
-&= \underbrace{{\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{h})\right]}}_\text{reconstruction term} - \underbrace{{KL(q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\vert\vert{p(\boldsymbol{h}))}}_\text{prior matching term} && {\text{(Definition of KL Divergence)}}
+&= \underbrace{{\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{h})\right]}}_\text{reconstruction term} - \underbrace{{D_{KL}(q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\vert\vert{p(\boldsymbol{h}))}}_\text{prior matching term} && {\text{(Definition of KL Divergence)}}
\end{align*}
$$
@@ -657,7 +716,7 @@ Analytical evaluation
Then, the KL divergence term of the ELBO can be computed analytically, and the reconstruction term can be approximated using a Monte Carlo estimate. Our objective can then be rewritten as:
$$
\begin{align*}
- \mathrm{argmax}_{\boldsymbol{\phi}, \boldsymbol{\theta}} \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{h})\right] - KL(q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})\vert\vert p(\boldsymbol{h})) \approx \mathrm{argmax}_{\boldsymbol{\phi}, \boldsymbol{\theta}} \sum_{l=1}^{L}\log p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{h}^{(l)}) - KL(q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})\vert\vert p(\boldsymbol{h}))
+ \mathrm{argmax}_{\boldsymbol{\phi}, \boldsymbol{\theta}} \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{h})\right] - D_{KL}(q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})\vert\vert p(\boldsymbol{h})) \approx \mathrm{argmax}_{\boldsymbol{\phi}, \boldsymbol{\theta}} \sum_{l=1}^{L}\log p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{h}^{(l)}) - D_{KL}(q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})\vert\vert p(\boldsymbol{h}))
\end{align*}
$$
@@ -915,6 +974,140 @@ Steps in building a
+
+Generative Adversarial Networks
+
+Generative adversarial networks (GANs) have shown great results in
+many generative tasks that replicate rich real-world content such as
+images, human language, and music. The approach is inspired by game theory: two
+models, a generator and a discriminator, compete with each other while
+making each other stronger at the same time. However, it is rather
+challenging to train a GAN model; training often suffers from
+instability or failure to converge.
+
+
+Generative adversarial networks consist of two models (in their simplest form, two opposing feed-forward neural networks):
+
+- A discriminator \( D \) estimates the probability of a given sample coming from the real dataset. It works as a critic and is optimized to tell the fake samples from the real ones.
+- A generator \( G \) outputs synthetic samples given a noise variable input \( z \) (\( z \) brings in potential output diversity). It is trained to capture the real data distribution in order to generate samples that are as realistic as possible, or in other words, that trick the discriminator into outputting a high probability.
+
+At the end of the training, the generator can be used to generate, for
+example, new images. In this sense we have trained a model which can
+produce new samples. We say that we have implicitly defined a
+probability distribution.
+
+
+These two models compete against each other during the training
+process: the generator \( G \) is trying hard to trick the discriminator,
+while the critic model \( D \) is trying hard not to be fooled. This
+zero-sum game between the two models motivates both to improve
+their performance.
+
+
+
+
+
+
+
+
+On one hand, we want to make sure the discriminator \( D \)'s decisions
+over real data are accurate by maximizing \( \mathbb{E}_{x \sim
+p_{r}(x)} [\log D(x)] \). Meanwhile, given a fake sample \( G(z), z \sim
+p_z(z) \), the discriminator is expected to output a probability,
+\( D(G(z)) \), close to zero by maximizing \( \mathbb{E}_{z \sim p_{z}(z)}
+[\log (1 - D(G(z)))] \).
+
+
+On the other hand, the generator is trained to increase the chances of
+\( D \) producing a high probability for a fake example, thus to minimize
+\( \mathbb{E}_{z \sim p_{z}(z)} [\log (1 - D(G(z)))] \).
+
+
+When combining both aspects together, \( D \) and \( G \) are playing a \textit{minimax game} in which we should optimize the following loss function:
+
+$$
+\begin{aligned}
+\min_G \max_D L(D, G)
+& = \mathbb{E}_{x \sim p_{r}(x)} [\log D(x)] + \mathbb{E}_{z \sim p_z(z)} [\log(1 - D(G(z)))] \\
+& = \mathbb{E}_{x \sim p_{r}(x)} [\log D(x)] + \mathbb{E}_{x \sim p_g(x)} [\log(1 - D(x))]
+\end{aligned}
+$$
+
+where \( \mathbb{E}_{x \sim p_{r}(x)} [\log D(x)] \) has no impact on \( G \) during gradient descent updates.
+
+
+Optimal value for \( D \)
+
+Now we have a well-defined loss function. Let us first examine the optimal value for \( D \).
+
+$$
+L(G, D) = \int_x \bigg( p_{r}(x) \log(D(x)) + p_g (x) \log(1 - D(x)) \bigg) dx
+$$
+
+Since we are interested in the value of \( D(x) \) that maximizes \( L(G, D) \), let us label
+
+$$
+\tilde{x} = D(x),
+A=p_{r}(x),
+B=p_g(x)
+$$
+
+The integrand (which we can maximize pointwise for each \( x \), thereby maximizing the integral itself) is:
+
+$$
+\begin{align*}
+f(\tilde{x})
+& = A \log\tilde{x} + B \log(1-\tilde{x}) \\
+\frac{d f(\tilde{x})}{d \tilde{x}}
+& = \frac{A}{\tilde{x}} - \frac{B}{1-\tilde{x}} \\
+& = \frac{A - (A + B)\tilde{x}}{\tilde{x} (1 - \tilde{x})}
+\end{align*}
+$$
+
+Thus, setting \( \frac{d f(\tilde{x})}{d \tilde{x}} = 0 \), we get the best value of the discriminator: \( D^*(x) = \tilde{x}^* = \frac{A}{A + B} = \frac{p_{r}(x)}{p_{r}(x) + p_g(x)} \in [0, 1] \).
+Once the generator is trained to its optimum, \( p_g \) gets very close to \( p_{r} \). When \( p_g = p_{r} \), \( D^*(x) \) becomes \( 1/2 \).
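+
+The stationary point can also be checked numerically. The sketch below
+(NumPy, with arbitrary hypothetical values for \( A = p_{r}(x) \) and
+\( B = p_g(x) \)) scans \( f(\tilde{x}) \) on a grid and recovers
+\( \tilde{x}^* = A/(A+B) \):
+
+import numpy as np
+
+A, B = 0.7, 0.2                          # p_r(x), p_g(x) at some fixed x
+x = np.linspace(1e-4, 1 - 1e-4, 100001)
+f = A * np.log(x) + B * np.log(1 - x)
+print(x[np.argmax(f)])                   # close to 0.7778
+print(A / (A + B))                       # 0.7777..., the derived D^*(x)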
+
+
+When both \( G \) and \( D \) are at their optimal values, we have \( p_g = p_{r} \) and \( D^*(x) = 1/2 \) and the loss function becomes:
+
+$$
+\begin{align*}
+L(G, D^*)
+&= \int_x \bigg( p_{r}(x) \log(D^*(x)) + p_g (x) \log(1 - D^*(x)) \bigg) dx \\
+&= \log \frac{1}{2} \int_x p_{r}(x) dx + \log \frac{1}{2} \int_x p_g(x) dx \\
+&= -2\log2
+\end{align*}
+$$
+
+
+
+What does the Loss Function Represent?
+
+The JS divergence between \( p_{r} \) and \( p_g \) can be computed as:
+
+$$
+\begin{align*}
+D_{JS}(p_{r} \| p_g)
+=& \frac{1}{2} D_{KL}(p_{r} \| \frac{p_{r} + p_g}{2}) + \frac{1}{2} D_{KL}(p_{g} \| \frac{p_{r} + p_g}{2}) \\
+=& \frac{1}{2} \bigg( \log2 + \int_x p_{r}(x) \log \frac{p_{r}(x)}{p_{r}(x) + p_g(x)} dx \bigg) + \\& \frac{1}{2} \bigg( \log2 + \int_x p_g(x) \log \frac{p_g(x)}{p_{r}(x) + p_g(x)} dx \bigg) \\
+=& \frac{1}{2} \bigg( \log4 + L(G, D^*) \bigg)
+\end{align*}
+$$
+
+Thus,
+
+$$
+L(G, D^*) = 2D_{JS}(p_{r} \| p_g) - 2\log2
+$$
+
+Essentially, the loss function of a GAN quantifies the similarity between
+the generated data distribution \( p_g \) and the real sample
+distribution \( p_{r} \) by the JS divergence when the discriminator is
+optimal. The best \( G^* \) that replicates the real data distribution
+leads to the minimum \( L(G^*, D^*) = -2\log2 \), in agreement with the
+equations above.
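+
+This identity is easy to verify numerically; the sketch below uses two
+hypothetical discrete distributions, with sums playing the role of the
+integrals:
+
+import numpy as np
+
+p_r = np.array([0.7, 0.2, 0.1])
+p_g = np.array([0.3, 0.3, 0.4])
+d_star = p_r / (p_r + p_g)               # optimal discriminator D^*
+L = np.sum(p_r * np.log(d_star) + p_g * np.log(1 - d_star))
+m = 0.5 * (p_r + p_g)
+js = 0.5 * np.sum(p_r * np.log(p_r / m)) + 0.5 * np.sum(p_g * np.log(p_g / m))
+print(L, 2 * js - 2 * np.log(2))         # the two values agree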
+
+
More references
@@ -1630,6 +1823,7 @@ Diffusion learning
smooth target distribution, this method can capture data distributions
of arbitrary form.
+
© 1999-2024, Morten Hjorth-Jensen. Released under CC Attribution-NonCommercial 4.0 license
diff --git a/doc/pub/week15/html/week15.html b/doc/pub/week15/html/week15.html
index 9bb01f5f..39303bd7 100644
--- a/doc/pub/week15/html/week15.html
+++ b/doc/pub/week15/html/week15.html
@@ -181,6 +181,14 @@
None,
'explicit-expression-for-the-derivative'),
('Final expression', 2, None, 'final-expression'),
+ ('Kullback-Leibler divergence',
+ 2,
+ None,
+ 'kullback-leibler-divergence'),
+ ('Jensen-Shannon divergence',
+ 2,
+ None,
+ 'jensen-shannon-divergence'),
('Generative model, basic overview (Borrowed from Rashcka et '
'al)',
2,
@@ -240,6 +248,15 @@
2,
None,
'steps-in-building-a-gan-borrowed-from-rashcka-et-al'),
+ ('Generative Adversarial Networks',
+ 2,
+ None,
+ 'generative-adversarial-networks'),
+ ('Optimal value for $D$', 2, None, 'optimal-value-for-d'),
+ ('What does the Loss Function Represent?',
+ 2,
+ None,
+ 'what-does-the-loss-function-represent'),
('More references', 2, None, 'more-references'),
('Writing Our First Generative Adversarial Network',
2,
@@ -322,7 +339,7 @@ Plans for the week of April 2
- Summary of Variational Autoencoders
- Generative Adversarial Networks (GANs), see https://lilianweng.github.io/posts/2017-08-20-gan/ for nice overview
-- Start discussion of diffusion models
+- Start discussion of diffusion models, motivation
- Video of lecture
- Whiteboard notes
@@ -545,6 +562,48 @@ Final expression
sampling as the standard sampling rule.
+
+Kullback-Leibler divergence
+
+
+Before we continue, we need to remind ourselves about the
+Kullback-Leibler divergence introduced earlier. This will also allow
+us to introduce another measure used in connection with the training
+of Generative Adversarial Networks, the so-called Jensen-Shannon divergence.
+These metrics are useful for quantifying the similarity between two probability distributions.
+
+
+The Kullback–Leibler (KL) divergence, labeled \( D_{KL} \), measures how one probability distribution \( p \) diverges from a second expected probability distribution \( q \),
+that is
+
+$$
+D_{KL}(p \| q) = \int_x p(x) \log \frac{p(x)}{q(x)} dx.
+$$
+
+The KL divergence \( D_{KL} \) achieves its minimum of zero when \( p(x) = q(x) \) everywhere.
+
+Note that the KL divergence is asymmetric. In cases where \( p(x) \) is
+close to zero but \( q(x) \) is significantly non-zero, \( q \)'s effect
+is disregarded. This can give misleading results when we just want to
+measure the similarity between two equally important distributions.
+
+
+
+Jensen-Shannon divergence
+
+The Jensen–Shannon (JS) divergence is another measure of similarity between
+two probability distributions, bounded by \( [0, \log 2] \) (or \( [0, 1] \)
+if base-2 logarithms are used). The JS divergence is
+symmetric and smoother than the KL divergence.
+It is defined as
+
+$$
+D_{JS}(p \| q) = \frac{1}{2} D_{KL}(p \| \frac{p + q}{2}) + \frac{1}{2} D_{KL}(q \| \frac{p + q}{2})
+$$
+
+Many practitioners believe that one reason behind GANs' big success is
+switching the loss function from the asymmetric KL divergence of the
+traditional maximum-likelihood approach to the symmetric JS divergence.
+
+
Generative model, basic overview (Borrowed from Rashcka et al)
@@ -637,14 +696,14 @@ The derivation from last week
$$
\begin{align*}
-\log p(\boldsymbol{x}) & = \log p(\boldsymbol{x}) \int q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})dz && \text{(Multiply by $1 = \int q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})d\boldsymbol{h}$)}\\
- & = \int q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})(\log p(\boldsymbol{x}))dz && \text{(Bring evidence into integral)}\\
+\log p(\boldsymbol{x}) & = \log p(\boldsymbol{x}) \int q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})dh && \text{(Multiply by $1 = \int q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})d\boldsymbol{h}$)}\\
+ & = \int q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})(\log p(\boldsymbol{x}))dh && \text{(Bring evidence into integral)}\\
& = \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log p(\boldsymbol{x})\right] && \text{(Definition of Expectation)}\\
& = \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log\frac{p(\boldsymbol{x}, \boldsymbol{h})}{p(\boldsymbol{h}|\boldsymbol{x})}\right]&& \\
& = \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log\frac{p(\boldsymbol{x}, \boldsymbol{h})q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}{p(\boldsymbol{h}|\boldsymbol{x})q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\right]&& \text{(Multiply by $1 = \frac{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}$)}\\
& = \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log\frac{p(\boldsymbol{x}, \boldsymbol{h})}{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\right] + \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log\frac{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}{p(\boldsymbol{h}|\boldsymbol{x})}\right] && \text{(Split the Expectation)}\\
& = \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log\frac{p(\boldsymbol{x}, \boldsymbol{h})}{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\right] +
- KL(q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})\vert\vert p(\boldsymbol{h}|\boldsymbol{x})) && \text{(Definition of KL Divergence)}\\
+ D_{KL}(q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})\vert\vert p(\boldsymbol{h}|\boldsymbol{x})) && \text{(Definition of KL Divergence)}\\
& \geq \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log\frac{p(\boldsymbol{x}, \boldsymbol{h})}{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\right] && \text{(KL Divergence always $\geq 0$)}
\end{align*}
$$
@@ -685,7 +744,7 @@ Dissecting the equations
{\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log\frac{p(\boldsymbol{x}, \boldsymbol{h})}{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\right]}
&= {\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log\frac{p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{h})p(\boldsymbol{h})}{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\right]} && {\text{(Chain Rule of Probability)}}\\
&= {\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{h})\right] + \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log\frac{p(\boldsymbol{h})}{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\right]} && {\text{(Split the Expectation)}}\\
-&= \underbrace{{\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{h})\right]}}_\text{reconstruction term} - \underbrace{{KL(q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\vert\vert{p(\boldsymbol{h}))}}_\text{prior matching term} && {\text{(Definition of KL Divergence)}}
+&= \underbrace{{\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{h})\right]}}_\text{reconstruction term} - \underbrace{{D_{KL}(q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\vert\vert{p(\boldsymbol{h}))}}_\text{prior matching term} && {\text{(Definition of KL Divergence)}}
\end{align*}
$$
@@ -734,7 +793,7 @@ Analytical evaluation
Then, the KL divergence term of the ELBO can be computed analytically, and the reconstruction term can be approximated using a Monte Carlo estimate. Our objective can then be rewritten as:
$$
\begin{align*}
- \mathrm{argmax}_{\boldsymbol{\phi}, \boldsymbol{\theta}} \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{h})\right] - KL(q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})\vert\vert p(\boldsymbol{h})) \approx \mathrm{argmax}_{\boldsymbol{\phi}, \boldsymbol{\theta}} \sum_{l=1}^{L}\log p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{h}^{(l)}) - KL(q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})\vert\vert p(\boldsymbol{h}))
+ \mathrm{argmax}_{\boldsymbol{\phi}, \boldsymbol{\theta}} \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{h})\right] - D_{KL}(q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})\vert\vert p(\boldsymbol{h})) \approx \mathrm{argmax}_{\boldsymbol{\phi}, \boldsymbol{\theta}} \sum_{l=1}^{L}\log p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{h}^{(l)}) - D_{KL}(q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})\vert\vert p(\boldsymbol{h}))
\end{align*}
$$
@@ -992,6 +1051,140 @@ Steps in building a
+
+Generative Adversarial Networks
+Generative adversarial networks (GANs) have shown great results in
+many generative tasks that replicate rich real-world content such as
+images, human language, and music. The approach is inspired by game theory: two
+models, a generator and a discriminator, compete with each other while
+making each other stronger at the same time. However, it is rather
+challenging to train a GAN model; training often suffers from
+instability or failure to converge.
+
+
+Generative adversarial networks consist of two models (in their simplest form, two opposing feed-forward neural networks):
+
+- A discriminator \( D \) estimates the probability of a given sample coming from the real dataset. It works as a critic and is optimized to tell the fake samples from the real ones.
+- A generator \( G \) outputs synthetic samples given a noise variable input \( z \) (\( z \) brings in potential output diversity). It is trained to capture the real data distribution in order to generate samples that are as realistic as possible, or in other words, that trick the discriminator into outputting a high probability.
+
+At the end of the training, the generator can be used to generate, for
+example, new images. In this sense we have trained a model which can
+produce new samples. We say that we have implicitly defined a
+probability distribution.
+
+
+These two models compete against each other during the training
+process: the generator \( G \) is trying hard to trick the discriminator,
+while the critic model \( D \) is trying hard not to be fooled. This
+zero-sum game between the two models motivates both to improve
+their performance.
+
+
+
+
+
+
+
+
+On one hand, we want to make sure the discriminator \( D \)'s decisions
+over real data are accurate by maximizing \( \mathbb{E}_{x \sim
+p_{r}(x)} [\log D(x)] \). Meanwhile, given a fake sample \( G(z), z \sim
+p_z(z) \), the discriminator is expected to output a probability,
+\( D(G(z)) \), close to zero by maximizing \( \mathbb{E}_{z \sim p_{z}(z)}
+[\log (1 - D(G(z)))] \).
+
+
+On the other hand, the generator is trained to increase the chances of
+\( D \) producing a high probability for a fake example, thus to minimize
+\( \mathbb{E}_{z \sim p_{z}(z)} [\log (1 - D(G(z)))] \).
+
+
+When combining both aspects together, \( D \) and \( G \) are playing a \textit{minimax game} in which we should optimize the following loss function:
+
+$$
+\begin{aligned}
+\min_G \max_D L(D, G)
+& = \mathbb{E}_{x \sim p_{r}(x)} [\log D(x)] + \mathbb{E}_{z \sim p_z(z)} [\log(1 - D(G(z)))] \\
+& = \mathbb{E}_{x \sim p_{r}(x)} [\log D(x)] + \mathbb{E}_{x \sim p_g(x)} [\log(1 - D(x))]
+\end{aligned}
+$$
+
+where \( \mathbb{E}_{x \sim p_{r}(x)} [\log D(x)] \) has no impact on \( G \) during gradient descent updates.
+
+
+Optimal value for \( D \)
+
+Now we have a well-defined loss function. Let us first examine the optimal value for \( D \).
+
+$$
+L(G, D) = \int_x \bigg( p_{r}(x) \log(D(x)) + p_g (x) \log(1 - D(x)) \bigg) dx
+$$
+
+Since we are interested in the value of \( D(x) \) that maximizes \( L(G, D) \), let us label
+
+$$
+\tilde{x} = D(x),
+A=p_{r}(x),
+B=p_g(x)
+$$
+
+The integrand (which we can maximize pointwise for each \( x \), thereby maximizing the integral itself) is:
+
+$$
+\begin{align*}
+f(\tilde{x})
+& = A \log\tilde{x} + B \log(1-\tilde{x}) \\
+\frac{d f(\tilde{x})}{d \tilde{x}}
+& = \frac{A}{\tilde{x}} - \frac{B}{1-\tilde{x}} \\
+& = \frac{A - (A + B)\tilde{x}}{\tilde{x} (1 - \tilde{x})}
+\end{align*}
+$$
+
+Thus, setting \( \frac{d f(\tilde{x})}{d \tilde{x}} = 0 \), we get the best value of the discriminator: \( D^*(x) = \tilde{x}^* = \frac{A}{A + B} = \frac{p_{r}(x)}{p_{r}(x) + p_g(x)} \in [0, 1] \).
+Once the generator is trained to its optimum, \( p_g \) gets very close to \( p_{r} \). When \( p_g = p_{r} \), \( D^*(x) \) becomes \( 1/2 \).
+
+
+When both \( G \) and \( D \) are at their optimal values, we have \( p_g = p_{r} \) and \( D^*(x) = 1/2 \) and the loss function becomes:
+
+$$
+\begin{align*}
+L(G, D^*)
+&= \int_x \bigg( p_{r}(x) \log(D^*(x)) + p_g (x) \log(1 - D^*(x)) \bigg) dx \\
+&= \log \frac{1}{2} \int_x p_{r}(x) dx + \log \frac{1}{2} \int_x p_g(x) dx \\
+&= -2\log2
+\end{align*}
+$$
+
+
+
+What does the Loss Function Represent?
+
+The JS divergence between \( p_{r} \) and \( p_g \) can be computed as:
+
+$$
+\begin{align*}
+D_{JS}(p_{r} \| p_g)
+=& \frac{1}{2} D_{KL}(p_{r} \| \frac{p_{r} + p_g}{2}) + \frac{1}{2} D_{KL}(p_{g} \| \frac{p_{r} + p_g}{2}) \\
+=& \frac{1}{2} \bigg( \log2 + \int_x p_{r}(x) \log \frac{p_{r}(x)}{p_{r}(x) + p_g(x)} dx \bigg) + \\& \frac{1}{2} \bigg( \log2 + \int_x p_g(x) \log \frac{p_g(x)}{p_{r}(x) + p_g(x)} dx \bigg) \\
+=& \frac{1}{2} \bigg( \log4 + L(G, D^*) \bigg)
+\end{align*}
+$$
+
+Thus,
+
+$$
+L(G, D^*) = 2D_{JS}(p_{r} \| p_g) - 2\log2
+$$
+
+Essentially, the loss function of a GAN quantifies the similarity between
+the generated data distribution \( p_g \) and the real sample
+distribution \( p_{r} \) by the JS divergence when the discriminator is
+optimal. The best \( G^* \) that replicates the real data distribution
+leads to the minimum \( L(G^*, D^*) = -2\log2 \), in agreement with the
+equations above.
+
+
More references
@@ -1707,6 +1900,7 @@ Diffusion learning
smooth target distribution, this method can capture data distributions
of arbitrary form.
+
© 1999-2024, Morten Hjorth-Jensen. Released under CC Attribution-NonCommercial 4.0 license
diff --git a/doc/pub/week15/ipynb/ipynb-week15-src.tar.gz b/doc/pub/week15/ipynb/ipynb-week15-src.tar.gz
index 831ba04f..77b6090a 100644
Binary files a/doc/pub/week15/ipynb/ipynb-week15-src.tar.gz and b/doc/pub/week15/ipynb/ipynb-week15-src.tar.gz differ
diff --git a/doc/pub/week15/ipynb/week15.ipynb b/doc/pub/week15/ipynb/week15.ipynb
index 46204ccc..b91d3d41 100644
--- a/doc/pub/week15/ipynb/week15.ipynb
+++ b/doc/pub/week15/ipynb/week15.ipynb
@@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "markdown",
- "id": "bbc251d3",
+ "id": "c7cf0fce",
"metadata": {
"editable": true
},
@@ -14,7 +14,7 @@
},
{
"cell_type": "markdown",
- "id": "8154bff2",
+ "id": "e66f7980",
"metadata": {
"editable": true
},
@@ -27,7 +27,7 @@
},
{
"cell_type": "markdown",
- "id": "30bdfd80",
+ "id": "b25a54f4",
"metadata": {
"editable": true
},
@@ -40,7 +40,7 @@
"\n",
"2. Generative Adversarial Networks (GANs), see for nice overview\n",
"\n",
- "3. Start discussion of diffusion models\n",
+ "3. Start discussion of diffusion models, motivation\n",
"\n",
"4. [Video of lecture](https://youtu.be/Cg8n9aWwHuU)\n",
"\n",
@@ -49,7 +49,7 @@
},
{
"cell_type": "markdown",
- "id": "b0ea3794",
+ "id": "7304c3a4",
"metadata": {
"editable": true
},
@@ -65,7 +65,7 @@
},
{
"cell_type": "markdown",
- "id": "32939787",
+ "id": "1f29a278",
"metadata": {
"editable": true
},
@@ -81,7 +81,7 @@
},
{
"cell_type": "markdown",
- "id": "89bf1911",
+ "id": "ad62142e",
"metadata": {
"editable": true
},
@@ -91,7 +91,7 @@
},
{
"cell_type": "markdown",
- "id": "f04c1132",
+ "id": "3c906e0c",
"metadata": {
"editable": true
},
@@ -103,7 +103,7 @@
},
{
"cell_type": "markdown",
- "id": "306c3a9e",
+ "id": "9e92737f",
"metadata": {
"editable": true
},
@@ -115,7 +115,7 @@
},
{
"cell_type": "markdown",
- "id": "38d85424",
+ "id": "ea1bf8fd",
"metadata": {
"editable": true
},
@@ -125,7 +125,7 @@
},
{
"cell_type": "markdown",
- "id": "72d0e5a4",
+ "id": "1d00646e",
"metadata": {
"editable": true
},
@@ -137,7 +137,7 @@
},
{
"cell_type": "markdown",
- "id": "edc35f6e",
+ "id": "a3f6475d",
"metadata": {
"editable": true
},
@@ -149,7 +149,7 @@
},
{
"cell_type": "markdown",
- "id": "eab97d57",
+ "id": "8f95be62",
"metadata": {
"editable": true
},
@@ -163,7 +163,7 @@
},
{
"cell_type": "markdown",
- "id": "e36868b7",
+ "id": "184db513",
"metadata": {
"editable": true
},
@@ -175,7 +175,7 @@
},
{
"cell_type": "markdown",
- "id": "05ccd2ab",
+ "id": "a9270d8e",
"metadata": {
"editable": true
},
@@ -187,7 +187,7 @@
},
{
"cell_type": "markdown",
- "id": "e5f7b578",
+ "id": "e58357fd",
"metadata": {
"editable": true
},
@@ -199,7 +199,7 @@
},
{
"cell_type": "markdown",
- "id": "fdc48043",
+ "id": "6b570b28",
"metadata": {
"editable": true
},
@@ -209,7 +209,7 @@
},
{
"cell_type": "markdown",
- "id": "32dd3ff0",
+ "id": "c938f39d",
"metadata": {
"editable": true
},
@@ -221,7 +221,7 @@
},
{
"cell_type": "markdown",
- "id": "5e333dc2",
+ "id": "13461367",
"metadata": {
"editable": true
},
@@ -236,7 +236,7 @@
},
{
"cell_type": "markdown",
- "id": "9d4996e0",
+ "id": "1050da2f",
"metadata": {
"editable": true
},
@@ -248,7 +248,7 @@
},
{
"cell_type": "markdown",
- "id": "bfbcdfd5",
+ "id": "094c2485",
"metadata": {
"editable": true
},
@@ -258,7 +258,7 @@
},
{
"cell_type": "markdown",
- "id": "2bac65fd",
+ "id": "7a888b05",
"metadata": {
"editable": true
},
@@ -270,7 +270,7 @@
},
{
"cell_type": "markdown",
- "id": "f513d787",
+ "id": "27b7e939",
"metadata": {
"editable": true
},
@@ -283,7 +283,7 @@
},
{
"cell_type": "markdown",
- "id": "7344a35e",
+ "id": "0a367b8d",
"metadata": {
"editable": true
},
@@ -295,7 +295,7 @@
},
{
"cell_type": "markdown",
- "id": "135b8bb8",
+ "id": "67e2aac1",
"metadata": {
"editable": true
},
@@ -307,7 +307,7 @@
},
{
"cell_type": "markdown",
- "id": "aea87bfd",
+ "id": "552dd312",
"metadata": {
"editable": true
},
@@ -317,7 +317,7 @@
},
{
"cell_type": "markdown",
- "id": "cce8712d",
+ "id": "3bc3b94c",
"metadata": {
"editable": true
},
@@ -329,7 +329,7 @@
},
{
"cell_type": "markdown",
- "id": "5bcfef5b",
+ "id": "008ec2d0",
"metadata": {
"editable": true
},
@@ -341,7 +341,7 @@
},
{
"cell_type": "markdown",
- "id": "665445d0",
+ "id": "90a0b8a3",
"metadata": {
"editable": true
},
@@ -353,7 +353,7 @@
},
{
"cell_type": "markdown",
- "id": "bdc994e9",
+ "id": "fce8a947",
"metadata": {
"editable": true
},
@@ -364,7 +364,7 @@
},
{
"cell_type": "markdown",
- "id": "97db2763",
+ "id": "aed7c7b0",
"metadata": {
"editable": true
},
@@ -376,7 +376,7 @@
},
{
"cell_type": "markdown",
- "id": "6274f6f9",
+ "id": "f59709c5",
"metadata": {
"editable": true
},
@@ -390,7 +390,7 @@
},
{
"cell_type": "markdown",
- "id": "a8c268f3",
+ "id": "b6a46f04",
"metadata": {
"editable": true
},
@@ -402,7 +402,7 @@
},
{
"cell_type": "markdown",
- "id": "5fb324c3",
+ "id": "d309349e",
"metadata": {
"editable": true
},
@@ -412,7 +412,7 @@
},
{
"cell_type": "markdown",
- "id": "3ab2b400",
+ "id": "a29d8581",
"metadata": {
"editable": true
},
@@ -424,7 +424,7 @@
},
{
"cell_type": "markdown",
- "id": "ad53bab5",
+ "id": "d844c4dd",
"metadata": {
"editable": true
},
@@ -436,7 +436,7 @@
},
{
"cell_type": "markdown",
- "id": "80022c49",
+ "id": "684424fc",
"metadata": {
"editable": true
},
@@ -448,7 +448,7 @@
},
{
"cell_type": "markdown",
- "id": "ab06bb70",
+ "id": "540c038b",
"metadata": {
"editable": true
},
@@ -459,7 +459,7 @@
},
{
"cell_type": "markdown",
- "id": "7f086749",
+ "id": "1dd6f171",
"metadata": {
"editable": true
},
@@ -474,7 +474,7 @@
},
{
"cell_type": "markdown",
- "id": "65f4974d",
+ "id": "99fdc241",
"metadata": {
"editable": true
},
@@ -486,7 +486,7 @@
},
{
"cell_type": "markdown",
- "id": "01b6074d",
+ "id": "0366d1a3",
"metadata": {
"editable": true
},
@@ -498,7 +498,7 @@
},
{
"cell_type": "markdown",
- "id": "abc68191",
+ "id": "0ad245f3",
"metadata": {
"editable": true
},
@@ -508,7 +508,7 @@
},
{
"cell_type": "markdown",
- "id": "42d55ede",
+ "id": "62a879d0",
"metadata": {
"editable": true
},
@@ -519,7 +519,7 @@
},
{
"cell_type": "markdown",
- "id": "2713bc10",
+ "id": "2dd7dc01",
"metadata": {
"editable": true
},
@@ -531,7 +531,7 @@
},
{
"cell_type": "markdown",
- "id": "bcbc944d",
+ "id": "afd4999e",
"metadata": {
"editable": true
},
@@ -541,7 +541,7 @@
},
{
"cell_type": "markdown",
- "id": "f470f0b7",
+ "id": "610cb88e",
"metadata": {
"editable": true
},
@@ -553,7 +553,7 @@
},
{
"cell_type": "markdown",
- "id": "6cc4dc33",
+ "id": "3093f6b2",
"metadata": {
"editable": true
},
@@ -565,7 +565,7 @@
},
{
"cell_type": "markdown",
- "id": "b8d79f98",
+ "id": "ffac2f38",
"metadata": {
"editable": true
},
@@ -577,7 +577,7 @@
},
{
"cell_type": "markdown",
- "id": "133d8b3f",
+ "id": "c3d13ab4",
"metadata": {
"editable": true
},
@@ -589,7 +589,7 @@
},
{
"cell_type": "markdown",
- "id": "9fc2ebb7",
+ "id": "28f81e5e",
"metadata": {
"editable": true
},
@@ -601,7 +601,7 @@
},
{
"cell_type": "markdown",
- "id": "431a67c4",
+ "id": "62fc9e76",
"metadata": {
"editable": true
},
@@ -611,7 +611,7 @@
},
{
"cell_type": "markdown",
- "id": "3df67560",
+ "id": "f4c94f15",
"metadata": {
"editable": true
},
@@ -623,7 +623,7 @@
},
{
"cell_type": "markdown",
- "id": "0838e65c",
+ "id": "8d9acf3f",
"metadata": {
"editable": true
},
@@ -633,7 +633,7 @@
},
{
"cell_type": "markdown",
- "id": "822bf5ac",
+ "id": "b15612ff",
"metadata": {
"editable": true
},
@@ -645,7 +645,7 @@
},
{
"cell_type": "markdown",
- "id": "a93722a2",
+ "id": "e1390374",
"metadata": {
"editable": true
},
@@ -656,7 +656,92 @@
},
{
"cell_type": "markdown",
- "id": "d6d8c42e",
+ "id": "f391425b",
+ "metadata": {
+ "editable": true
+ },
+ "source": [
+ "## Kullback-Leibler divergence\n",
+ "\n",
+ "Before we continue, we need to remind ourselves about the\n",
+ "Kullback-Leibler divergence introduced earlier. This will also allow\n",
+ "us to introduce another measure used in connection with the training\n",
+ "of Generative Adversarial Networks, the so-called Jensen-Shannon divergence..\n",
+ "These metrics are useful for quantifying the similarity between two probability distributions.\n",
+ "\n",
+ "The Kullback–Leibler (KL) divergence, labeled $D_{KL}$, measures how one probability distribution $p$ diverges from a second expected probability distribution $q$,\n",
+ "that is"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "71586dc3",
+ "metadata": {
+ "editable": true
+ },
+ "source": [
+ "$$\n",
+ "D_{KL}(p \\| q) = \\int_x p(x) \\log \\frac{p(x)}{q(x)} dx.\n",
+ "$$"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6042b046",
+ "metadata": {
+ "editable": true
+ },
+ "source": [
+ "The KL-divegrnece $D_{KL}$ achieves the minimum zero when $p(x) == q(x)$ everywhere.\n",
+ "\n",
+ "Note that the KL divergence is asymmetric. In cases where $p(x)$ is\n",
+ "close to zero, but $q(x)$ is significantly non-zero, the $q$'s effect\n",
+ "is disregarded. It could cause buggy results when we just want to\n",
+ "measure the similarity between two equally important distributions."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d20cf644",
+ "metadata": {
+ "editable": true
+ },
+ "source": [
+ "## Jensen-Shannon divergence\n",
+ "\n",
+ "The Jensen–Shannon (JS) divergence is another measure of similarity between\n",
+ "two probability distributions, bounded by $[0, 1]$. The JS-divergence is\n",
+ "symmetric and more smooth than the KL-divergence.\n",
+ "It is defined as"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1d9ded24",
+ "metadata": {
+ "editable": true
+ },
+ "source": [
+ "$$\n",
+ "D_{JS}(p \\| q) = \\frac{1}{2} D_{KL}(p \\| \\frac{p + q}{2}) + \\frac{1}{2} D_{KL}(q \\| \\frac{p + q}{2})\n",
+ "$$"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e42178b6",
+ "metadata": {
+ "editable": true
+ },
+ "source": [
+ "Many practitioners believe that one reason behind GANs' big success is\n",
+ "switching the loss function from asymmetric KL-divergence in\n",
+ "traditional maximum-likelihood approach to symmetric JS-divergence."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "09f95d9a",
"metadata": {
"editable": true
},
@@ -672,7 +757,7 @@
},
{
"cell_type": "markdown",
- "id": "f17356a3",
+ "id": "18a2b94d",
"metadata": {
"editable": true
},
@@ -691,7 +776,7 @@
},
{
"cell_type": "markdown",
- "id": "27fb099f",
+ "id": "19583482",
"metadata": {
"editable": true
},
@@ -703,7 +788,7 @@
},
{
"cell_type": "markdown",
- "id": "17548236",
+ "id": "0836982d",
"metadata": {
"editable": true
},
@@ -713,7 +798,7 @@
},
{
"cell_type": "markdown",
- "id": "3f05b416",
+ "id": "ae4be538",
"metadata": {
"editable": true
},
@@ -725,7 +810,7 @@
},
{
"cell_type": "markdown",
- "id": "2cf43c02",
+ "id": "7ffb8c95",
"metadata": {
"editable": true
},
@@ -735,7 +820,7 @@
},
{
"cell_type": "markdown",
- "id": "4c9b8c03",
+ "id": "62aa7b51",
"metadata": {
"editable": true
},
@@ -757,7 +842,7 @@
},
{
"cell_type": "markdown",
- "id": "587e829b",
+ "id": "f8b3736f",
"metadata": {
"editable": true
},
@@ -768,7 +853,7 @@
},
{
"cell_type": "markdown",
- "id": "6cec0936",
+ "id": "780e9d0f",
"metadata": {
"editable": true
},
@@ -780,7 +865,7 @@
},
{
"cell_type": "markdown",
- "id": "008b0d93",
+ "id": "216c8f3a",
"metadata": {
"editable": true
},
@@ -790,7 +875,7 @@
},
{
"cell_type": "markdown",
- "id": "03ec09b6",
+ "id": "05117498",
"metadata": {
"editable": true
},
@@ -802,7 +887,7 @@
},
{
"cell_type": "markdown",
- "id": "ee878746",
+ "id": "e095ece6",
"metadata": {
"editable": true
},
@@ -823,7 +908,7 @@
},
{
"cell_type": "markdown",
- "id": "be988b7e",
+ "id": "447b2be9",
"metadata": {
"editable": true
},
@@ -835,21 +920,21 @@
},
{
"cell_type": "markdown",
- "id": "d06829bb",
+ "id": "709128f2",
"metadata": {
"editable": true
},
"source": [
"$$\n",
"\\begin{align*}\n",
- "\\log p(\\boldsymbol{x}) & = \\log p(\\boldsymbol{x}) \\int q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})dz && \\text{(Multiply by $1 = \\int q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})d\\boldsymbol{h}$)}\\\\\n",
- " & = \\int q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})(\\log p(\\boldsymbol{x}))dz && \\text{(Bring evidence into integral)}\\\\\n",
+ "\\log p(\\boldsymbol{x}) & = \\log p(\\boldsymbol{x}) \\int q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})dh && \\text{(Multiply by $1 = \\int q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})d\\boldsymbol{h}$)}\\\\\n",
+ " & = \\int q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})(\\log p(\\boldsymbol{x}))dh && \\text{(Bring evidence into integral)}\\\\\n",
" & = \\mathbb{E}_{q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})}\\left[\\log p(\\boldsymbol{x})\\right] && \\text{(Definition of Expectation)}\\\\\n",
" & = \\mathbb{E}_{q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})}\\left[\\log\\frac{p(\\boldsymbol{x}, \\boldsymbol{h})}{p(\\boldsymbol{h}|\\boldsymbol{x})}\\right]&& \\\\\n",
" & = \\mathbb{E}_{q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})}\\left[\\log\\frac{p(\\boldsymbol{x}, \\boldsymbol{h})q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})}{p(\\boldsymbol{h}|\\boldsymbol{x})q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})}\\right]&& \\text{(Multiply by $1 = \\frac{q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})}{q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})}$)}\\\\\n",
" & = \\mathbb{E}_{q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})}\\left[\\log\\frac{p(\\boldsymbol{x}, \\boldsymbol{h})}{q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})}\\right] + \\mathbb{E}_{q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})}\\left[\\log\\frac{q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})}{p(\\boldsymbol{h}|\\boldsymbol{x})}\\right] && \\text{(Split the Expectation)}\\\\\n",
" & = \\mathbb{E}_{q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})}\\left[\\log\\frac{p(\\boldsymbol{x}, \\boldsymbol{h})}{q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})}\\right] +\n",
- "\t KL(q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})\\vert\\vert p(\\boldsymbol{h}|\\boldsymbol{x})) && \\text{(Definition of KL Divergence)}\\\\\n",
+ "\t D_{KL}(q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})\\vert\\vert p(\\boldsymbol{h}|\\boldsymbol{x})) && \\text{(Definition of KL Divergence)}\\\\\n",
" & \\geq \\mathbb{E}_{q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})}\\left[\\log\\frac{p(\\boldsymbol{x}, \\boldsymbol{h})}{q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})}\\right] && \\text{(KL Divergence always $\\geq 0$)}\n",
"\\end{align*}\n",
"$$"
@@ -857,7 +942,7 @@
},
{
"cell_type": "markdown",
- "id": "f165c01c",
+ "id": "2840182f",
"metadata": {
"editable": true
},
@@ -875,7 +960,7 @@
},
{
"cell_type": "markdown",
- "id": "a5a5ba44",
+ "id": "a19f85d0",
"metadata": {
"editable": true
},
@@ -893,7 +978,7 @@
},
{
"cell_type": "markdown",
- "id": "5a0edb20",
+ "id": "8dd0ce7b",
"metadata": {
"editable": true
},
@@ -905,7 +990,7 @@
},
{
"cell_type": "markdown",
- "id": "066fb151",
+ "id": "0e1892c3",
"metadata": {
"editable": true
},
@@ -915,14 +1000,14 @@
"{\\mathbb{E}_{q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})}\\left[\\log\\frac{p(\\boldsymbol{x}, \\boldsymbol{h})}{q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})}\\right]}\n",
"&= {\\mathbb{E}_{q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})}\\left[\\log\\frac{p_{\\boldsymbol{\\theta}}(\\boldsymbol{x}|\\boldsymbol{h})p(\\boldsymbol{h})}{q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})}\\right]} && {\\text{(Chain Rule of Probability)}}\\\\\n",
"&= {\\mathbb{E}_{q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})}\\left[\\log p_{\\boldsymbol{\\theta}}(\\boldsymbol{x}|\\boldsymbol{h})\\right] + \\mathbb{E}_{q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})}\\left[\\log\\frac{p(\\boldsymbol{h})}{q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})}\\right]} && {\\text{(Split the Expectation)}}\\\\\n",
- "&= \\underbrace{{\\mathbb{E}_{q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})}\\left[\\log p_{\\boldsymbol{\\theta}}(\\boldsymbol{x}|\\boldsymbol{h})\\right]}}_\\text{reconstruction term} - \\underbrace{{KL(q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})}\\vert\\vert{p(\\boldsymbol{h}))}}_\\text{prior matching term} && {\\text{(Definition of KL Divergence)}}\n",
+ "&= \\underbrace{{\\mathbb{E}_{q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})}\\left[\\log p_{\\boldsymbol{\\theta}}(\\boldsymbol{x}|\\boldsymbol{h})\\right]}}_\\text{reconstruction term} - \\underbrace{{D_{KL}(q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})}\\vert\\vert{p(\\boldsymbol{h}))}}_\\text{prior matching term} && {\\text{(Definition of KL Divergence)}}\n",
"\\end{align*}\n",
"$$"
]
},
{
"cell_type": "markdown",
- "id": "b1c188ac",
+ "id": "c87201d3",
"metadata": {
"editable": true
},
@@ -940,7 +1025,7 @@
},
{
"cell_type": "markdown",
- "id": "61b1725e",
+ "id": "bfab7537",
"metadata": {
"editable": true
},
@@ -960,7 +1045,7 @@
},
{
"cell_type": "markdown",
- "id": "d134dd70",
+ "id": "a3919404",
"metadata": {
"editable": true
},
@@ -972,7 +1057,7 @@
},
{
"cell_type": "markdown",
- "id": "c914ac07",
+ "id": "4c967472",
"metadata": {
"editable": true
},
@@ -987,7 +1072,7 @@
},
{
"cell_type": "markdown",
- "id": "182be149",
+ "id": "52a6f5da",
"metadata": {
"editable": true
},
@@ -999,21 +1084,21 @@
},
{
"cell_type": "markdown",
- "id": "d26c7329",
+ "id": "d57e59e7",
"metadata": {
"editable": true
},
"source": [
"$$\n",
"\\begin{align*}\n",
- " \\mathrm{argmax}_{\\boldsymbol{\\phi}, \\boldsymbol{\\theta}} \\mathbb{E}_{q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})}\\left[\\log p_{\\boldsymbol{\\theta}}(\\boldsymbol{x}|\\boldsymbol{h})\\right] - KL(q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})\\vert\\vert p(\\boldsymbol{h})) \\approx \\mathrm{argmax}_{\\boldsymbol{\\phi}, \\boldsymbol{\\theta}} \\sum_{l=1}^{L}\\log p_{\\boldsymbol{\\theta}}(\\boldsymbol{x}|\\boldsymbol{h}^{(l)}) - KL(q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})\\vert\\vert p(\\boldsymbol{h}))\n",
+ " \\mathrm{argmax}_{\\boldsymbol{\\phi}, \\boldsymbol{\\theta}} \\mathbb{E}_{q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})}\\left[\\log p_{\\boldsymbol{\\theta}}(\\boldsymbol{x}|\\boldsymbol{h})\\right] - D_{KL}(q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})\\vert\\vert p(\\boldsymbol{h})) \\approx \\mathrm{argmax}_{\\boldsymbol{\\phi}, \\boldsymbol{\\theta}} \\sum_{l=1}^{L}\\log p_{\\boldsymbol{\\theta}}(\\boldsymbol{x}|\\boldsymbol{h}^{(l)}) - D_{KL}(q_{\\boldsymbol{\\phi}}(\\boldsymbol{h}|\\boldsymbol{x})\\vert\\vert p(\\boldsymbol{h}))\n",
"\\end{align*}\n",
"$$"
]
},
{
"cell_type": "markdown",
- "id": "2a89d0f3",
+ "id": "921da238",
"metadata": {
"editable": true
},
@@ -1023,7 +1108,7 @@
},
{
"cell_type": "markdown",
- "id": "7666826e",
+ "id": "00966db8",
"metadata": {
"editable": true
},
@@ -1040,7 +1125,7 @@
},
{
"cell_type": "markdown",
- "id": "a6ff50b2",
+ "id": "3b9fb44a",
"metadata": {
"editable": true
},
@@ -1057,7 +1142,7 @@
},
{
"cell_type": "markdown",
- "id": "1300c538",
+ "id": "93627bb2",
"metadata": {
"editable": true
},
@@ -1071,7 +1156,7 @@
},
{
"cell_type": "markdown",
- "id": "ebd27757",
+ "id": "82348034",
"metadata": {
"editable": true
},
@@ -1089,7 +1174,7 @@
},
{
"cell_type": "markdown",
- "id": "624b1963",
+ "id": "29eb0946",
"metadata": {
"editable": true
},
@@ -1101,7 +1186,7 @@
},
{
"cell_type": "markdown",
- "id": "eaf66572",
+ "id": "c96afc97",
"metadata": {
"editable": true
},
@@ -1115,7 +1200,7 @@
},
{
"cell_type": "markdown",
- "id": "f4d09b90",
+ "id": "9a77743c",
"metadata": {
"editable": true
},
@@ -1131,7 +1216,7 @@
},
{
"cell_type": "markdown",
- "id": "df030b4e",
+ "id": "8059d00f",
"metadata": {
"editable": true
},
@@ -1150,7 +1235,7 @@
},
{
"cell_type": "markdown",
- "id": "179d5913",
+ "id": "68e50592",
"metadata": {
"editable": true
},
@@ -1168,7 +1253,7 @@
},
{
"cell_type": "markdown",
- "id": "e954e4d0",
+ "id": "9fe155ca",
"metadata": {
"editable": true
},
@@ -1189,7 +1274,7 @@
},
{
"cell_type": "markdown",
- "id": "9912338d",
+ "id": "834bb255",
"metadata": {
"editable": true
},
@@ -1201,7 +1286,7 @@
},
{
"cell_type": "markdown",
- "id": "9fadae1d",
+ "id": "4f134c60",
"metadata": {
"editable": true
},
@@ -1226,7 +1311,7 @@
},
{
"cell_type": "markdown",
- "id": "d205f4c0",
+ "id": "db32ac68",
"metadata": {
"editable": true
},
@@ -1242,7 +1327,7 @@
},
{
"cell_type": "markdown",
- "id": "8afeef95",
+ "id": "ba47c9c1",
"metadata": {
"editable": true
},
@@ -1260,7 +1345,7 @@
},
{
"cell_type": "markdown",
- "id": "6e728830",
+ "id": "fb895c87",
"metadata": {
"editable": true
},
@@ -1272,7 +1357,7 @@
},
{
"cell_type": "markdown",
- "id": "5318fb35",
+ "id": "444ccf63",
"metadata": {
"editable": true
},
@@ -1288,7 +1373,7 @@
},
{
"cell_type": "markdown",
- "id": "70ed2bca",
+ "id": "1a3f4662",
"metadata": {
"editable": true
},
@@ -1300,7 +1385,7 @@
},
{
"cell_type": "markdown",
- "id": "b2f662ca",
+ "id": "7ddadfca",
"metadata": {
"editable": true
},
@@ -1311,7 +1396,7 @@
},
{
"cell_type": "markdown",
- "id": "ca69c67d",
+ "id": "a244b68b",
"metadata": {
"editable": true
},
@@ -1325,7 +1410,7 @@
},
{
"cell_type": "markdown",
- "id": "4d8a5c1f",
+ "id": "0dcc133a",
"metadata": {
"editable": true
},
@@ -1337,7 +1422,7 @@
},
{
"cell_type": "markdown",
- "id": "eaeafa28",
+ "id": "1fb4338d",
"metadata": {
"editable": true
},
@@ -1348,7 +1433,7 @@
},
{
"cell_type": "markdown",
- "id": "f35d0439",
+ "id": "c41ac754",
"metadata": {
"editable": true
},
@@ -1360,7 +1445,7 @@
},
{
"cell_type": "markdown",
- "id": "e9df164f",
+ "id": "8f6aa934",
"metadata": {
"editable": true
},
@@ -1383,7 +1468,7 @@
},
{
"cell_type": "markdown",
- "id": "ac025bad",
+ "id": "ab5ce09f",
"metadata": {
"editable": true
},
@@ -1400,7 +1485,7 @@
},
{
"cell_type": "markdown",
- "id": "1548d332",
+ "id": "30d20331",
"metadata": {
"editable": true
},
@@ -1413,7 +1498,7 @@
},
{
"cell_type": "markdown",
- "id": "415e7020",
+ "id": "a4aed469",
"metadata": {
"editable": true
},
@@ -1424,7 +1509,7 @@
},
{
"cell_type": "markdown",
- "id": "afa15d31",
+ "id": "7dc442ca",
"metadata": {
"editable": true
},
@@ -1438,7 +1523,7 @@
},
{
"cell_type": "markdown",
- "id": "db584443",
+ "id": "00bbdf9f",
"metadata": {
"editable": true
},
@@ -1451,7 +1536,7 @@
},
{
"cell_type": "markdown",
- "id": "65ecceb4",
+ "id": "f328f999",
"metadata": {
"editable": true
},
@@ -1463,7 +1548,7 @@
},
{
"cell_type": "markdown",
- "id": "93110eb4",
+ "id": "6a2e4da3",
"metadata": {
"editable": true
},
@@ -1478,7 +1563,7 @@
},
{
"cell_type": "markdown",
- "id": "f5789340",
+ "id": "bcca6687",
"metadata": {
"editable": true
},
@@ -1494,7 +1579,258 @@
},
{
"cell_type": "markdown",
- "id": "4bbe2c2b",
+ "id": "98cb8499",
+ "metadata": {
+ "editable": true
+ },
+ "source": [
+ "## Generative Adversarial Networks\n",
+ "Generative adversarial networks (GANs) have shown great results in\n",
+ "many generative tasks to replicate the real-world rich content such as\n",
+ "images, human language, and music. It is inspired by game theory: two\n",
+ "models, a generator and a discriminator, are competing with each other while\n",
+ "making each other stronger at the same time. However, it is rather\n",
+ "challenging to train a GANs model, \n",
+ "training instability or failure to converge.\n",
+ "\n",
+ "Generative adversarial networks consist of two models (in their simplest form as two opposing feed forward neural networks)\n",
+ "1. A discriminator $D$ estimates the probability of a given sample coming from the real dataset. It works as a critic and is optimized to tell the fake samples from the realo ones\n",
+ "\n",
+ "2. A generator $G$ outputs synthetic samples given a noise variable input $z$ ($z$ brings in potential output diversity). It is trained to capture the real data distribution in order to generate samples that can be as real as possible, or in other words, can trick the discriminator to offer a high probability.\n",
+ "\n",
+ "At the end of the training, the generator can be used to generate for\n",
+ "example new images. In this sense we have trained a model which can\n",
+ "produce new samples. We say that we have implicitely defined a\n",
+ "probability.\n",
+ "\n",
+ "These two models compete against each other during the training\n",
+ "process: the generator $G$ is trying hard to trick the discriminator,\n",
+ "while the critic model $D$ is trying hard not to be cheated. This\n",
+ "interesting zero-sum game between two models motivates both to improve\n",
+ "their functionalities.\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "On one hand, we want to make sure the discriminator $D$'s decisions\n",
+ "over real data are accurate by maximizing $\\mathbb{E}_{x \\sim\n",
+ "p_{r}(x)} [\\log D(x)]$. Meanwhile, given a fake sample $G(z), z \\sim\n",
+ "p_z(z)$, the discriminator is expected to output a probability,\n",
+ "$D(G(z))$, close to zero by maximizing $\\mathbb{E}_{z \\sim p_{z}(z)}\n",
+ "[\\log (1 - D(G(z)))]$.\n",
+ "\n",
+ "On the other hand, the generator is trained to increase the chances of\n",
+ "$D$ producing a high probability for a fake example, thus to minimize\n",
+ "$\\mathbb{E}_{z \\sim p_{z}(z)} [\\log (1 - D(G(z)))]$.\n",
+ "\n",
+ "When combining both aspects together, $D$ and $G$ are playing a \\textit{minimax game} in which we should optimize the following loss function:"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e6af787f",
+ "metadata": {
+ "editable": true
+ },
+ "source": [
+ "$$\n",
+ "\\begin{aligned}\n",
+ "\\min_G \\max_D L(D, G) \n",
+ "& = \\mathbb{E}_{x \\sim p_{r}(x)} [\\log D(x)] + \\mathbb{E}_{z \\sim p_z(z)} [\\log(1 - D(G(z)))] \\\\\n",
+ "& = \\mathbb{E}_{x \\sim p_{r}(x)} [\\log D(x)] + \\mathbb{E}_{x \\sim p_g(x)} [\\log(1 - D(x)]\n",
+ "\\end{aligned}\n",
+ "$$"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d9a15e9a",
+ "metadata": {
+ "editable": true
+ },
+ "source": [
+ "where $\\mathbb{E}_{x \\sim p_{r}(x)} [\\log D(x)]$ has no impact on $G$ during gradient descent updates."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ee24c8ff",
+ "metadata": {
+ "editable": true
+ },
+ "source": [
+ "## Optimal value for $D$\n",
+ "\n",
+ "Now we have a well-defined loss function. Let's first examine what is the best value for $D$."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f811ecdf",
+ "metadata": {
+ "editable": true
+ },
+ "source": [
+ "$$\n",
+ "L(G, D) = \\int_x \\bigg( p_{r}(x) \\log(D(x)) + p_g (x) \\log(1 - D(x)) \\bigg) dx\n",
+ "$$"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "27776912",
+ "metadata": {
+ "editable": true
+ },
+ "source": [
+ "Since we are interested in what is the best value of $D(x)$ to maximize $L(G, D)$, let us label"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6d3b9d73",
+ "metadata": {
+ "editable": true
+ },
+ "source": [
+ "$$\n",
+ "\\tilde{x} = D(x), \n",
+ "A=p_{r}(x), \n",
+ "B=p_g(x)\n",
+ "$$"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b4911af2",
+ "metadata": {
+ "editable": true
+ },
+ "source": [
+ "And then what is inside the integral (we can safely ignore the integral because $x$ is sampled over all the possible values) is:"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5162efda",
+ "metadata": {
+ "editable": true
+ },
+ "source": [
+ "$$\n",
+ "\\begin{align*}\n",
+ "f(\\tilde{x}) \n",
+ "& = A log\\tilde{x} + B log(1-\\tilde{x}) \\\\\n",
+ "\\frac{d f(\\tilde{x})}{d \\tilde{x}}\n",
+ "& = A \\frac{1}{ln10} \\frac{1}{\\tilde{x}} - B \\frac{1}{ln10} \\frac{1}{1 - \\tilde{x}} \\\\\n",
+ "& = \\frac{1}{ln10} (\\frac{A}{\\tilde{x}} - \\frac{B}{1-\\tilde{x}}) \\\\\n",
+ "& = \\frac{1}{ln10} \\frac{A - (A + B)\\tilde{x}}{\\tilde{x} (1 - \\tilde{x})} \\\\\n",
+ "\\end{align*}\n",
+ "$$"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bee57995",
+ "metadata": {
+ "editable": true
+ },
+ "source": [
+ "Thus, set $\\frac{d f(\\tilde{x})}{d \\tilde{x}} = 0$, we get the best value of the discriminator: $D^*(x) = \\tilde{x}^* = \\frac{A}{A + B} = \\frac{p_{r}(x)}{p_{r}(x) + p_g(x)} \\in [0, 1]$.\n",
+ "Once the generator is trained to its optimal, $p_g$ gets very close to $p_{r}$. When $p_g = p_{r}$, $D^*(x)$ becomes $1/2$.\n",
+ "\n",
+ "When both $G$ and $D$ are at their optimal values, we have $p_g = p_{r}$ and $D^*(x) = 1/2$ and the loss function becomes:"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "27ffb307",
+ "metadata": {
+ "editable": true
+ },
+ "source": [
+ "$$\n",
+ "\\begin{align*}\n",
+ "L(G, D^*) \n",
+ "&= \\int_x \\bigg( p_{r}(x) \\log(D^*(x)) + p_g (x) \\log(1 - D^*(x)) \\bigg) dx \\\\\n",
+ "&= \\log \\frac{1}{2} \\int_x p_{r}(x) dx + \\log \\frac{1}{2} \\int_x p_g(x) dx \\\\\n",
+ "&= -2\\log2\n",
+ "\\end{align*}\n",
+ "$$"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "97400366",
+ "metadata": {
+ "editable": true
+ },
+ "source": [
+ "## What does the Loss Function Represent?\n",
+ "\n",
+ "The JS divergence between $p_{r}$ and $p_g$ can be computed as:"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f12302f0",
+ "metadata": {
+ "editable": true
+ },
+ "source": [
+ "$$\n",
+ "\\begin{align*}\n",
+ "D_{JS}(p_{r} \\| p_g) \n",
+ "=& \\frac{1}{2} D_{KL}(p_{r} || \\frac{p_{r} + p_g}{2}) + \\frac{1}{2} D_{KL}(p_{g} || \\frac{p_{r} + p_g}{2}) \\\\\n",
+ "=& \\frac{1}{2} \\bigg( \\log2 + \\int_x p_{r}(x) \\log \\frac{p_{r}(x)}{p_{r} + p_g(x)} dx \\bigg) + \\\\& \\frac{1}{2} \\bigg( \\log2 + \\int_x p_g(x) \\log \\frac{p_g(x)}{p_{r} + p_g(x)} dx \\bigg) \\\\\n",
+ "=& \\frac{1}{2} \\bigg( \\log4 + L(G, D^*) \\bigg)\n",
+ "\\end{align*}\n",
+ "$$"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b636eff3",
+ "metadata": {
+ "editable": true
+ },
+ "source": [
+ "Thus,"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c20b731a",
+ "metadata": {
+ "editable": true
+ },
+ "source": [
+ "$$\n",
+ "L(G, D^*) = 2D_{JS}(p_{r} \\| p_g) - 2\\log2\n",
+ "$$"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ba5eb4c6",
+ "metadata": {
+ "editable": true
+ },
+ "source": [
+ "Essentially the loss function of GAN quantifies the similarity between\n",
+ "the generative data distribution $p_g$ and the real sample\n",
+ "distribution $p_{r}$ by JS divergence when the discriminator is\n",
+ "optimal. The best $G^*$ that replicates the real data distribution\n",
+ "leads to the minimum $L(G^*, D^*) = -2\\log2$ which is aligned with\n",
+ "equations above."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6ccd7370",
"metadata": {
"editable": true
},
@@ -1513,7 +1849,7 @@
},
{
"cell_type": "markdown",
- "id": "c0522fb3",
+ "id": "7bf93525",
"metadata": {
"editable": true
},
@@ -1527,7 +1863,7 @@
},
{
"cell_type": "markdown",
- "id": "fc2b9ad9",
+ "id": "e1d7f1d1",
"metadata": {
"editable": true
},
@@ -1543,7 +1879,7 @@
},
{
"cell_type": "markdown",
- "id": "45c07637",
+ "id": "1c26f00a",
"metadata": {
"editable": true
},
@@ -1554,7 +1890,7 @@
{
"cell_type": "code",
"execution_count": 1,
- "id": "b4dc82a1",
+ "id": "3f9bdab9",
"metadata": {
"collapsed": false,
"editable": true
@@ -1579,7 +1915,7 @@
},
{
"cell_type": "markdown",
- "id": "f87b63c1",
+ "id": "280dbb97",
"metadata": {
"editable": true
},
@@ -1590,7 +1926,7 @@
{
"cell_type": "code",
"execution_count": 2,
- "id": "139f0467",
+ "id": "7e0df89a",
"metadata": {
"collapsed": false,
"editable": true
@@ -1639,7 +1975,7 @@
},
{
"cell_type": "markdown",
- "id": "2c1f7b6f",
+ "id": "3ec46b09",
"metadata": {
"editable": true
},
@@ -1650,7 +1986,7 @@
{
"cell_type": "code",
"execution_count": 3,
- "id": "ad486a71",
+ "id": "309e2e3a",
"metadata": {
"collapsed": false,
"editable": true
@@ -1679,7 +2015,7 @@
{
"cell_type": "code",
"execution_count": 4,
- "id": "632212e4",
+ "id": "646796ec",
"metadata": {
"collapsed": false,
"editable": true
@@ -1696,7 +2032,7 @@
},
{
"cell_type": "markdown",
- "id": "e204ed36",
+ "id": "7a0473da",
"metadata": {
"editable": true
},
@@ -1707,7 +2043,7 @@
{
"cell_type": "code",
"execution_count": 5,
- "id": "0ca2ac08",
+ "id": "b1f5b069",
"metadata": {
"collapsed": false,
"editable": true
@@ -1735,7 +2071,7 @@
},
{
"cell_type": "markdown",
- "id": "1791b464",
+ "id": "332100d5",
"metadata": {
"editable": true
},
@@ -1746,7 +2082,7 @@
{
"cell_type": "code",
"execution_count": 6,
- "id": "386b381a",
+ "id": "4d6be600",
"metadata": {
"collapsed": false,
"editable": true
@@ -1787,7 +2123,7 @@
},
{
"cell_type": "markdown",
- "id": "f28eaf09",
+ "id": "2f8ab669",
"metadata": {
"editable": true
},
@@ -1798,7 +2134,7 @@
{
"cell_type": "code",
"execution_count": 7,
- "id": "352db36d",
+ "id": "c858d332",
"metadata": {
"collapsed": false,
"editable": true
@@ -1823,7 +2159,7 @@
},
{
"cell_type": "markdown",
- "id": "d02e0c9a",
+ "id": "6cb82728",
"metadata": {
"editable": true
},
@@ -1834,7 +2170,7 @@
{
"cell_type": "code",
"execution_count": 8,
- "id": "3c157e73",
+ "id": "ec774a9e",
"metadata": {
"collapsed": false,
"editable": true
@@ -1916,7 +2252,7 @@
{
"cell_type": "code",
"execution_count": 9,
- "id": "639686f4",
+ "id": "67c4d3de",
"metadata": {
"collapsed": false,
"editable": true
@@ -1964,7 +2300,7 @@
},
{
"cell_type": "markdown",
- "id": "4a7df60c",
+ "id": "9156f147",
"metadata": {
"editable": true
},
@@ -1975,7 +2311,7 @@
{
"cell_type": "code",
"execution_count": 10,
- "id": "c575fd26",
+ "id": "42fef552",
"metadata": {
"collapsed": false,
"editable": true
@@ -2012,7 +2348,7 @@
{
"cell_type": "code",
"execution_count": 11,
- "id": "15299086",
+ "id": "4514f6ff",
"metadata": {
"collapsed": false,
"editable": true
@@ -2043,7 +2379,7 @@
},
{
"cell_type": "markdown",
- "id": "f4b675b8",
+ "id": "3f004fd2",
"metadata": {
"editable": true
},
@@ -2054,7 +2390,7 @@
{
"cell_type": "code",
"execution_count": 12,
- "id": "4f8dd76d",
+ "id": "3dcb337e",
"metadata": {
"collapsed": false,
"editable": true
@@ -2109,7 +2445,7 @@
},
{
"cell_type": "markdown",
- "id": "604b1a6b",
+ "id": "57bbb2c1",
"metadata": {
"editable": true
},
@@ -2126,7 +2462,7 @@
},
{
"cell_type": "markdown",
- "id": "69adfd20",
+ "id": "5dc8db83",
"metadata": {
"editable": true
},
@@ -2151,7 +2487,7 @@
},
{
"cell_type": "markdown",
- "id": "db15b3f4",
+ "id": "61f738e9",
"metadata": {
"editable": true
},
@@ -2169,7 +2505,7 @@
},
{
"cell_type": "markdown",
- "id": "8feb45ca",
+ "id": "58d4283f",
"metadata": {
"editable": true
},
@@ -2190,7 +2526,7 @@
},
{
"cell_type": "markdown",
- "id": "72d44bc3",
+ "id": "b39700a0",
"metadata": {
"editable": true
},
diff --git a/doc/pub/week15/pdf/week15.pdf b/doc/pub/week15/pdf/week15.pdf
index 4aa89051..15302978 100644
Binary files a/doc/pub/week15/pdf/week15.pdf and b/doc/pub/week15/pdf/week15.pdf differ
diff --git a/doc/src/week15/week15.do.txt b/doc/src/week15/week15.do.txt
index 1862c953..7ba1a482 100644
--- a/doc/src/week15/week15.do.txt
+++ b/doc/src/week15/week15.do.txt
@@ -713,6 +713,140 @@ another around in the parameter space indefinitely.
FIGURE: [figures/figure3.png, width=900 frac=1.0]
+!split
+===== Generative Adversarial Networks =====
+Generative adversarial networks (GANs) have shown great results in
+many generative tasks that replicate rich real-world content such as
+images, human language, and music. The approach is inspired by game theory: two
+models, a generator and a discriminator, are competing with each other while
+making each other stronger at the same time. However, it is rather
+challenging to train a GAN model; training often suffers from
+instability or failure to converge.
+
+
+Generative adversarial networks consist of two models (in their simplest form, two opposing feed-forward neural networks):
+o A discriminator $D$ estimates the probability of a given sample coming from the real dataset. It works as a critic and is optimized to tell the fake samples from the real ones.
+o A generator $G$ outputs synthetic samples given a noise variable input $z$ ($z$ brings in potential output diversity). It is trained to capture the real data distribution in order to generate samples that are as realistic as possible, or in other words, that can trick the discriminator into assigning them a high probability.
+
+At the end of the training, the generator can be used to generate,
+for example, new images. In this sense we have trained a model which
+can produce new samples. We say that we have implicitly defined a
+probability distribution.
+
+
+These two models compete against each other during the training
+process: the generator $G$ is trying hard to trick the discriminator,
+while the critic model $D$ is trying hard not to be cheated. This
+zero-sum game between the two models pushes both to improve
+their capabilities.
+
+Given:
+* $p_{z}$: the distribution of the noise input $z$ (usually just uniform)
+* $p_{g}$: the generator's distribution over the data $x$
+* $p_{r}$: the distribution of the real samples $x$
+
+On one hand, we want to make sure the discriminator $D$'s decisions
+over real data are accurate by maximizing $\mathbb{E}_{x \sim
+p_{r}(x)} [\log D(x)]$. Meanwhile, given a fake sample $G(z), z \sim
+p_z(z)$, the discriminator is expected to output a probability,
+$D(G(z))$, close to zero by maximizing $\mathbb{E}_{z \sim p_{z}(z)}
+[\log (1 - D(G(z)))]$.
+
+On the other hand, the generator is trained to increase the chances of
+$D$ producing a high probability for a fake example, thus to minimize
+$\mathbb{E}_{z \sim p_{z}(z)} [\log (1 - D(G(z)))]$.
+
+When combining both aspects together, $D$ and $G$ are playing a _minimax game_ in which we should optimize the following loss function:
+
+!bt
+\[
+\begin{aligned}
+\min_G \max_D L(D, G)
+& = \mathbb{E}_{x \sim p_{r}(x)} [\log D(x)] + \mathbb{E}_{z \sim p_z(z)} [\log(1 - D(G(z)))] \\
+& = \mathbb{E}_{x \sim p_{r}(x)} [\log D(x)] + \mathbb{E}_{x \sim p_g(x)} [\log(1 - D(x)]
+\end{aligned}
+\]
+!et
+where $\mathbb{E}_{x \sim p_{r}(x)} [\log D(x)]$ has no impact on $G$ during gradient descent updates.
+
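+To make the two objectives concrete, here is a minimal NumPy sketch of
+the two loss terms for one batch. The arrays d_real and d_fake are
+hypothetical placeholders for the discriminator outputs $D(x)$ and
+$D(G(z))$, assumed to lie in $(0,1)$:
+
+!bc pycod
+import numpy as np
+
+def discriminator_loss(d_real, d_fake):
+    # D maximizes E[log D(x)] + E[log(1 - D(G(z)))]; as is customary,
+    # we minimize the negative of that sum instead
+    return -np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake))
+
+def generator_loss(d_fake):
+    # G minimizes E[log(1 - D(G(z)))]
+    return np.mean(np.log(1.0 - d_fake))
+!ec
+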
+!split
+===== Optimal value for $D$ =====
+
+Now we have a well-defined loss function. Let's first examine what is the best value for $D$.
+
+!bt
+\[
+L(G, D) = \int_x \bigg( p_{r}(x) \log(D(x)) + p_g (x) \log(1 - D(x)) \bigg) dx
+\]
+!et
+
+Since we are interested in what is the best value of $D(x)$ to maximize $L(G, D)$, let us label
+
+!bt
+\[
+\tilde{x} = D(x),
+A=p_{r}(x),
+B=p_g(x)
+\]
+!et
+
+The integrand (which we can maximize pointwise for each $x$, so we can safely ignore the integral) is:
+
+!bt
+\begin{align*}
+f(\tilde{x})
+& = A log\tilde{x} + B log(1-\tilde{x}) \\
+\frac{d f(\tilde{x})}{d \tilde{x}}
+& = A \frac{1}{ln10} \frac{1}{\tilde{x}} - B \frac{1}{ln10} \frac{1}{1 - \tilde{x}} \\
+& = \frac{1}{ln10} (\frac{A}{\tilde{x}} - \frac{B}{1-\tilde{x}}) \\
+& = \frac{1}{ln10} \frac{A - (A + B)\tilde{x}}{\tilde{x} (1 - \tilde{x})} \\
+\end{align*}
+!et
+
+Thus, setting $\frac{d f(\tilde{x})}{d \tilde{x}} = 0$, we obtain the optimal value of the discriminator: $D^*(x) = \tilde{x}^* = \frac{A}{A + B} = \frac{p_{r}(x)}{p_{r}(x) + p_g(x)} \in [0, 1]$.
+Once the generator is trained to its optimum, $p_g$ gets very close to $p_{r}$. When $p_g = p_{r}$, $D^*(x)$ becomes $1/2$.
+
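+As a quick numerical sanity check of this result (a sketch with
+arbitrarily chosen values for the densities $A = p_r(x)$ and $B = p_g(x)$
+at a fixed $x$):
+
+!bc pycod
+import numpy as np
+
+A, B = 0.7, 0.3                          # p_r(x) and p_g(x) at some fixed x
+xt = np.linspace(1e-6, 1.0 - 1e-6, 100001)
+f = A*np.log(xt) + B*np.log(1.0 - xt)    # the integrand f(xtilde)
+print(xt[np.argmax(f)])                  # approximately A/(A + B) = 0.7
+!ec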
+
+When both $G$ and $D$ are at their optimal values, we have $p_g = p_{r}$ and $D^*(x) = 1/2$, and the loss function becomes:
+
+!bt
+\begin{align*}
+L(G, D^*)
+&= \int_x \bigg( p_{r}(x) \log(D^*(x)) + p_g (x) \log(1 - D^*(x)) \bigg) dx \\
+&= \log \frac{1}{2} \int_x p_{r}(x) dx + \log \frac{1}{2} \int_x p_g(x) dx \\
+&= -2\log2
+\end{align*}
+!et
+
+!split
+===== What does the Loss Function Represent? =====
+
+The JS divergence between $p_{r}$ and $p_g$ can be computed as:
+
+!bt
+\begin{align*}
+D_{JS}(p_{r} \| p_g)
+=& \frac{1}{2} D_{KL}(p_{r} || \frac{p_{r} + p_g}{2}) + \frac{1}{2} D_{KL}(p_{g} || \frac{p_{r} + p_g}{2}) \\
+=& \frac{1}{2} \bigg( \log2 + \int_x p_{r}(x) \log \frac{p_{r}(x)}{p_{r} + p_g(x)} dx \bigg) + \\& \frac{1}{2} \bigg( \log2 + \int_x p_g(x) \log \frac{p_g(x)}{p_{r} + p_g(x)} dx \bigg) \\
+=& \frac{1}{2} \bigg( \log4 + L(G, D^*) \bigg)
+\end{align*}
+!et
+
+Thus,
+
+!bt
+\[
+L(G, D^*) = 2D_{JS}(p_{r} \| p_g) - 2\log2
+\]
+!et
+
+Essentially, the loss function of a GAN quantifies the similarity between
+the generated data distribution $p_g$ and the real sample
+distribution $p_{r}$ via the JS divergence when the discriminator is
+optimal. The best $G^*$ that replicates the real data distribution
+leads to the minimum $L(G^*, D^*) = -2\log2$, which agrees with the
+equations above.
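+
+A small numerical illustration of this identity (a sketch, using two
+discrete toy distributions in place of $p_r$ and $p_g$):
+
+!bc pycod
+import numpy as np
+
+p_r = np.array([0.1, 0.4, 0.5])
+p_g = np.array([0.3, 0.3, 0.4])
+m = 0.5*(p_r + p_g)
+
+kl = lambda p, q: np.sum(p*np.log(p/q))
+d_js = 0.5*kl(p_r, m) + 0.5*kl(p_g, m)
+
+# loss at the optimal discriminator D*(x) = p_r/(p_r + p_g)
+d_star = p_r/(p_r + p_g)
+loss = np.sum(p_r*np.log(d_star) + p_g*np.log(1.0 - d_star))
+print(np.isclose(loss, 2*d_js - 2*np.log(2)))  # should print True
+!ec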
!split
@@ -1210,460 +1344,5 @@ of arbitrary form.
-Generative adversarial networks (GANs) have shown great results in
-many generative tasks to replicate the real-world rich content such as
-images, human language, and music. It is inspired by game theory: two
-models, a generator and a discriminator, are competing with each other while
-making each other stronger at the same time. However, it is rather
-challenging to train a GANs model,
-training instability or failure to converge.
-
-
-
-\section{Generative Adversarial Network}
-
-GAN consists of two models:
-\begin{itemize}
- \item A discriminator $D$ estimates the probability of a given sample coming from the real dataset. It works as a critic and is optimized to tell the fake samples from the real ones.
- \item A generator $G$ outputs synthetic samples given a noise variable input $z$ ($z$ brings in potential output diversity). It is trained to capture the real data distribution so that its generative samples can be as real as possible, or in other words, can trick the discriminator to offer a high probability.
-\end{itemize}
-
-
-These two models compete against each other during the training process: the generator $G$ is trying hard to trick the discriminator, while the critic model $D$ is trying hard not to be cheated. This interesting zero-sum game between two models motivates both to improve their functionalities.
-
-Given,
-
-\begin{table}[h!]
- \centering
- \begin{tabular}{c|l|l}
- \hline
- \textbf{Symbol} & \textbf{Meaning} & \textbf{Notes}\\
- \hline
- $p_{z}$ & Data distribution over noise input $z$ & Usually, just uniform. \\
- $p_{g}$ & The generator's distribution over data $x$ & \\
- $p_{r}$ & Data distribution over real sample $x$ & \\
- \hline
- \end{tabular}
-\end{table}
-
-On one hand, we want to make sure the discriminator $D$'s decisions over real data are accurate by maximizing $\mathbb{E}_{x \sim p_{r}(x)} [\log D(x)]$. Meanwhile, given a fake sample $G(z), z \sim p_z(z)$, the discriminator is expected to output a probability, $D(G(z))$, close to zero by maximizing $\mathbb{E}_{z \sim p_{z}(z)} [\log (1 - D(G(z)))]$.
-
-On the other hand, the generator is trained to increase the chances of $D$ producing a high probability for a fake example, thus to minimize $\mathbb{E}_{z \sim p_{z}(z)} [\log (1 - D(G(z)))]$.
-
-When combining both aspects together, $D$ and $G$ are playing a \textit{minimax game} in which we should optimize the following loss function:
-
-\[
-\begin{aligned}
-\min_G \max_D L(D, G)
-& = \mathbb{E}_{x \sim p_{r}(x)} [\log D(x)] + \mathbb{E}_{z \sim p_z(z)} [\log(1 - D(G(z)))] \\
-& = \mathbb{E}_{x \sim p_{r}(x)} [\log D(x)] + \mathbb{E}_{x \sim p_g(x)} [\log(1 - D(x)]
-\end{aligned}
-\]
-
-where $\mathbb{E}_{x \sim p_{r}(x)} [\log D(x)]$ has no impact on $G$ during gradient descent updates.
-
-
-\subsection{What is the Optimal Value for D?}
-
-Now we have a well-defined loss function. Let's first examine what is the best value for $D$.
-
-\[
-L(G, D) = \int_x \bigg( p_{r}(x) \log(D(x)) + p_g (x) \log(1 - D(x)) \bigg) dx
-\]
-
-Since we are interested in what is the best value of $D(x)$ to maximize $L(G, D)$, let us label
-
-\[
-\tilde{x} = D(x),
-A=p_{r}(x),
-B=p_g(x)
-\]
-
-And then what is inside the integral (we can safely ignore the integral because $x$ is sampled over all the possible values) is:
-
-\begin{align*}
-f(\tilde{x})
-& = A log\tilde{x} + B log(1-\tilde{x}) \\
-\frac{d f(\tilde{x})}{d \tilde{x}}
-& = A \frac{1}{ln10} \frac{1}{\tilde{x}} - B \frac{1}{ln10} \frac{1}{1 - \tilde{x}} \\
-& = \frac{1}{ln10} (\frac{A}{\tilde{x}} - \frac{B}{1-\tilde{x}}) \\
-& = \frac{1}{ln10} \frac{A - (A + B)\tilde{x}}{\tilde{x} (1 - \tilde{x})} \\
-\end{align*}
-
-
-Thus, set $\frac{d f(\tilde{x})}{d \tilde{x}} = 0$, we get the best value of the discriminator: $D^*(x) = \tilde{x}^* = \frac{A}{A + B} = \frac{p_{r}(x)}{p_{r}(x) + p_g(x)} \in [0, 1]$.
-Once the generator is trained to its optimal, $p_g$ gets very close to $p_{r}$. When $p_g = p_{r}$, $D^*(x)$ becomes $1/2$.
-
-
-\subsection{What is the Global Optimal? }
-
-When both $G$ and $D$ are at their optimal values, we have $p_g = p_{r}$ and $D^*(x) = 1/2$ and the loss function becomes:
-
-\begin{align*}
-L(G, D^*)
-&= \int_x \bigg( p_{r}(x) \log(D^*(x)) + p_g (x) \log(1 - D^*(x)) \bigg) dx \\
-&= \log \frac{1}{2} \int_x p_{r}(x) dx + \log \frac{1}{2} \int_x p_g(x) dx \\
-&= -2\log2
-\end{align*}
-
-
-\subsection{What does the Loss Function Represent?}
-
-According to the formula listed in Sec.~\ref{sec:kl_and_js}, JS divergence between $p_{r}$ and $p_g$ can be computed as:
-
-\begin{align*}
-D_{JS}(p_{r} \| p_g)
-=& \frac{1}{2} D_{KL}(p_{r} || \frac{p_{r} + p_g}{2}) + \frac{1}{2} D_{KL}(p_{g} || \frac{p_{r} + p_g}{2}) \\
-=& \frac{1}{2} \bigg( \log2 + \int_x p_{r}(x) \log \frac{p_{r}(x)}{p_{r} + p_g(x)} dx \bigg) + \\& \frac{1}{2} \bigg( \log2 + \int_x p_g(x) \log \frac{p_g(x)}{p_{r} + p_g(x)} dx \bigg) \\
-=& \frac{1}{2} \bigg( \log4 + L(G, D^*) \bigg)
-\end{align*}
-
-Thus,
-
-\[
-L(G, D^*) = 2D_{JS}(p_{r} \| p_g) - 2\log2
-\]
-
-Essentially the loss function of GAN quantifies the similarity between the generative data distribution $p_g$ and the real sample distribution $p_{r}$ by JS divergence when the discriminator is optimal. The best $G^*$ that replicates the real data distribution leads to the minimum $L(G^*, D^*) = -2\log2$ which is aligned with equations above.
-
-
-\textbf{Other Variations of GAN}: There are many variations of GANs in different contexts or designed for different tasks. For example, for semi-supervised learning, one idea is to update the discriminator to output real class labels, $1, \dots, K-1$, as well as one fake class label $K$. The generator model aims to trick the discriminator to output a classification label smaller than $K$.
-
-
-\section{Problems in GANs}
-
-Although GAN has shown great success in the realistic image generation, the training is not easy; The process is known to be slow and unstable.
-
-
-\subsection{Hard to Achieve Nash Equilibrium}
-
-\cite{salimans2016nips} discussed the problem with GAN's gradient-descent-based training procedure. Two models are trained simultaneously to find a Nash equilibrium to a two-player non-cooperative game. However, each model updates its cost independently with no respect to another player in the game. Updating the gradient of both models concurrently cannot guarantee a convergence.
-
-Let's check out a simple example to better understand why it is difficult to find a Nash equilibrium in an non-cooperative game. Suppose one player takes control of $x$ to minimize $f_1(x) = xy$, while at the same time the other player constantly updates $y$ to minimize $f_2(y) = -xy$.
-
-Because $\frac{\partial f_1}{\partial x} = y$ and $\frac{\partial f_2}{\partial y} = -x$, we update $x$ with $x-\eta \cdot y$ and $y$ with $y+ \eta \cdot x$ simultaneously in one iteration, where $\eta$ is the learning rate. Once $x$ and $y$ have different signs, every following gradient update causes huge oscillation and the instability gets worse in time, as shown in Fig. 3.
-
-\begin{figure}[!htb]
- \centering
- \includegraphics[width=0.8\linewidth]{nash_equilibrium.png}
- \caption{A simulation of our example for updating $x$ to minimize $xy$ and updating $y$ to minimize $-xy$. The learning rate $\eta = 0.1$. With more iterations, the oscillation grows more and more unstable.}
- \label{fig:fig3}
-\end{figure}
-
-
-\subsection{Low Dimensional Supports}
-\label{sec:low_dimensional_supports}
-
-\begin{table}[h!]
- \centering
- \begin{tabular}{c|p{10cm}}
- \hline
- \textbf{Term} & \textbf{Explanation} \\
- \hline
- Manifold & A topological space that locally resembles Euclidean space near each point. Precisely, when this Euclidean space is of dimension $n$, the manifold is referred as $n$-manifold. \\
- Support & A real-valued function $f$ is the subset of the domain containing those elements which are not mapped to zero.\\
- \hline
- \end{tabular}
-\end{table}
-
-\cite{arjovsky2017} discussed the problem of the supports of $p_r$ and $p_g$ lying on low dimensional manifolds and how it contributes to the instability of GAN training thoroughly.
-
-The dimensions of many real-world datasets, as represented by $p_r$, only appear to be \textit{artificially high}. They have been found to concentrate in a lower dimensional manifold. This is actually the fundamental assumption for \textit{Manifold Learning}. Thinking of the real world images, once the theme or the contained object is fixed, the images have a lot of restrictions to follow, i.e., a dog should have two ears and a tail, and a skyscraper should have a straight and tall body, etc. These restrictions keep images away from the possibility of having a high-dimensional free form.
-
-$p_g$ lies in a low dimensional manifolds, too. Whenever the generator is asked to a much larger image like 64x64 given a small dimension, such as 100, noise variable input $z$, the distribution of colors over these 4096 pixels has been defined by the small 100-dimension random number vector and can hardly fill up the whole high dimensional space.
-
-Because both $p_g$ and $p_r$ rest in low dimensional manifolds, they are almost certainly gonna be disjoint (See Fig.~\ref{fig:fig4}). When they have disjoint supports, we are always capable of finding a perfect discriminator that separates real and fake samples 100\% correctly.~\cite{arjovsky2017}
-
-
-\begin{figure}[!htb]
- \centering
- \includegraphics[width=\linewidth]{low_dim_manifold.png}
- \caption{Low dimensional manifolds in high dimension space can hardly have overlaps. (Left) Two lines in a three-dimension space. (Right) Two surfaces in a three-dimension space.}
- \label{fig:fig4}
-\end{figure}
-
-
-\subsection{Vanishing Gradient}
-
-When the discriminator is perfect, we are guaranteed with $D(x) = 1, \forall x \in p_r$ and $D(x) = 0, \forall x \in p_g$. Therefore the loss function $L$ falls to zero and we end up with no gradient to update the loss during learning iterations. Fig. 5 demonstrates an experiment when the discriminator gets better, the gradient vanishes fast.
-
-
-\begin{figure}[!htb]
- \centering
- \includegraphics[width=0.6\linewidth]{GAN_vanishing_gradient.png}
- \caption{First, a DCGAN is trained for 1, 10 and 25 epochs. Then, with the generator \textit{fixed}, a discriminator is trained from scratch and measure the gradients with the original cost function. We see the gradient norms \textit{decay quickly} (in log scale), in the best case 5 orders of magnitude after 4000 discriminator iterations. (Image source:~\cite{arjovsky2017}).}
- \label{fig:fig5}
-\end{figure}
-
-
-As a result, training a GAN faces an dilemma:
-\begin{itemize}
- \item If the discriminator behaves badly, the generator does not have accurate feedback and the loss function cannot represent the reality.
- \item If the discriminator does a great job, the gradient of the loss function drops down to close to zero and the learning becomes super slow or even jammed.
-\end{itemize}
-
-This dilemma clearly is capable to make the GAN training very tough.
-
-
-\subsection{Mode Collapse}
-
-During the training, the generator may collapse to a setting where it always produces same outputs. This is a common failure case for GANs, commonly referred to as \textit{Mode Collapse}. Even though the generator might be able to trick the corresponding discriminator, it fails to learn to represent the complex real-world data distribution and gets stuck in a small space with extremely low variety.
-
-\begin{figure}[!htb]
- \centering
- \includegraphics[width=\linewidth]{mode_collapse.png}
- \caption{A DCGAN model is trained with an MLP network with 4 layers, 512 units and ReLU activation function, configured to lack a strong inductive bias for image generation. The results shows a significant degree of mode collapse. (Image source:~\cite{wgan2017}).}
- \label{fig:fig6}
-\end{figure}
-
-
-\subsection{Lack of a Proper Evaluation Metric}
-
-Generative adversarial networks are not born with a good objection function that can inform us the training progress. Without a good evaluation metric, it is like working in the dark. No good sign to tell when to stop; No good indicator to compare the performance of multiple models.
-
-
-\section{Improved GAN Training}
-
-The following suggestions are proposed to help stabilize and improve the training of GANs.
-
-First five methods are practical techniques to achieve faster convergence of GAN training~\cite{salimans2016nips}. The last two are proposed in~\cite{arjovsky2017} to solve the problem of disjoint distributions.
-
-(1) \textbf{Feature Matching}
-
-Feature matching suggests to optimize the discriminator to inspect whether the generator's output matches expected statistics of the real samples. In such a scenario, the new loss function is defined as $\| \mathbb{E}_{x \sim p_r} f(x) - \mathbb{E}_{z \sim p_z(z)}f(G(z)) \|_2^2 $, where $f(x)$ can be any computation of statistics of features, such as mean or median.
-
-(2) \textbf{Minibatch Discrimination}
-
-With minibatch discrimination, the discriminator is able to digest the relationship between training data points in one batch, instead of processing each point independently.
-
-In one minibatch, we approximate the closeness between every pair of samples, $c(x_i, x_j)$, and get the overall summary of one data point by summing up how close it is to other samples in the same batch, $o(x_i) = \sum_{j} c(x_i, x_j)$. Then $o(x_i)$ is explicitly added to the input of the model.
-
-(3) \textbf{Historical Averaging}
-
-For both models, add $ \| \Theta - \frac{1}{t} \sum_{i=1}^t \Theta_i \|^2 $ into the loss function, where $\Theta$ is the model parameter and $\Theta_i$ is how the parameter is configured at the past training time $i$. This addition piece penalizes the training speed when $\Theta$ is changing too dramatically in time.
-
-(4) \textbf{One-sided Label Smoothing}
-
-When feeding the discriminator, instead of providing 1 and 0 labels, use soften values such as 0.9 and 0.1. It is shown to reduce the networks' vulnerability.
-
-(5) \textbf{Virtual Batch Normalization (VBN)}
-
-Each data sample is normalized based on a fixed batch (\textit{"reference batch"}) of data rather than within its minibatch. The reference batch is chosen once at the beginning and stays the same through the training.
-
-(6) \textbf{Adding Noises}
-
-Based on the discussion in Sec.~\ref{sec:low_dimensional_supports}, we now know $p_r$ and $p_g$ are disjoint in a high dimensional space and it causes the problem of vanishing gradient. To artificially "spread out" the distribution and to create higher chances for two probability distributions to have overlaps, one solution is to add continuous noises onto the inputs of the discriminator $D$.
-
-(7) \textbf{Use Better Metric of Distribution Similarity}
-
-The loss function of the vanilla GAN measures the JS divergence between the distributions of $p_r$ and $p_g$. This metric fails to provide a meaningful value when two distributions are disjoint.
-
-Wasserstein metric is proposed to replace JS divergence because it has a much smoother value space. See more in the next section.
-
-
-\section{Wasserstein GAN (WGAN)}
-
-\subsection{What is Wasserstein Distance?}
-
-\textit{Wasserstein Distance} is a measure of the distance between two probability distributions.
-It is also called \textit{Earth Mover's distance}, short for EM distance, because informally it can be interpreted as the minimum energy cost of moving and transforming a pile of dirt in the shape of one probability distribution to the shape of the other distribution. The cost is quantified by: the amount of dirt moved x the moving distance.
-
-Let us first look at a simple case where the probability domain is discrete. For example, suppose we have two distributions $P$ and $Q$, each has four piles of dirt and both have ten shovelfuls of dirt in total. The numbers of shovelfuls in each dirt pile are assigned as follows:
-
-\[
-P_1 = 3, P_2 = 2, P_3 = 1, P_4 = 4\\
-Q_1 = 1, Q_2 = 2, Q_3 = 4, Q_4 = 3
-\]
-
-In order to change $P$ to look like $Q$, as illustrated in Fig.~\ref{fig:fig7}, we:
-\begin{itemize}
- \item First move 2 shovelfuls from $P_1$ to $P_2$ => $(P_1, Q_1)$ match up.
- \item Then move 2 shovelfuls from $P_2$ to $P_3$ => $(P_2, Q_2)$ match up.
- \item Finally move 1 shovelfuls from $Q_3$ to $Q_4$ => $(P_3, Q_3)$ and $(P_4, Q_4)$ match up.
-\end{itemize}
-
-\begin{figure}[!htb]
- \centering
- \includegraphics[width=\linewidth]{EM_distance_discrete.png}
- \caption{Step-by-step plan of moving dirt between piles in $P$ and $Q$ to make them match.}
- \label{fig:fig7}
-\end{figure}
-
-If we label the cost to pay to make $P_i$ and $Q_i$ match as $\delta_i$, we would have $\delta_{i+1} = \delta_i + P_i - Q_i$ and in the example:
-
-\begin{align*}
-\delta_0 &= 0\\
-\delta_1 &= 0 + 3 - 1 = 2\\
-\delta_2 &= 2 + 2 - 2 = 2\\
-\delta_3 &= 2 + 1 - 4 = -1\\
-\delta_4 &= -1 + 4 - 3 = 0
-\end{align*}
-
-Finally the Earth Mover's distance is $W = \sum \vert \delta_i \vert = 5$.
-
-
-When dealing with the continuous probability domain, the distance formula becomes:
-
-\[
-W(p_r, p_g) = \inf_{\gamma \sim \Pi(p_r, p_g)} \mathbb{E}_{(x, y) \sim \gamma}[\| x-y \|]
-\]
-
-
-In the formula above, $\Pi(p_r, p_g)$ is the set of all possible joint probability distributions between $p_r$ and $p_g$. One joint distribution $\gamma \in \Pi(p_r, p_g)$ describes one dirt transport plan, same as the discrete example above, but in the continuous probability space. Precisely $\gamma(x, y)$ states the percentage of dirt should be transported from point $x$ to $y$ so as to make $x$ follows the same probability distribution of $y$. That's why the marginal distribution over $x$ adds up to $p_g$, $\sum_{x} \gamma(x, y) = p_g(y)$ (Once we finish moving the planned amount of dirt from every possible $x$ to the target $y$, we end up with exactly what $y$ has according to $p_g$.) and vice versa $\sum_{y} \gamma(x, y) = p_r(x)$.
-
-When treating $x$ as the starting point and $y$ as the destination, the total amount of dirt moved is $\gamma(x, y)$ and the traveling distance is $\| x-y \|$ and thus the cost is $\gamma(x, y) \cdot \| x-y \|$. The expected cost averaged across all the $(x,y)$ pairs can be easily computed as:
-
-\[
-\sum_{x, y} \gamma(x, y) \| x-y \|
-= \mathbb{E}_{x, y \sim \gamma} \| x-y \|
-\]
-
-Finally, we take the minimum one among the costs of all dirt moving solutions as the EM distance. In the definition of Wasserstein distance, the $\inf$ (infimum, also known as *greatest lower bound*) indicates that we are only interested in the smallest cost.
-
-
-\subsection{Why Wasserstein is better than JS or KL Divergence?}
-
-Even when two distributions are located in lower dimensional manifolds without overlaps, Wasserstein distance can still provide a meaningful and smooth representation of the distance in-between.
-
-The WGAN paper exemplified the idea with a simple example.
-
-Suppose we have two probability distributions, $P$ and $Q$:
-
-\[
-\forall (x, y) \in P, x = 0 \text{ and } y \sim U(0, 1)\\
-\forall (x, y) \in Q, x = \theta, 0 \leq \theta \leq 1 \text{ and } y \sim U(0, 1)\\
-\]
-
-
-\begin{figure}[!htb]
- \centering
- \includegraphics[width=0.7\linewidth]{wasserstein_simple_example.png}
- \caption{There is no overlap between $P$ and $Q$ when $\theta \neq 0$.}
- \label{fig:fig8}
-\end{figure}
-
-When $\theta \neq 0$:
-
-
-\begin{align*}
-D_{KL}(P \| Q) &= \sum_{x=0, y \sim U(0, 1)} 1 \cdot \log\frac{1}{0} = +\infty \\
-D_{KL}(Q \| P) &= \sum_{x=\theta, y \sim U(0, 1)} 1 \cdot \log\frac{1}{0} = +\infty \\
-D_{JS}(P, Q) &= \frac{1}{2}(\sum_{x=0, y \sim U(0, 1)} 1 \cdot \log\frac{1}{1/2} + \sum_{x=0, y \sim U(0, 1)} 1 \cdot \log\frac{1}{1/2}) = \log 2\\
-W(P, Q) &= |\theta|
-\end{align*}
-
-But when $\theta = 0$, two distributions are fully overlapped:
-
-\begin{align*}
-D_{KL}(P \| Q) &= D_{KL}(Q \| P) = D_{JS}(P, Q) = 0\\
-W(P, Q) &= 0 = \lvert \theta \rvert
-\end{align*}
-
-
-$D_{KL}$ gives us infinity when two distributions are disjoint. The value of $D_{JS}$ has sudden jump, not differentiable at $\theta = 0$. Only Wasserstein metric provides a smooth measure, which is super helpful for a stable learning process using gradient descents.
-
-
-\subsection{Use Wasserstein Distance as GAN Loss Function}
-
-It is intractable to exhaust all the possible joint distributions in $\Pi(p_r, p_g)$ to compute $\inf_{\gamma \sim \Pi(p_r, p_g)}$. Thus the authors proposed a smart transformation of the formula based on the Kantorovich-Rubinstein duality to:
-
-\[
-W(p_r, p_g) = \frac{1}{K} \sup_{\| f \|_L \leq K} \mathbb{E}_{x \sim p_r}[f(x)] - \mathbb{E}_{x \sim p_g}[f(x)]
-\]
-
-where $\sup$ (supremum) is the opposite of $inf$ (infimum); we want to measure the least upper bound or, in even simpler words, the maximum value.
-
-
-\subsubsection{Lipschitz Continuity}
-
-The function $f$ in the new form of Wasserstein metric is demanded to satisfy $\| f \|_L \leq K$, meaning it should be \textit{K-Lipschitz continuous}.
-
-A real-valued function $f: \mathbb{R} \rightarrow \mathbb{R}$ is called $K$-Lipschitz continuous if there exists a real constant $K \geq 0$ such that, for all $x_1, x_2 \in \mathbb{R}$,
-
-\[
-\lvert f(x_1) - f(x_2) \rvert \leq K \lvert x_1 - x_2 \rvert
-\]
-
-Here $K$ is known as a Lipschitz constant for function $f(.)$. Functions that are everywhere continuously differentiable is Lipschitz continuous, because the derivative, estimated as $\frac{\lvert f(x_1) - f(x_2) \rvert}{\lvert x_1 - x_2 \rvert}$, has bounds. However, a Lipschitz continuous function may not be everywhere differentiable, such as $f(x) = \lvert x \rvert$.
-
-Explaining how the transformation happens on the Wasserstein distance formula is worthy of a long post by itself, so I skip the details here. If you are interested in how to compute Wasserstein metric using linear programming, or how to transfer Wasserstein metric into its dual form according to the Kantorovich-Rubinstein Duality, read this awesome \href{https://vincentherrmann.github.io/blog/wasserstein/}{post}.
-
-
-\subsubsection{Wasserstein Loss Function}
-
-Suppose this function $f$ comes from a family of K-Lipschitz continuous functions, $\{ f_w \}_{w \in W}$, parameterized by $w$. In the modified Wasserstein-GAN, the "discriminator" model is used to learn $w$ to find a good $f_w$ and the loss function is configured as measuring the Wasserstein distance between $p_r$ and $p_g$.
-
-\[
-L(p_r, p_g) = W(p_r, p_g) = \max_{w \in W} \mathbb{E}_{x \sim p_r}[f_w(x)] - \mathbb{E}_{z \sim p_r(z)}[f_w(g_\theta(z))]
-\]
-Thus the "discriminator" is not a direct critic of telling the fake samples apart from the real ones anymore. Instead, it is trained to learn a $K$-Lipschitz continuous function to help compute Wasserstein distance. As the loss function decreases in the training, the Wasserstein distance gets smaller and the generator model's output grows closer to the real data distribution.
-
-One big problem is to maintain the $K$-Lipschitz continuity of $f_w$ during the training in order to make everything work out. The paper presents a simple but very practical trick: After every gradient update, clamp the weights $w$ to a small window, such as $[-0.01, 0.01]$, resulting in a compact parameter space $W$ and thus $f_w$ obtains its lower and upper bounds to preserve the Lipschitz continuity.
-
-
-\begin{figure}[!htb]
- \centering
- \includegraphics[width=0.85\linewidth]{WGAN_algorithm.png}
- \caption{Algorithm of Wasserstein generative adversarial network. (Image source:~\cite{wgan2017})}
- \label{fig:fig9}
-\end{figure}
-
-
-Compared to the original GAN algorithm, the WGAN undertakes the following changes:
-\begin{itemize}
- \item After every gradient update on the critic function, clamp the weights to a small fixed range, $[-c, c]$.
- \item Use a new loss function derived from the Wasserstein distance, no logarithm anymore. The "discriminator" model does not play as a direct critic but a helper for estimating the Wasserstein metric between real and generated data distribution.
- \item Empirically the authors recommended RMSProp optimizer on the critic, rather than a momentum based optimizer such as Adam which could cause instability in the model training. I haven't seen clear theoretical explanation on this point through.
-\end{itemize}
-
-Sadly, Wasserstein GAN is not perfect. Even the authors of the original WGAN paper mentioned that \textit{"Weight clipping is a clearly terrible way to enforce a Lipschitz constraint"}. WGAN still suffers from unstable training, slow convergence after weight clipping (when clipping window is too large), and vanishing gradients (when clipping window is too small).
-
-Some improvement, precisely replacing weight clipping with \textit{gradient penalty}, has been discussed in~\cite{wgan2017improve}.
-
-
-\bibliographystyle{plain}
-%\bibliography{../references}
-
-\begin{thebibliography}{1}
-
- \bibitem{gan2014}
- Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley,
- Sherjil Ozair, Aaron Courville, and Yoshua Bengio.
- \newblock Generative adversarial nets.
- \newblock In {\em NIPS}, pages 2672--2680. 2014.
-
- \bibitem{gan2015train}
- Ferenc Huszár.
- \newblock How (not) to train your generative model: Scheduled sampling,
- likelihood, adversary?
- \newblock {\em arXiv:1511.05101}, 2015.
-
- \bibitem{salimans2016nips}
- Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and
- Xi~Chen.
- \newblock Improved techniques for training gans.
- \newblock {\em NIPS}, 2016.
-
- \bibitem{arjovsky2017}
- Martin Arjovsky and Léon Bottou.
- \newblock Towards principled methods for training generative adversarial
- networks.
- \newblock {\em ICML}, 2017.
-
- \bibitem{wgan2017}
- Martin Arjovsky, Soumith Chintala, and Léon Bottou.
- \newblock Wasserstein gan.
- \newblock {\em arXiv:1701.07875}, 2017.
-
- \bibitem{wgan2017improve}
- Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron
- Courville.
- \newblock Improved training of wasserstein gans.
- \newblock {\em arXiv:1704.00028}, 2017.
-
-\end{thebibliography}
-
-
-\end{document}