Commit 0f25c0a ("update")
mhjensen committed May 1, 2024 (1 parent: d0e6c7e)
Showing 8 changed files with 1,425 additions and 624 deletions.

doc/pub/week15/html/week15-bs.html: 211 changes (205 additions, 6 deletions). Large diffs are not rendered by default.

doc/pub/week15/html/week15-reveal.html: 211 changes (205 additions, 6 deletions)
<h2 id="plans-for-the-week-of-april-29-may-3-2024">Plans for the week of April 29-May 3, 2024</h2>
<ol>
<p><li> Summary of Variational Autoencoders</li>
<p><li> Generative Adversarial Networks (GANs), see <a href="https://lilianweng.github.io/posts/2017-08-20-gan/" target="_blank"><tt>https://lilianweng.github.io/posts/2017-08-20-gan/</tt></a> for a nice overview</li>
<p><li> Start discussion of diffusion models, motivation</li>
<p><li> <a href="https://youtu.be/Cg8n9aWwHuU" target="_blank">Video of lecture</a></li>
<p><li> <a href="https://github.com/CompPhysics/AdvancedMachineLearning/blob/main/doc/HandwrittenNotes/2024/NotesApril30.pdf" target="_blank">Whiteboard notes</a></li>
</ol>
<h2 id="final-expression">Final expression </h2>
</section>

<section>
<h2 id="kullback-leibler-divergence">Kullback-Leibler divergence </h2>

<p>Before we continue, we need to remind ourselves about the
Kullback-Leibler divergence introduced earlier. This will also allow
us to introduce another measure used in connection with the training
of Generative Adversarial Networks, the so-called Jensen-Shannon divergence.
These metrics are useful for quantifying the similarity between two probability distributions.
</p>

<p>The Kullback&#8211;Leibler (KL) divergence, labeled \( D_{KL} \), measures how one probability distribution \( p \) diverges from a second expected probability distribution \( q \),
that is
</p>
<p>&nbsp;<br>
$$
D_{KL}(p \| q) = \int_x p(x) \log \frac{p(x)}{q(x)} dx.
$$
<p>&nbsp;<br>

<p>The KL divergence \( D_{KL} \) achieves its minimum, zero, when \( p(x) = q(x) \) everywhere.</p>

<p>Note that the KL divergence is asymmetric. In cases where \( p(x) \) is
close to zero but \( q(x) \) is significantly non-zero, the effect of \( q \)
is disregarded. This can give misleading results when we simply want to
measure the similarity between two equally important distributions.
</p>
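
<p>As a concrete illustration (our addition, not part of the original notes), the following minimal Python sketch evaluates \( D_{KL} \) for two discrete distributions on a common support and makes the asymmetry explicit:</p>

<pre><code class="python">
# Minimal sketch: D_KL(p || q) for discrete distributions given as
# probability arrays; assumes q > 0 wherever p > 0.
import numpy as np

def kl_divergence(p, q):
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # only points with p(x) > 0 contribute to the sum
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.8, 0.15, 0.05])

print(kl_divergence(p, q))  # positive
print(kl_divergence(q, p))  # a different value: the KL divergence is asymmetric
print(kl_divergence(p, p))  # 0.0: the minimum, attained when p = q everywhere
</code></pre>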
</section>

<section>
<h2 id="jensen-shannon-divergence">Jensen-Shannon divergence </h2>

<p>The Jensen&#8211;Shannon (JS) divergence is another measure of the similarity between
two probability distributions, bounded by \( [0, 1] \) (when the logarithm
is taken in base two). The JS divergence is symmetric and smoother than
the KL divergence. It is defined as
</p>
<p>&nbsp;<br>
$$
D_{JS}(p \| q) = \frac{1}{2} D_{KL}\left(p \| \frac{p + q}{2}\right) + \frac{1}{2} D_{KL}\left(q \| \frac{p + q}{2}\right).
$$
<p>&nbsp;<br>

<p>Many practitioners believe that one reason behind the great success
of GANs is the switch of the loss function from the asymmetric KL
divergence of the traditional maximum-likelihood approach to the
symmetric JS divergence.
</p>
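
<p>Building on the previous sketch (again our addition), the JS divergence follows directly from the KL divergence and the mixture \( (p+q)/2 \); the example below repeats the <tt>kl_divergence</tt> helper so that it runs on its own and verifies the symmetry:</p>

<pre><code class="python">
# Minimal sketch: D_JS(p || q), symmetric and bounded, built from D_KL.
import numpy as np

def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js_divergence(p, q):
    m = 0.5 * (np.asarray(p, dtype=float) + np.asarray(q, dtype=float))
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.8, 0.15, 0.05])
print(js_divergence(p, q), js_divergence(q, p))  # equal: D_JS is symmetric
</code></pre>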
</section>

<section>
<h2 id="generative-model-basic-overview-borrowed-from-rashcka-et-al">Generative model, basic overview (Borrowed from Rashcka et al) </h2>

<h2 id="the-derivation-from-last-week">The derivation from last week </h2>

$$
\begin{align*}
\log p(\boldsymbol{x}) & = \log p(\boldsymbol{x}) \int q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})dh && \text{(Multiply by $1 = \int q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})d\boldsymbol{h}$)}\\
& = \int q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})(\log p(\boldsymbol{x}))dh && \text{(Bring evidence into integral)}\\
& = \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log p(\boldsymbol{x})\right] && \text{(Definition of Expectation)}\\
& = \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log\frac{p(\boldsymbol{x}, \boldsymbol{h})}{p(\boldsymbol{h}|\boldsymbol{x})}\right]&& \\
& = \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log\frac{p(\boldsymbol{x}, \boldsymbol{h})q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}{p(\boldsymbol{h}|\boldsymbol{x})q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\right]&& \text{(Multiply by $1 = \frac{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}$)}\\
& = \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log\frac{p(\boldsymbol{x}, \boldsymbol{h})}{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\right] + \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log\frac{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}{p(\boldsymbol{h}|\boldsymbol{x})}\right] && \text{(Split the Expectation)}\\
& = \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log\frac{p(\boldsymbol{x}, \boldsymbol{h})}{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\right] +
D_{KL}(q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})\vert\vert p(\boldsymbol{h}|\boldsymbol{x})) && \text{(Definition of KL Divergence)}\\
& \geq \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log\frac{p(\boldsymbol{x}, \boldsymbol{h})}{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\right] && \text{(KL Divergence always $\geq 0$)}
\end{align*}
$$
<h2 id="dissecting-the-equations">Dissecting the equations </h2>

<p>&nbsp;<br>
$$
\begin{align*}
{\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log\frac{p(\boldsymbol{x}, \boldsymbol{h})}{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\right]}
&= {\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log\frac{p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{h})p(\boldsymbol{h})}{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\right]} && {\text{(Chain Rule of Probability)}}\\
&= {\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{h})\right] + \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log\frac{p(\boldsymbol{h})}{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\right]} && {\text{(Split the Expectation)}}\\
&= \underbrace{{\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{h})\right]}}_\text{reconstruction term} - \underbrace{{D_{KL}(q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\vert\vert{p(\boldsymbol{h}))}}_\text{prior matching term} && {\text{(Definition of KL Divergence)}}
\end{align*}
$$
<p>&nbsp;<br>
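
<p>For the common choice of a Gaussian encoder \( q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x}) \) with mean \( \boldsymbol{\mu} \) and diagonal variance \( \boldsymbol{\sigma}^2 \), together with a standard normal prior \( p(\boldsymbol{h}) \), the prior matching term has a simple closed form. A minimal sketch (our addition, assuming this Gaussian setup):</p>

<pre><code class="python">
# Closed-form D_KL( N(mu, diag(sigma^2)) || N(0, I) )
#   = -1/2 * sum(1 + log sigma^2 - mu^2 - sigma^2)
import numpy as np

def kl_gaussian_standard_normal(mu, log_var):
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

mu = np.array([0.5, -0.2])
log_var = np.array([0.0, -1.0])  # log sigma^2 per latent dimension
print(kl_gaussian_standard_normal(mu, log_var))
# The term vanishes when q equals the prior (mu = 0, sigma = 1):
print(kl_gaussian_standard_normal(np.zeros(2), np.zeros(2)))  # 0.0
</code></pre>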
<h2 id="analytical-evaluation">Analytical evaluation </h2>
<p>&nbsp;<br>
$$
\begin{align*}
\mathrm{argmax}_{\boldsymbol{\phi}, \boldsymbol{\theta}} \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{h})\right] - D_{KL}(q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})\vert\vert p(\boldsymbol{h})) \approx \mathrm{argmax}_{\boldsymbol{\phi}, \boldsymbol{\theta}} \sum_{l=1}^{L}\log p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{h}^{(l)}) - D_{KL}(q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})\vert\vert p(\boldsymbol{h}))
\end{align*}
$$
<p>&nbsp;<br>
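
<p>In code, this Monte Carlo estimate is typically combined with the reparametrization trick, sampling \( \boldsymbol{h}^{(l)} = \boldsymbol{\mu} + \boldsymbol{\sigma}\odot\boldsymbol{\epsilon}^{(l)} \) with \( \boldsymbol{\epsilon}^{(l)}\sim \mathcal{N}(0,I) \). A short sketch (our addition; the decoder log-likelihood is a hypothetical stand-in):</p>

<pre><code class="python">
# Monte Carlo estimate of E_{q_phi(h|x)}[ log p_theta(x|h) ] with
# reparametrized samples h = mu + sigma * eps, eps ~ N(0, I).
import numpy as np

rng = np.random.default_rng(0)

def log_p_x_given_h(x, h):
    # Hypothetical Gaussian decoder (identity mapping), for illustration only.
    return -0.5 * np.sum((x - h)**2)

mu, sigma = np.array([0.1, 0.3]), np.array([1.0, 0.5])
x = np.array([0.0, 0.2])

L = 1000  # number of Monte Carlo samples
estimates = [log_p_x_given_h(x, mu + sigma * rng.standard_normal(2))
             for _ in range(L)]
print(np.mean(estimates))  # approximates the reconstruction term
</code></pre>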
<h2 id="steps-in-building-a-gan-borrowed-from-rashcka-et-al">Steps in building a GAN (Borrowed from Raschka et al) </h2>
<br/><br/>
</section>

<section>
<h2 id="generative-adversarial-networks">Generative Adversarial Networks </h2>
<p>Generative adversarial networks (GANs) have shown great results in
many generative tasks that replicate real-world rich content such as
images, human language, and music. They are inspired by game theory: two
models, a generator and a discriminator, compete with each other while
making each other stronger at the same time. However, training a GAN
model is rather challenging, with training instability or failure to
converge as common problems.
</p>

<p>Generative adversarial networks consist of two models (in their simplest form, two opposing feedforward neural networks)</p>
<ol>
<p><li> A discriminator \( D \) estimates the probability of a given sample coming from the real dataset. It works as a critic and is optimized to tell the fake samples from the real ones</li>
<p><li> A generator \( G \) outputs synthetic samples given a noise variable input \( z \) (\( z \) brings in potential output diversity). It is trained to capture the real data distribution in order to generate samples that can be as real as possible, or in other words, can trick the discriminator to offer a high probability.</li>
</ol>
<p>At the end of the training, the generator can be used to generate, for
example, new images. In this sense we have trained a model which can
produce new samples. We say that we have implicitly defined a
probability distribution.
</p>

<p>These two models compete against each other during the training
process: the generator \( G \) is trying hard to trick the discriminator,
while the critic model \( D \) is trying hard not to be cheated. This
interesting zero-sum game between the two models drives both to improve.
</p>

<p>We use the following notation: \( p_{z} \) is the distribution of the
noise input \( z \) (usually just uniform), \( p_{g} \) is the generator's
distribution over the data \( x \), and \( p_{r} \) is the distribution of
the real samples \( x \).</p>

<p>On one hand, we want to make sure the discriminator \( D \)'s decisions
over real data are accurate by maximizing \( \mathbb{E}_{x \sim
p_{r}(x)} [\log D(x)] \). Meanwhile, given a fake sample \( G(z), z \sim
p_z(z) \), the discriminator is expected to output a probability,
\( D(G(z)) \), close to zero by maximizing \( \mathbb{E}_{z \sim p_{z}(z)}
[\log (1 - D(G(z)))] \).
</p>

<p>On the other hand, the generator is trained to increase the chances of
\( D \) producing a high probability for a fake example, that is, to minimize
\( \mathbb{E}_{z \sim p_{z}(z)} [\log (1 - D(G(z)))] \).
</p>

<p>When combining both aspects, \( D \) and \( G \) play a <em>minimax game</em> in which we optimize the following loss function:</p>

<p>&nbsp;<br>
$$
\begin{aligned}
\min_G \max_D L(D, G)
& = \mathbb{E}_{x \sim p_{r}(x)} [\log D(x)] + \mathbb{E}_{z \sim p_z(z)} [\log(1 - D(G(z)))] \\
& = \mathbb{E}_{x \sim p_{r}(x)} [\log D(x)] + \mathbb{E}_{x \sim p_g(x)} [\log(1 - D(x))]
\end{aligned}
$$
<p>&nbsp;<br>

<p>where \( \mathbb{E}_{x \sim p_{r}(x)} [\log D(x)] \) has no impact on \( G \) during gradient descent updates.</p>
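
<p>As a concrete illustration of this minimax objective, here is a minimal PyTorch training sketch (our addition; the toy data, network sizes and hyperparameters are illustrative assumptions, and the generator uses the common non-saturating variant of its loss):</p>

<pre><code class="python">
# Minimal GAN training loop on toy 2D data.
import torch
import torch.nn as nn

torch.manual_seed(0)
latent_dim, batch = 8, 64

# Generator G: noise z -> fake sample; discriminator D: sample -> P("real").
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, 2))
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(1000):
    x = torch.randn(batch, 2) + 2.0     # toy "real" data, p_r
    z = torch.randn(batch, latent_dim)  # noise input, p_z

    # Discriminator: maximize E[log D(x)] + E[log(1 - D(G(z)))],
    # i.e. minimize the two binary cross-entropy terms below.
    opt_D.zero_grad()
    loss_D = (bce(D(x), torch.ones(batch, 1)) +
              bce(D(G(z).detach()), torch.zeros(batch, 1)))
    loss_D.backward()
    opt_D.step()

    # Generator: instead of minimizing E[log(1 - D(G(z)))], maximize
    # E[log D(G(z))] (the standard non-saturating surrogate).
    opt_G.zero_grad()
    loss_G = bce(D(G(z)), torch.ones(batch, 1))
    loss_G.backward()
    opt_G.step()
</code></pre>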
</section>

<section>
<h2 id="optimal-value-for-d">Optimal value for \( D \) </h2>

<p>Now we have a well-defined loss function. Let us first examine what the best value for \( D \) is.</p>

<p>&nbsp;<br>
$$
L(G, D) = \int_x \bigg( p_{r}(x) \log(D(x)) + p_g (x) \log(1 - D(x)) \bigg) dx
$$
<p>&nbsp;<br>

<p>Since we are interested in the value of \( D(x) \) that maximizes \( L(G, D) \), let us label </p>

<p>&nbsp;<br>
$$
\tilde{x} = D(x), \qquad
A=p_{r}(x), \qquad
B=p_g(x).
$$
<p>&nbsp;<br>

<p>The integrand can be maximized independently for each value of \( x \), so we can ignore the integral and study the function inside it (the logarithm is taken with base ten here, so the constant \( 1/\ln 10 \) appears in the derivative but does not affect the maximizer):</p>

<p>&nbsp;<br>
$$
\begin{align*}
f(\tilde{x})
& = A \log\tilde{x} + B \log(1-\tilde{x}), \\
\frac{d f(\tilde{x})}{d \tilde{x}}
& = A \frac{1}{\ln 10} \frac{1}{\tilde{x}} - B \frac{1}{\ln 10} \frac{1}{1 - \tilde{x}} \\
& = \frac{1}{\ln 10} \left(\frac{A}{\tilde{x}} - \frac{B}{1-\tilde{x}}\right) \\
& = \frac{1}{\ln 10} \, \frac{A - (A + B)\tilde{x}}{\tilde{x} (1 - \tilde{x})}.
\end{align*}
$$
<p>&nbsp;<br>

<p>Setting \( \frac{d f(\tilde{x})}{d \tilde{x}} = 0 \), we get the best value of the discriminator: \( D^*(x) = \tilde{x}^* = \frac{A}{A + B} = \frac{p_{r}(x)}{p_{r}(x) + p_g(x)} \in [0, 1] \).
Once the generator is trained to its optimum, \( p_g \) gets very close to \( p_{r} \), and when \( p_g = p_{r} \), \( D^*(x) \) becomes \( 1/2 \).
</p>

<p>When both \( G \) and \( D \) are at their optimal values, we have \( p_g = p_{r} \) and \( D^*(x) = 1/2 \) and the loss function becomes:</p>

<p>&nbsp;<br>
$$
\begin{align*}
L(G, D^*)
&= \int_x \bigg( p_{r}(x) \log(D^*(x)) + p_g (x) \log(1 - D^*(x)) \bigg) dx \\
&= \log \frac{1}{2} \int_x p_{r}(x) dx + \log \frac{1}{2} \int_x p_g(x) dx \\
&= -2\log2
\end{align*}
$$
<p>&nbsp;<br>
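
<p>A quick numerical check (our addition) of the two results above, evaluating \( D^*(x) = p_{r}(x)/(p_{r}(x)+p_g(x)) \) on a grid for the case \( p_g = p_{r} \):</p>

<pre><code class="python">
# Verify that L(G, D*) = -2 log 2 when p_g = p_r, using Gaussians on a grid.
import numpy as np

x = np.linspace(-5, 5, 2001)
dx = x[1] - x[0]

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma)**2) / (sigma * np.sqrt(2 * np.pi))

p_r = gaussian(x, 0.0, 1.0)
p_g = gaussian(x, 0.0, 1.0)  # a perfectly trained generator: p_g = p_r

d_star = p_r / (p_r + p_g)   # optimal discriminator, here 1/2 everywhere
loss = np.sum((p_r * np.log(d_star) + p_g * np.log(1.0 - d_star)) * dx)
print(loss, -2.0 * np.log(2.0))  # both approximately -1.3863
</code></pre>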
</section>

<section>
<h2 id="what-does-the-loss-function-represent">What does the Loss Function Represent? </h2>

<p>The JS divergence between \( p_{r} \) and \( p_g \) can be computed as:</p>

<p>&nbsp;<br>
$$
\begin{align*}
D_{JS}(p_{r} \| p_g)
=& \frac{1}{2} D_{KL}\left(p_{r} \| \frac{p_{r} + p_g}{2}\right) + \frac{1}{2} D_{KL}\left(p_{g} \| \frac{p_{r} + p_g}{2}\right) \\
=& \frac{1}{2} \bigg( \log2 + \int_x p_{r}(x) \log \frac{p_{r}(x)}{p_{r}(x) + p_g(x)} dx \bigg) + \\& \frac{1}{2} \bigg( \log2 + \int_x p_g(x) \log \frac{p_g(x)}{p_{r}(x) + p_g(x)} dx \bigg) \\
=& \frac{1}{2} \bigg( \log4 + L(G, D^*) \bigg)
\end{align*}
$$
<p>&nbsp;<br>

<p>Thus, </p>

<p>&nbsp;<br>
$$
L(G, D^*) = 2D_{JS}(p_{r} \| p_g) - 2\log2
$$
<p>&nbsp;<br>

<p>Essentially, the loss function of a GAN quantifies the similarity between
the generated data distribution \( p_g \) and the real sample
distribution \( p_{r} \) through the JS divergence when the discriminator is
optimal. The best \( G^* \), which replicates the real data distribution,
leads to the minimum \( L(G^*, D^*) = -2\log2 \), in agreement with the
equations above.
</p>
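
<p>The relation \( L(G, D^*) = 2D_{JS}(p_{r} \| p_g) - 2\log2 \) can also be checked numerically for an imperfect generator with \( p_g \neq p_{r} \) (our addition, same grid construction as in the previous sketch):</p>

<pre><code class="python">
# Verify L(G, D*) = 2 D_JS(p_r || p_g) - 2 log 2 for two different Gaussians.
import numpy as np

x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma)**2) / (sigma * np.sqrt(2 * np.pi))

p_r = gaussian(x, 0.0, 1.0)
p_g = gaussian(x, 1.0, 1.5)  # an imperfect generator

d_star = p_r / (p_r + p_g)
loss = np.sum((p_r * np.log(d_star) + p_g * np.log(1.0 - d_star)) * dx)

m = 0.5 * (p_r + p_g)
d_js = (0.5 * np.sum(p_r * np.log(p_r / m) * dx) +
        0.5 * np.sum(p_g * np.log(p_g / m) * dx))

print(loss, 2.0 * d_js - 2.0 * np.log(2.0))  # the two values agree
</code></pre>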
</section>

<section>
<h2 id="more-references">More references </h2>

</section>