Commit 0f25c0a ("update")
mhjensen committed May 1, 2024 (1 parent: d0e6c7e)
Showing 8 changed files with 1,425 additions and 624 deletions.

doc/pub/week15/html/week15-bs.html: 211 changes (205 additions, 6 deletions). Large diffs are not rendered by default.

doc/pub/week15/html/week15-reveal.html: 211 changes (205 additions, 6 deletions)
<h2 id="plans-for-the-week-of-april-29-may-3-2024">Plans for the week of April 29-May 3, 2024</h2>
<ol>
<p><li> Summary of Variational Autoencoders</li>
<p><li> Generative Adversarial Networks (GANs), see <a href="https://lilianweng.github.io/posts/2017-08-20-gan/" target="_blank"><tt>https://lilianweng.github.io/posts/2017-08-20-gan/</tt></a> for a nice overview</li>
<p><li> Start discussion of diffusion models, motivation</li>
<p><li> <a href="https://youtu.be/Cg8n9aWwHuU" target="_blank">Video of lecture</a></li>
<p><li> <a href="https://github.com/CompPhysics/AdvancedMachineLearning/blob/main/doc/HandwrittenNotes/2024/NotesApril30.pdf" target="_blank">Whiteboard notes</a></li>
</ol>
<h2 id="final-expression">Final expression </h2>
</section>

<section>
<h2 id="kullback-leibler-divergence">Kullback-Leibler divergence </h2>

<p>Before we continue, we need to remind ourselves about the
Kullback-Leibler divergence introduced earlier. This will also allow
us to introduce another measure used in connection with the training
of Generative Adversarial Networks, the so-called Jensen-Shannon divergence.
These metrics are useful for quantifying the similarity between two probability distributions.
</p>

<p>The Kullback&#8211;Leibler (KL) divergence, labeled \( D_{KL} \), measures how one probability distribution \( p \) diverges from a second expected probability distribution \( q \),
that is
</p>
<p>&nbsp;<br>
$$
D_{KL}(p \| q) = \int_x p(x) \log \frac{p(x)}{q(x)} dx.
$$
<p>&nbsp;<br>

<p>The KL divergence \( D_{KL} \) achieves its minimum, zero, when \( p(x) = q(x) \) everywhere.</p>

<p>Note that the KL divergence is asymmetric. In cases where \( p(x) \) is
close to zero but \( q(x) \) is significantly non-zero, the effect of \( q \)
is disregarded. This can give misleading results when we simply want to
measure the similarity between two equally important distributions.
</p>
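
<p>As a concrete illustration (our addition, not part of the original notes), the following minimal Python sketch evaluates \( D_{KL} \) for two discrete distributions on a common support and makes the asymmetry explicit:</p>

<pre><code class="python">
# Minimal sketch: D_KL(p || q) for discrete distributions given as
# probability arrays; assumes q > 0 wherever p > 0.
import numpy as np

def kl_divergence(p, q):
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # only points with p(x) > 0 contribute to the sum
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.8, 0.15, 0.05])

print(kl_divergence(p, q))  # positive
print(kl_divergence(q, p))  # a different value: the KL divergence is asymmetric
print(kl_divergence(p, p))  # 0.0: the minimum, attained when p = q everywhere
</code></pre>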
</section>

<section>
<h2 id="jensen-shannon-divergence">Jensen-Shannon divergence </h2>

<p>The Jensen&#8211;Shannon (JS) divergence is another measure of the similarity between
two probability distributions, bounded by \( [0, 1] \) (when the logarithm
is taken in base two). The JS divergence is symmetric and smoother than
the KL divergence. It is defined as
</p>
<p>&nbsp;<br>
$$
D_{JS}(p \| q) = \frac{1}{2} D_{KL}\left(p \| \frac{p + q}{2}\right) + \frac{1}{2} D_{KL}\left(q \| \frac{p + q}{2}\right).
$$
<p>&nbsp;<br>

<p>Many practitioners believe that one reason behind the great success
of GANs is the switch of the loss function from the asymmetric KL
divergence of the traditional maximum-likelihood approach to the
symmetric JS divergence.
</p>
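
<p>Building on the previous sketch (again our addition), the JS divergence follows directly from the KL divergence and the mixture \( (p+q)/2 \); the example below repeats the <tt>kl_divergence</tt> helper so that it runs on its own and verifies the symmetry:</p>

<pre><code class="python">
# Minimal sketch: D_JS(p || q), symmetric and bounded, built from D_KL.
import numpy as np

def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js_divergence(p, q):
    m = 0.5 * (np.asarray(p, dtype=float) + np.asarray(q, dtype=float))
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.8, 0.15, 0.05])
print(js_divergence(p, q), js_divergence(q, p))  # equal: D_JS is symmetric
</code></pre>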
</section>

<section>
<h2 id="generative-model-basic-overview-borrowed-from-rashcka-et-al">Generative model, basic overview (Borrowed from Rashcka et al) </h2>

<h2 id="the-derivation-from-last-week">The derivation from last week </h2>

$$
\begin{align*}
\log p(\boldsymbol{x}) & = \log p(\boldsymbol{x}) \int q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})dh && \text{(Multiply by $1 = \int q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})d\boldsymbol{h}$)}\\
& = \int q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})(\log p(\boldsymbol{x}))dh && \text{(Bring evidence into integral)}\\
& = \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log p(\boldsymbol{x})\right] && \text{(Definition of Expectation)}\\
& = \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log\frac{p(\boldsymbol{x}, \boldsymbol{h})}{p(\boldsymbol{h}|\boldsymbol{x})}\right]&& \\
& = \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log\frac{p(\boldsymbol{x}, \boldsymbol{h})q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}{p(\boldsymbol{h}|\boldsymbol{x})q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\right]&& \text{(Multiply by $1 = \frac{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}$)}\\
& = \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log\frac{p(\boldsymbol{x}, \boldsymbol{h})}{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\right] + \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log\frac{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}{p(\boldsymbol{h}|\boldsymbol{x})}\right] && \text{(Split the Expectation)}\\
& = \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log\frac{p(\boldsymbol{x}, \boldsymbol{h})}{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\right] +
D_{KL}(q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})\vert\vert p(\boldsymbol{h}|\boldsymbol{x})) && \text{(Definition of KL Divergence)}\\
& \geq \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log\frac{p(\boldsymbol{x}, \boldsymbol{h})}{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\right] && \text{(KL Divergence always $\geq 0$)}
\end{align*}
$$
<h2 id="dissecting-the-equations">Dissecting the equations </h2>

<p>&nbsp;<br>
$$
\begin{align*}
{\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log\frac{p(\boldsymbol{x}, \boldsymbol{h})}{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\right]}
&= {\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log\frac{p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{h})p(\boldsymbol{h})}{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\right]} && {\text{(Chain Rule of Probability)}}\\
&= {\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{h})\right] + \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log\frac{p(\boldsymbol{h})}{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\right]} && {\text{(Split the Expectation)}}\\
&= \underbrace{{\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{h})\right]}}_\text{reconstruction term} - \underbrace{{D_{KL}(q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\vert\vert{p(\boldsymbol{h}))}}_\text{prior matching term} && {\text{(Definition of KL Divergence)}}
\end{align*}
$$
<p>&nbsp;<br>
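
<p>For the common choice of a Gaussian encoder \( q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x}) \) with mean \( \boldsymbol{\mu} \) and diagonal variance \( \boldsymbol{\sigma}^2 \), together with a standard normal prior \( p(\boldsymbol{h}) \), the prior matching term has a simple closed form. A minimal sketch (our addition, assuming this Gaussian setup):</p>

<pre><code class="python">
# Closed-form D_KL( N(mu, diag(sigma^2)) || N(0, I) )
#   = -1/2 * sum(1 + log sigma^2 - mu^2 - sigma^2)
import numpy as np

def kl_gaussian_standard_normal(mu, log_var):
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

mu = np.array([0.5, -0.2])
log_var = np.array([0.0, -1.0])  # log sigma^2 per latent dimension
print(kl_gaussian_standard_normal(mu, log_var))
# The term vanishes when q equals the prior (mu = 0, sigma = 1):
print(kl_gaussian_standard_normal(np.zeros(2), np.zeros(2)))  # 0.0
</code></pre>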
<h2 id="analytical-evaluation">Analytical evaluation </h2>
<p>&nbsp;<br>
$$
\begin{align*}
\mathrm{argmax}_{\boldsymbol{\phi}, \boldsymbol{\theta}} \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})}\left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{h})\right] - D_{KL}(q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})\vert\vert p(\boldsymbol{h})) \approx \mathrm{argmax}_{\boldsymbol{\phi}, \boldsymbol{\theta}} \sum_{l=1}^{L}\log p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{h}^{(l)}) - D_{KL}(q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})\vert\vert p(\boldsymbol{h}))
\end{align*}
$$
<p>&nbsp;<br>
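
<p>In code, this Monte Carlo estimate is typically combined with the reparametrization trick, sampling \( \boldsymbol{h}^{(l)} = \boldsymbol{\mu} + \boldsymbol{\sigma}\odot\boldsymbol{\epsilon}^{(l)} \) with \( \boldsymbol{\epsilon}^{(l)}\sim \mathcal{N}(0,I) \). A short sketch (our addition; the decoder log-likelihood is a hypothetical stand-in):</p>

<pre><code class="python">
# Monte Carlo estimate of E_{q_phi(h|x)}[ log p_theta(x|h) ] with
# reparametrized samples h = mu + sigma * eps, eps ~ N(0, I).
import numpy as np

rng = np.random.default_rng(0)

def log_p_x_given_h(x, h):
    # Hypothetical Gaussian decoder (identity mapping), for illustration only.
    return -0.5 * np.sum((x - h)**2)

mu, sigma = np.array([0.1, 0.3]), np.array([1.0, 0.5])
x = np.array([0.0, 0.2])

L = 1000  # number of Monte Carlo samples
estimates = [log_p_x_given_h(x, mu + sigma * rng.standard_normal(2))
             for _ in range(L)]
print(np.mean(estimates))  # approximates the reconstruction term
</code></pre>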
<h2 id="steps-in-building-a-gan-borrowed-from-rashcka-et-al">Steps in building a GAN (Borrowed from Raschka et al) </h2>
<br/><br/>
</section>

<section>
<h2 id="generative-adversarial-networks">Generative Adversarial Networks </h2>
<p>Generative adversarial networks (GANs) have shown great results in
many generative tasks that replicate real-world rich content such as
images, human language, and music. They are inspired by game theory: two
models, a generator and a discriminator, compete with each other while
making each other stronger at the same time. However, training a GAN
model is rather challenging, with training instability or failure to
converge as common problems.
</p>

<p>Generative adversarial networks consist of two models (in their simplest form, two opposing feedforward neural networks)</p>
<ol>
<p><li> A discriminator \( D \) estimates the probability of a given sample coming from the real dataset. It works as a critic and is optimized to tell the fake samples from the real ones</li>
<p><li> A generator \( G \) outputs synthetic samples given a noise variable input \( z \) (\( z \) brings in potential output diversity). It is trained to capture the real data distribution in order to generate samples that can be as real as possible, or in other words, can trick the discriminator to offer a high probability.</li>
</ol>
<p>At the end of the training, the generator can be used to generate, for
example, new images. In this sense we have trained a model which can
produce new samples. We say that we have implicitly defined a
probability distribution.
</p>

<p>These two models compete against each other during the training
process: the generator \( G \) is trying hard to trick the discriminator,
while the critic model \( D \) is trying hard not to be cheated. This
interesting zero-sum game between the two models drives both to improve.
</p>

<p>We use the following notation: \( p_{z} \) is the distribution of the
noise input \( z \) (usually just uniform), \( p_{g} \) is the generator's
distribution over the data \( x \), and \( p_{r} \) is the distribution of
the real samples \( x \).</p>

<p>On one hand, we want to make sure the discriminator \( D \)'s decisions
over real data are accurate by maximizing \( \mathbb{E}_{x \sim
p_{r}(x)} [\log D(x)] \). Meanwhile, given a fake sample \( G(z), z \sim
p_z(z) \), the discriminator is expected to output a probability,
\( D(G(z)) \), close to zero by maximizing \( \mathbb{E}_{z \sim p_{z}(z)}
[\log (1 - D(G(z)))] \).
</p>

<p>On the other hand, the generator is trained to increase the chances of
\( D \) producing a high probability for a fake example, that is, to minimize
\( \mathbb{E}_{z \sim p_{z}(z)} [\log (1 - D(G(z)))] \).
</p>

<p>When combining both aspects, \( D \) and \( G \) play a <em>minimax game</em> in which we optimize the following loss function:</p>

<p>&nbsp;<br>
$$
\begin{aligned}
\min_G \max_D L(D, G)
& = \mathbb{E}_{x \sim p_{r}(x)} [\log D(x)] + \mathbb{E}_{z \sim p_z(z)} [\log(1 - D(G(z)))] \\
& = \mathbb{E}_{x \sim p_{r}(x)} [\log D(x)] + \mathbb{E}_{x \sim p_g(x)} [\log(1 - D(x))]
\end{aligned}
$$
<p>&nbsp;<br>

<p>where \( \mathbb{E}_{x \sim p_{r}(x)} [\log D(x)] \) has no impact on \( G \) during gradient descent updates.</p>
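
<p>As a concrete illustration of this minimax objective, here is a minimal PyTorch training sketch (our addition; the toy data, network sizes and hyperparameters are illustrative assumptions, and the generator uses the common non-saturating variant of its loss):</p>

<pre><code class="python">
# Minimal GAN training loop on toy 2D data.
import torch
import torch.nn as nn

torch.manual_seed(0)
latent_dim, batch = 8, 64

# Generator G: noise z -> fake sample; discriminator D: sample -> P("real").
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, 2))
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(1000):
    x = torch.randn(batch, 2) + 2.0     # toy "real" data, p_r
    z = torch.randn(batch, latent_dim)  # noise input, p_z

    # Discriminator: maximize E[log D(x)] + E[log(1 - D(G(z)))],
    # i.e. minimize the two binary cross-entropy terms below.
    opt_D.zero_grad()
    loss_D = (bce(D(x), torch.ones(batch, 1)) +
              bce(D(G(z).detach()), torch.zeros(batch, 1)))
    loss_D.backward()
    opt_D.step()

    # Generator: instead of minimizing E[log(1 - D(G(z)))], maximize
    # E[log D(G(z))] (the standard non-saturating surrogate).
    opt_G.zero_grad()
    loss_G = bce(D(G(z)), torch.ones(batch, 1))
    loss_G.backward()
    opt_G.step()
</code></pre>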
</section>

<section>
<h2 id="optimal-value-for-d">Optimal value for \( D \) </h2>

<p>Now we have a well-defined loss function. Let us first examine what the best value for \( D \) is.</p>

<p>&nbsp;<br>
$$
L(G, D) = \int_x \bigg( p_{r}(x) \log(D(x)) + p_g (x) \log(1 - D(x)) \bigg) dx
$$
<p>&nbsp;<br>

<p>Since we are interested in the value of \( D(x) \) that maximizes \( L(G, D) \), let us label </p>

<p>&nbsp;<br>
$$
\tilde{x} = D(x), \qquad
A=p_{r}(x), \qquad
B=p_g(x).
$$
<p>&nbsp;<br>

<p>The integrand can be maximized independently for each value of \( x \), so we can ignore the integral and study the function inside it (the logarithm is taken with base ten here, so the constant \( 1/\ln 10 \) appears in the derivative but does not affect the maximizer):</p>

<p>&nbsp;<br>
$$
\begin{align*}
f(\tilde{x})
& = A \log\tilde{x} + B \log(1-\tilde{x}), \\
\frac{d f(\tilde{x})}{d \tilde{x}}
& = A \frac{1}{\ln 10} \frac{1}{\tilde{x}} - B \frac{1}{\ln 10} \frac{1}{1 - \tilde{x}} \\
& = \frac{1}{\ln 10} \left(\frac{A}{\tilde{x}} - \frac{B}{1-\tilde{x}}\right) \\
& = \frac{1}{\ln 10} \, \frac{A - (A + B)\tilde{x}}{\tilde{x} (1 - \tilde{x})}.
\end{align*}
$$
<p>&nbsp;<br>

<p>Setting \( \frac{d f(\tilde{x})}{d \tilde{x}} = 0 \), we get the best value of the discriminator: \( D^*(x) = \tilde{x}^* = \frac{A}{A + B} = \frac{p_{r}(x)}{p_{r}(x) + p_g(x)} \in [0, 1] \).
Once the generator is trained to its optimum, \( p_g \) gets very close to \( p_{r} \), and when \( p_g = p_{r} \), \( D^*(x) \) becomes \( 1/2 \).
</p>

<p>When both \( G \) and \( D \) are at their optimal values, we have \( p_g = p_{r} \) and \( D^*(x) = 1/2 \) and the loss function becomes:</p>

<p>&nbsp;<br>
$$
\begin{align*}
L(G, D^*)
&= \int_x \bigg( p_{r}(x) \log(D^*(x)) + p_g (x) \log(1 - D^*(x)) \bigg) dx \\
&= \log \frac{1}{2} \int_x p_{r}(x) dx + \log \frac{1}{2} \int_x p_g(x) dx \\
&= -2\log2
\end{align*}
$$
<p>&nbsp;<br>
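
<p>A quick numerical check (our addition) of the two results above, evaluating \( D^*(x) = p_{r}(x)/(p_{r}(x)+p_g(x)) \) on a grid for the case \( p_g = p_{r} \):</p>

<pre><code class="python">
# Verify that L(G, D*) = -2 log 2 when p_g = p_r, using Gaussians on a grid.
import numpy as np

x = np.linspace(-5, 5, 2001)
dx = x[1] - x[0]

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma)**2) / (sigma * np.sqrt(2 * np.pi))

p_r = gaussian(x, 0.0, 1.0)
p_g = gaussian(x, 0.0, 1.0)  # a perfectly trained generator: p_g = p_r

d_star = p_r / (p_r + p_g)   # optimal discriminator, here 1/2 everywhere
loss = np.sum((p_r * np.log(d_star) + p_g * np.log(1.0 - d_star)) * dx)
print(loss, -2.0 * np.log(2.0))  # both approximately -1.3863
</code></pre>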
</section>

<section>
<h2 id="what-does-the-loss-function-represent">What does the Loss Function Represent? </h2>

<p>The JS divergence between \( p_{r} \) and \( p_g \) can be computed as:</p>

<p>&nbsp;<br>
$$
\begin{align*}
D_{JS}(p_{r} \| p_g)
=& \frac{1}{2} D_{KL}\left(p_{r} \| \frac{p_{r} + p_g}{2}\right) + \frac{1}{2} D_{KL}\left(p_{g} \| \frac{p_{r} + p_g}{2}\right) \\
=& \frac{1}{2} \bigg( \log2 + \int_x p_{r}(x) \log \frac{p_{r}(x)}{p_{r}(x) + p_g(x)} dx \bigg) + \\& \frac{1}{2} \bigg( \log2 + \int_x p_g(x) \log \frac{p_g(x)}{p_{r}(x) + p_g(x)} dx \bigg) \\
=& \frac{1}{2} \bigg( \log4 + L(G, D^*) \bigg)
\end{align*}
$$
<p>&nbsp;<br>

<p>Thus, </p>

<p>&nbsp;<br>
$$
L(G, D^*) = 2D_{JS}(p_{r} \| p_g) - 2\log2
$$
<p>&nbsp;<br>

<p>Essentially, the loss function of a GAN quantifies the similarity between
the generated data distribution \( p_g \) and the real sample
distribution \( p_{r} \) through the JS divergence when the discriminator is
optimal. The best \( G^* \), which replicates the real data distribution,
leads to the minimum \( L(G^*, D^*) = -2\log2 \), in agreement with the
equations above.
</p>
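
<p>The relation \( L(G, D^*) = 2D_{JS}(p_{r} \| p_g) - 2\log2 \) can also be checked numerically for an imperfect generator with \( p_g \neq p_{r} \) (our addition, same grid construction as in the previous sketch):</p>

<pre><code class="python">
# Verify L(G, D*) = 2 D_JS(p_r || p_g) - 2 log 2 for two different Gaussians.
import numpy as np

x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma)**2) / (sigma * np.sqrt(2 * np.pi))

p_r = gaussian(x, 0.0, 1.0)
p_g = gaussian(x, 1.0, 1.5)  # an imperfect generator

d_star = p_r / (p_r + p_g)
loss = np.sum((p_r * np.log(d_star) + p_g * np.log(1.0 - d_star)) * dx)

m = 0.5 * (p_r + p_g)
d_js = (0.5 * np.sum(p_r * np.log(p_r / m) * dx) +
        0.5 * np.sum(p_g * np.log(p_g / m) * dx))

print(loss, 2.0 * d_js - 2.0 * np.log(2.0))  # the two values agree
</code></pre>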
</section>

<section>
<h2 id="more-references">More references </h2>

</section>