correcting notation
mhjensen committed Apr 15, 2024
1 parent 0129999 commit ed8b8f9
Showing 9 changed files with 306 additions and 278 deletions.
49 changes: 26 additions & 23 deletions doc/pub/week13/html/week13-bs.html
@@ -753,7 +753,10 @@ <h2 id="kullback-leibler-again" class="anchor">Kullback-Leibler again </h2>

<p>However, if \( \boldsymbol{h} \) is sampled from an arbitrary distribution with
PDF \( Q(\boldsymbol{h}) \), which is not \( \mathcal{N}(0,I) \), then how does that
-help us optimize \( p(\boldsymbol{x}) \)? The first thing we need to do is relate
+help us optimize \( p(\boldsymbol{x}) \)?
+</p>
+
+<p>The first thing we need to do is relate
\( E_{\boldsymbol{h}\sim Q}p(\boldsymbol{x}\vert \boldsymbol{h}) \) and \( p(\boldsymbol{x}) \). We will see where \( Q \) comes from later.
</p>
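To make the relation between \( E_{\boldsymbol{h}\sim Q}p(\boldsymbol{x}\vert \boldsymbol{h}) \) and \( p(\boldsymbol{x}) \) concrete, here is a short NumPy sketch added for illustration (it is not part of the changed files). It uses a toy one-dimensional linear-Gaussian model in which \( p(\boldsymbol{x}) \) is known exactly, and shows that averaging \( p(\boldsymbol{x}\vert \boldsymbol{h}) \) over samples from an arbitrary \( Q \) only recovers \( p(\boldsymbol{x}) \) once the ratio \( p(\boldsymbol{h})/Q(\boldsymbol{h}) \) is accounted for; all parameter values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy latent-variable model: h ~ N(0,1), x|h ~ N(a*h, sigma**2),
# so the marginal is known in closed form: p(x) = N(0, a**2 + sigma**2).
a, sigma = 2.0, 0.5
x = 1.3  # the observation whose likelihood we want

def gauss_pdf(v, mean, var):
    return np.exp(-0.5 * (v - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

p_x_exact = gauss_pdf(x, 0.0, a**2 + sigma**2)

n = 200_000
# Sampling h from the prior p(h) gives an unbiased estimate of p(x):
h_prior = rng.standard_normal(n)
p_x_prior = gauss_pdf(x, a * h_prior, sigma**2).mean()

# Sampling h from some other distribution Q(h) (here N(1, 0.7**2)) does not,
# unless each sample is reweighted by p(h)/Q(h):
mu_q, std_q = 1.0, 0.7
h_q = rng.normal(mu_q, std_q, n)
p_x_naive = gauss_pdf(x, a * h_q, sigma**2).mean()             # biased
w = gauss_pdf(h_q, 0.0, 1.0) / gauss_pdf(h_q, mu_q, std_q**2)  # p(h)/Q(h)
p_x_weighted = (w * gauss_pdf(x, a * h_q, sigma**2)).mean()    # unbiased again

print(f"exact p(x)            : {p_x_exact:.4f}")
print(f"E_(h~p)[p(x|h)]       : {p_x_prior:.4f}")
print(f"E_(h~Q)[p(x|h)] naive : {p_x_naive:.4f}")
print(f"importance weighted   : {p_x_weighted:.4f}")
```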

@@ -770,14 +773,14 @@ <h2 id="and-applying-bayes-rule" class="anchor">And applying Bayes rule </h2>

<p>We can get both \( p(\boldsymbol{x}) \) and \( p(\boldsymbol{x}\vert \boldsymbol{h}) \) into this equation by applying Bayes rule to \( p(\boldsymbol{h}|\boldsymbol{x}) \)</p>
$$
-\mathcal{D}\left[Q(z)\|P(z|X)\right]=E_{z\sim Q}\left[\log Q(z) - \log P(X|z) - \log P(z) \right] + \log P(X).
+\mathcal{D}\left[Q(\boldsymbol{h})\|p(\boldsymbol{h}\vert \boldsymbol{x})\right]=E_{\boldsymbol{h}\sim Q}\left[\log Q(\boldsymbol{h}) - \log p(\boldsymbol{x}|\boldsymbol{h}) - \log p(\boldsymbol{h}) \right] + \log p(\boldsymbol{x}).
$$

-<p>Here, \( \log P(X) \) comes out of the expectation because it does not depend on \( z \).
-Negating both sides, rearranging, and contracting part of \( E_{z\sim Q} \) into a KL-divergence terms yields:
+<p>Here, \( \log p(\boldsymbol{x}) \) comes out of the expectation because it does not depend on \( \boldsymbol{h} \).
+Negating both sides, rearranging, and contracting part of \( E_{\boldsymbol{h}\sim Q} \) into a KL-divergence term yields:
</p>
$$
-\log P(X) - \mathcal{D}\left[Q(z)\|P(z|X)\right]=E_{z\sim Q}\left[\log P(X|z) \right] - \mathcal{D}\left[Q(z)\|P(z)\right].
+\log p(\boldsymbol{x}) - \mathcal{D}\left[Q(\boldsymbol{h})\|p(\boldsymbol{h}\vert \boldsymbol{x})\right]=E_{\boldsymbol{h}\sim Q}\left[\log p(\boldsymbol{x}\vert\boldsymbol{h}) \right] - \mathcal{D}\left[Q(\boldsymbol{h})\|p(\boldsymbol{h})\right].
$$
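A quick numerical check of this identity, added for illustration and not part of the changed files: the sketch below builds a small discrete latent-variable model in which the posterior \( p(\boldsymbol{h}\vert \boldsymbol{x}) \) can be computed exactly by Bayes rule, picks an arbitrary \( Q \), and confirms that the two sides agree. All distributions are randomly generated for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Discrete toy model: latent h in {0,...,K-1}, observation x in {0,...,M-1}.
K, M = 4, 3
p_h = rng.dirichlet(np.ones(K))                  # prior p(h)
p_x_given_h = rng.dirichlet(np.ones(M), size=K)  # likelihood p(x|h), one row per h
Q = rng.dirichlet(np.ones(K))                    # an arbitrary distribution Q(h)

x = 1                                            # a fixed observation
p_x = np.sum(p_h * p_x_given_h[:, x])            # marginal p(x)
p_h_given_x = p_h * p_x_given_h[:, x] / p_x      # posterior via Bayes rule

def kl(q, p):
    return np.sum(q * np.log(q / p))

lhs = np.log(p_x) - kl(Q, p_h_given_x)
rhs = np.sum(Q * np.log(p_x_given_h[:, x])) - kl(Q, p_h)
print(lhs, rhs)   # identical up to floating-point noise
```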


@@ -786,37 +789,37 @@ <h2 id="rearraning" class="anchor">Rearranging </h2>

<p>Using Bayes rule we obtain</p>
$$
-E_{z\sim Q}\left[\log P(Y_i|z,X_i)\right]=E_{z\sim Q}\left[\log P(z|Y_i,X_i) - \log P(z|X_i) + \log P(Y_i|X_i) \right]
+E_{\boldsymbol{h}\sim Q}\left[\log p(y_i|\boldsymbol{h},x_i)\right]=E_{\boldsymbol{h}\sim Q}\left[\log p(\boldsymbol{h}|y_i,x_i) - \log p(\boldsymbol{h}|x_i) + \log p(y_i|x_i) \right]
$$

-<p>Rearranging the terms and subtracting \( E_{z\sim Q}\log Q(z) \) from both sides gives</p>
+<p>Rearranging the terms and subtracting \( E_{\boldsymbol{h}\sim Q}\log Q(\boldsymbol{h}) \) from both sides gives</p>
$$
\begin{array}{c}
-\log P(Y_i|X_i) - E_{z\sim Q}\left[\log Q(z)-\log P(z|X_i,Y_i)\right]=\hspace{10em}\\
-\hspace{10em}E_{z\sim Q}\left[\log P(Y_i|z,X_i)+\log P(z|X_i)-\log Q(z)\right]
+\log p(y_i|x_i) - E_{\boldsymbol{h}\sim Q}\left[\log Q(\boldsymbol{h})-\log p(\boldsymbol{h}|x_i,y_i)\right]=\hspace{10em}\\
+\hspace{10em}E_{\boldsymbol{h}\sim Q}\left[\log p(y_i|\boldsymbol{h},x_i)+\log p(\boldsymbol{h}|x_i)-\log Q(\boldsymbol{h})\right]
\end{array}
$$

-<p>Note that \( X \) is fixed, and \( Q \) can be \textit{any} distribution, not
-just a distribution which does a good job mapping \( X \) to the \( z \)'s
+<p>Note that \( \boldsymbol{x} \) is fixed, and \( Q \) can be <em>any</em> distribution, not
+just a distribution which does a good job mapping \( \boldsymbol{x} \) to the \( \boldsymbol{h} \)'s
that can produce \( \boldsymbol{x} \).
</p>
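The rearrangement above is just Bayes rule applied inside the expectation. The following small sketch, illustrative only, with randomly generated discrete distributions standing in for \( p(\boldsymbol{h}|x_i) \), \( p(y_i|\boldsymbol{h},x_i) \) and \( Q \), verifies it numerically.

```python
import numpy as np

rng = np.random.default_rng(2)

# Discrete toy: for a fixed input x_i, latent h in {0,...,K-1}, target y in {0,...,M-1}.
K, M = 5, 3
p_h_given_x = rng.dirichlet(np.ones(K))            # p(h|x_i)
p_y_given_hx = rng.dirichlet(np.ones(M), size=K)   # p(y|h,x_i), one row per h
Q = rng.dirichlet(np.ones(K))                      # any Q(h)

y = 0
p_y_given_x = np.sum(p_h_given_x * p_y_given_hx[:, y])         # p(y_i|x_i)
p_h_given_xy = p_h_given_x * p_y_given_hx[:, y] / p_y_given_x  # p(h|y_i,x_i) by Bayes rule

lhs = np.sum(Q * np.log(p_y_given_hx[:, y]))
rhs = np.sum(Q * (np.log(p_h_given_xy) - np.log(p_h_given_x) + np.log(p_y_given_x)))
print(lhs, rhs)   # equal: log p(y|h,x) = log p(h|y,x) - log p(h|x) + log p(y|x)
```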

<!-- !split -->
<h2 id="inferring-the-probability" class="anchor">Inferring the probability </h2>

-<p>Since we are interested in inferring \( P(X) \), it makes sense to
-construct a \( Q \) which \textit{does} depend on \( X \), and in particular,
-one which makes \( \mathcal{D}\left[Q(z)\|P(z|X)\right] \) small
+<p>Since we are interested in inferring \( p(\boldsymbol{x}) \), it makes sense to
+construct a \( Q \) which <em>does</em> depend on \( \boldsymbol{x} \), and in particular,
+one which makes \( \mathcal{D}\left[Q(\boldsymbol{h})\|p(\boldsymbol{h}|\boldsymbol{x})\right] \) small
</p>
$$
-\log P(X) - \mathcal{D}\left[Q(z|X)\|P(z|X)\right]=E_{z\sim Q}\left[\log P(X|z) \right] - \mathcal{D}\left[Q(z|X)\|P(z)\right].
+\log p(\boldsymbol{x}) - \mathcal{D}\left[Q(\boldsymbol{h}|\boldsymbol{x})\|p(\boldsymbol{h}|\boldsymbol{x})\right]=E_{\boldsymbol{h}\sim Q}\left[\log p(\boldsymbol{x}|\boldsymbol{h}) \right] - \mathcal{D}\left[Q(\boldsymbol{h}|\boldsymbol{x})\|p(\boldsymbol{h})\right].
$$

<p>Hence, during training, it makes sense to choose a \( Q \) which will make
-\( E_{z\sim Q}[\log Q(z)- \) $\log P(z|X_i,Y_i)]$ (a
+\( E_{\boldsymbol{h}\sim Q}[\log Q(\boldsymbol{h})- \log p(\boldsymbol{h}|x_i,y_i)] \) (a
\( \mathcal{D} \)-divergence) small, such that the right hand side is a
-close approximation to \( \log P(Y_i|X_i) \).
+close approximation to \( \log p(y_i|x_i) \).
</p>

<!-- !split -->
@@ -827,16 +830,16 @@ <h2 id="central-equation-of-vaes" class="anchor">Central equation of VAEs </h2>
</p>

<ol>
-<li> The left hand side has the quantity we want to maximize, namely \( \log P(X) \) plus an error term.</li>
+<li> The left hand side has the quantity we want to maximize, namely \( \log p(\boldsymbol{x}) \) plus an error term.</li>
<li> The right hand side is something we can optimize via stochastic gradient descent given the right choice of \( Q \).</li>
</ol>
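As an illustration of what "optimize via stochastic gradient descent" amounts to in point 2 above, here is a minimal NumPy sketch of the per-example negative of the right-hand side (reconstruction term plus KL term). It is not taken from the lecture source; it assumes a diagonal-Gaussian \( Q(\boldsymbol{h}|\boldsymbol{x}) \) and a unit-variance Gaussian decoder, the decoder and all numbers are placeholders, and a real implementation would compute gradients with an automatic-differentiation framework.

```python
import numpy as np

rng = np.random.default_rng(3)

def negative_elbo(x, mu, log_var, decode, n_samples=1):
    """Monte Carlo estimate of -(E_{h~Q(h|x)}[log p(x|h)] - KL[Q(h|x)||p(h)]).

    x           : observed vector
    mu, log_var : encoder outputs defining Q(h|x) = N(mu, diag(exp(log_var)))
    decode      : function h -> mean of a unit-variance Gaussian decoder p(x|h)
    """
    std = np.exp(0.5 * log_var)
    recon = 0.0
    for _ in range(n_samples):
        eps = rng.standard_normal(mu.shape)
        h = mu + std * eps                      # reparameterization trick
        x_hat = decode(h)
        # full log-density of a unit-variance Gaussian decoder, log N(x | x_hat, I)
        recon += -0.5 * np.sum((x - x_hat) ** 2 + np.log(2 * np.pi))
    recon /= n_samples
    # KL[N(mu, diag(exp(log_var))) || N(0, I)] in closed form
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
    return -(recon - kl)

# Tiny usage example with a fixed linear "decoder"
W = rng.standard_normal((4, 2))
x = rng.standard_normal(4)
mu, log_var = np.zeros(2), np.zeros(2)
print(negative_elbo(x, mu, log_var, lambda h: W @ h, n_samples=10))
```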
<!-- !split -->
<h2 id="setting-up-sgd" class="anchor">Setting up SGD </h2>
<p>So how can we perform stochastic gradient descent?</p>

-<p>First we need to be a bit more specific about the form that \( Q(z|X) \)
+<p>First we need to be a bit more specific about the form that \( Q(\boldsymbol{h}|\boldsymbol{x}) \)
will take. The usual choice is to say that
-\( Q(z|X)=\mathcal{N}(z|\mu(X;\vartheta),\Sigma(X;\vartheta)) \), where
+\( Q(\boldsymbol{h}|\boldsymbol{x})=\mathcal{N}(\boldsymbol{h}|\mu(\boldsymbol{x};\vartheta),\Sigma(\boldsymbol{x};\vartheta)) \), where
\( \mu \) and \( \Sigma \) are arbitrary deterministic functions with
parameters \( \vartheta \) that can be learned from data (we will omit
\( \vartheta \) in later equations). In practice, \( \mu \) and \( \Sigma \) are
@@ -848,10 +851,10 @@ <h2 id="more-on-the-sgd" class="anchor">More on the SGD </h2>
<h2 id="more-on-the-sgd" class="anchor">More on the SGD </h2>

<p>The name variational &quot;autoencoder&quot; comes from
-the fact that \( \mu \) and \( \Sigma \) are &quot;encoding&quot; \( X \) into the latent
-space \( z \). The advantages of this choice are computational, as they
+the fact that \( \mu \) and \( \Sigma \) are &quot;encoding&quot; \( \boldsymbol{x} \) into the latent
+space \( \boldsymbol{h} \). The advantages of this choice are computational, as they
make it clear how to compute the right hand side. The last
-term---\( \mathcal{D}\left[Q(z|X)\|P(z)\right] \)---is now a KL-divergence
+term---\( \mathcal{D}\left[Q(\boldsymbol{h}|\boldsymbol{x})\|p(\boldsymbol{h})\right] \)---is now a KL-divergence
between two multivariate Gaussian distributions, which can be computed
in closed form as:
</p>
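For the case used here, \( Q(\boldsymbol{h}|\boldsymbol{x})=\mathcal{N}(\mu,\Sigma) \) against the prior \( p(\boldsymbol{h})=\mathcal{N}(0,I) \), the standard closed form is \( \frac{1}{2}\left(\mathrm{tr}(\Sigma)+\mu^T\mu-k-\log\det\Sigma\right) \), with \( k \) the latent dimension. The sketch below, added for illustration with a randomly generated covariance, evaluates it and cross-checks it against a Monte Carlo estimate of \( E_{\boldsymbol{h}\sim Q}[\log Q(\boldsymbol{h})-\log p(\boldsymbol{h})] \).

```python
import numpy as np

rng = np.random.default_rng(4)

# KL[N(mu, Sigma) || N(0, I)] = 0.5 * ( tr(Sigma) + mu^T mu - k - log det(Sigma) )
def kl_gauss_vs_standard(mu, Sigma):
    k = mu.size
    _, logdet = np.linalg.slogdet(Sigma)
    return 0.5 * (np.trace(Sigma) + mu @ mu - k - logdet)

# Cross-check against a Monte Carlo estimate of E_Q[log Q(h) - log p(h)]
k = 3
A = rng.standard_normal((k, k))
Sigma = A @ A.T + 0.1 * np.eye(k)     # a random symmetric positive-definite covariance
mu = rng.standard_normal(k)

h = rng.multivariate_normal(mu, Sigma, size=200_000)
diff = h - mu
log_q = -0.5 * (np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)
                + np.linalg.slogdet(Sigma)[1] + k * np.log(2 * np.pi))
log_p = -0.5 * (np.sum(h**2, axis=1) + k * np.log(2 * np.pi))

print(kl_gauss_vs_standard(mu, Sigma))   # closed form
print(np.mean(log_q - log_p))            # Monte Carlo, agrees to a couple of decimals
```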
49 changes: 26 additions & 23 deletions doc/pub/week13/html/week13-reveal.html