correcting notation
mhjensen committed Apr 15, 2024
1 parent 0129999 commit ed8b8f9
Showing 9 changed files with 306 additions and 278 deletions.
49 changes: 26 additions & 23 deletions doc/pub/week13/html/week13-bs.html
@@ -753,7 +753,10 @@ <h2 id="kullback-leibler-again" class="anchor">Kullback-Leibler again </h2>

<p>However, if \( \boldsymbol{h} \) is sampled from an arbitrary distribution with
PDF \( Q(\boldsymbol{h}) \), which is not \( \mathcal{N}(0,I) \), then how does that
-help us optimize \( p(\boldsymbol{x}) \)? The first thing we need to do is relate
+help us optimize \( p(\boldsymbol{x}) \)?
+</p>
+
+<p>The first thing we need to do is relate
\( E_{\boldsymbol{h}\sim Q}p(\boldsymbol{x}\vert \boldsymbol{h}) \) and \( p(\boldsymbol{x}) \). We will see where \( Q \) comes from later.
</p>
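To make the relation between \( E_{\boldsymbol{h}\sim Q}p(\boldsymbol{x}\vert \boldsymbol{h}) \) and \( p(\boldsymbol{x}) \) concrete, here is a short NumPy sketch added for illustration (it is not part of the changed files). It uses a toy one-dimensional linear-Gaussian model in which \( p(\boldsymbol{x}) \) is known exactly, and shows that averaging \( p(\boldsymbol{x}\vert \boldsymbol{h}) \) over samples from an arbitrary \( Q \) only recovers \( p(\boldsymbol{x}) \) once the ratio \( p(\boldsymbol{h})/Q(\boldsymbol{h}) \) is accounted for; all parameter values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy latent-variable model: h ~ N(0,1), x|h ~ N(a*h, sigma**2),
# so the marginal is known in closed form: p(x) = N(0, a**2 + sigma**2).
a, sigma = 2.0, 0.5
x = 1.3  # the observation whose likelihood we want

def gauss_pdf(v, mean, var):
    return np.exp(-0.5 * (v - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

p_x_exact = gauss_pdf(x, 0.0, a**2 + sigma**2)

n = 200_000
# Sampling h from the prior p(h) gives an unbiased estimate of p(x):
h_prior = rng.standard_normal(n)
p_x_prior = gauss_pdf(x, a * h_prior, sigma**2).mean()

# Sampling h from some other distribution Q(h) (here N(1, 0.7**2)) does not,
# unless each sample is reweighted by p(h)/Q(h):
mu_q, std_q = 1.0, 0.7
h_q = rng.normal(mu_q, std_q, n)
p_x_naive = gauss_pdf(x, a * h_q, sigma**2).mean()             # biased
w = gauss_pdf(h_q, 0.0, 1.0) / gauss_pdf(h_q, mu_q, std_q**2)  # p(h)/Q(h)
p_x_weighted = (w * gauss_pdf(x, a * h_q, sigma**2)).mean()    # unbiased again

print(f"exact p(x)            : {p_x_exact:.4f}")
print(f"E_(h~p)[p(x|h)]       : {p_x_prior:.4f}")
print(f"E_(h~Q)[p(x|h)] naive : {p_x_naive:.4f}")
print(f"importance weighted   : {p_x_weighted:.4f}")
```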

@@ -770,14 +773,14 @@ <h2 id="and-applying-bayes-rule" class="anchor">And applying Bayes rule </h2>

<p>We can get both \( p(\boldsymbol{x}) \) and \( p(\boldsymbol{x}\vert \boldsymbol{h}) \) into this equation by applying Bayes rule to \( p(\boldsymbol{h}|\boldsymbol{x}) \)</p>
$$
-\mathcal{D}\left[Q(z)\|P(z|X)\right]=E_{z\sim Q}\left[\log Q(z) - \log P(X|z) - \log P(z) \right] + \log P(X).
+\mathcal{D}\left[Q(\boldsymbol{h})\|p(\boldsymbol{h}\vert \boldsymbol{x})\right]=E_{\boldsymbol{h}\sim Q}\left[\log Q(\boldsymbol{h}) - \log p(\boldsymbol{x}|\boldsymbol{h}) - \log p(\boldsymbol{h}) \right] + \log p(\boldsymbol{x}).
$$

-<p>Here, \( \log P(X) \) comes out of the expectation because it does not depend on \( z \).
-Negating both sides, rearranging, and contracting part of \( E_{z\sim Q} \) into a KL-divergence terms yields:
+<p>Here, \( \log p(\boldsymbol{x}) \) comes out of the expectation because it does not depend on \( \boldsymbol{h} \).
+Negating both sides, rearranging, and contracting part of \( E_{\boldsymbol{h}\sim Q} \) into a KL-divergence term yields:
</p>
$$
-\log P(X) - \mathcal{D}\left[Q(z)\|P(z|X)\right]=E_{z\sim Q}\left[\log P(X|z) \right] - \mathcal{D}\left[Q(z)\|P(z)\right].
+\log p(\boldsymbol{x}) - \mathcal{D}\left[Q(\boldsymbol{h})\|p(\boldsymbol{h}\vert \boldsymbol{x})\right]=E_{\boldsymbol{h}\sim Q}\left[\log p(\boldsymbol{x}\vert\boldsymbol{h}) \right] - \mathcal{D}\left[Q(\boldsymbol{h})\|p(\boldsymbol{h})\right].
$$
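A quick numerical check of this identity, added for illustration and not part of the changed files: the sketch below builds a small discrete latent-variable model in which the posterior \( p(\boldsymbol{h}\vert \boldsymbol{x}) \) can be computed exactly by Bayes rule, picks an arbitrary \( Q \), and confirms that the two sides agree. All distributions are randomly generated for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Discrete toy model: latent h in {0,...,K-1}, observation x in {0,...,M-1}.
K, M = 4, 3
p_h = rng.dirichlet(np.ones(K))                  # prior p(h)
p_x_given_h = rng.dirichlet(np.ones(M), size=K)  # likelihood p(x|h), one row per h
Q = rng.dirichlet(np.ones(K))                    # an arbitrary distribution Q(h)

x = 1                                            # a fixed observation
p_x = np.sum(p_h * p_x_given_h[:, x])            # marginal p(x)
p_h_given_x = p_h * p_x_given_h[:, x] / p_x      # posterior via Bayes rule

def kl(q, p):
    return np.sum(q * np.log(q / p))

lhs = np.log(p_x) - kl(Q, p_h_given_x)
rhs = np.sum(Q * np.log(p_x_given_h[:, x])) - kl(Q, p_h)
print(lhs, rhs)   # identical up to floating-point noise
```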


@@ -786,37 +789,37 @@ <h2 id="rearraning" class="anchor">Rearranging </h2>

<p>Using Bayes rule we obtain</p>
$$
-E_{z\sim Q}\left[\log P(Y_i|z,X_i)\right]=E_{z\sim Q}\left[\log P(z|Y_i,X_i) - \log P(z|X_i) + \log P(Y_i|X_i) \right]
+E_{\boldsymbol{h}\sim Q}\left[\log p(y_i|\boldsymbol{h},x_i)\right]=E_{\boldsymbol{h}\sim Q}\left[\log p(\boldsymbol{h}|y_i,x_i) - \log p(\boldsymbol{h}|x_i) + \log p(y_i|x_i) \right]
$$

-<p>Rearranging the terms and subtracting \( E_{z\sim Q}\log Q(z) \) from both sides gives</p>
+<p>Rearranging the terms and subtracting \( E_{\boldsymbol{h}\sim Q}\log Q(\boldsymbol{h}) \) from both sides gives</p>
$$
\begin{array}{c}
-\log P(Y_i|X_i) - E_{z\sim Q}\left[\log Q(z)-\log P(z|X_i,Y_i)\right]=\hspace{10em}\\
-\hspace{10em}E_{z\sim Q}\left[\log P(Y_i|z,X_i)+\log P(z|X_i)-\log Q(z)\right]
+\log p(y_i|x_i) - E_{\boldsymbol{h}\sim Q}\left[\log Q(\boldsymbol{h})-\log p(\boldsymbol{h}|x_i,y_i)\right]=\hspace{10em}\\
+\hspace{10em}E_{\boldsymbol{h}\sim Q}\left[\log p(y_i|\boldsymbol{h},x_i)+\log p(\boldsymbol{h}|x_i)-\log Q(\boldsymbol{h})\right]
\end{array}
$$

-<p>Note that \( X \) is fixed, and \( Q \) can be \textit{any} distribution, not
-just a distribution which does a good job mapping \( X \) to the \( z \)'s
+<p>Note that \( \boldsymbol{x} \) is fixed, and \( Q \) can be <em>any</em> distribution, not
+just a distribution which does a good job mapping \( \boldsymbol{x} \) to the \( \boldsymbol{h} \)'s
that can produce \( \boldsymbol{x} \).
</p>
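The rearrangement above is just Bayes rule applied inside the expectation. The following small sketch, illustrative only, with randomly generated discrete distributions standing in for \( p(\boldsymbol{h}|x_i) \), \( p(y_i|\boldsymbol{h},x_i) \) and \( Q \), verifies it numerically.

```python
import numpy as np

rng = np.random.default_rng(2)

# Discrete toy: for a fixed input x_i, latent h in {0,...,K-1}, target y in {0,...,M-1}.
K, M = 5, 3
p_h_given_x = rng.dirichlet(np.ones(K))            # p(h|x_i)
p_y_given_hx = rng.dirichlet(np.ones(M), size=K)   # p(y|h,x_i), one row per h
Q = rng.dirichlet(np.ones(K))                      # any Q(h)

y = 0
p_y_given_x = np.sum(p_h_given_x * p_y_given_hx[:, y])         # p(y_i|x_i)
p_h_given_xy = p_h_given_x * p_y_given_hx[:, y] / p_y_given_x  # p(h|y_i,x_i) by Bayes rule

lhs = np.sum(Q * np.log(p_y_given_hx[:, y]))
rhs = np.sum(Q * (np.log(p_h_given_xy) - np.log(p_h_given_x) + np.log(p_y_given_x)))
print(lhs, rhs)   # equal: log p(y|h,x) = log p(h|y,x) - log p(h|x) + log p(y|x)
```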

<!-- !split -->
<h2 id="inferring-the-probability" class="anchor">Inferring the probability </h2>

-<p>Since we are interested in inferring \( P(X) \), it makes sense to
-construct a \( Q \) which \textit{does} depend on \( X \), and in particular,
-one which makes \( \mathcal{D}\left[Q(z)\|P(z|X)\right] \) small
+<p>Since we are interested in inferring \( p(\boldsymbol{x}) \), it makes sense to
+construct a \( Q \) which <em>does</em> depend on \( \boldsymbol{x} \), and in particular,
+one which makes \( \mathcal{D}\left[Q(\boldsymbol{h})\|p(\boldsymbol{h}|\boldsymbol{x})\right] \) small
</p>
$$
-\log P(X) - \mathcal{D}\left[Q(z|X)\|P(z|X)\right]=E_{z\sim Q}\left[\log P(X|z) \right] - \mathcal{D}\left[Q(z|X)\|P(z)\right].
+\log p(\boldsymbol{x}) - \mathcal{D}\left[Q(\boldsymbol{h}|\boldsymbol{x})\|p(\boldsymbol{h}|\boldsymbol{x})\right]=E_{\boldsymbol{h}\sim Q}\left[\log p(\boldsymbol{x}|\boldsymbol{h}) \right] - \mathcal{D}\left[Q(\boldsymbol{h}|\boldsymbol{x})\|p(\boldsymbol{h})\right].
$$

<p>Hence, during training, it makes sense to choose a \( Q \) which will make
-\( E_{z\sim Q}[\log Q(z)- \) $\log P(z|X_i,Y_i)]$ (a
+\( E_{\boldsymbol{h}\sim Q}[\log Q(\boldsymbol{h})- \log p(\boldsymbol{h}|x_i,y_i)] \) (a
\( \mathcal{D} \)-divergence) small, such that the right hand side is a
-close approximation to \( \log P(Y_i|X_i) \).
+close approximation to \( \log p(y_i|x_i) \).
</p>

<!-- !split -->
@@ -827,16 +830,16 @@ <h2 id="central-equation-of-vaes" class="anchor">Central equation of VAEs </h2>
</p>

<ol>
-<li> The left hand side has the quantity we want to maximize, namely \( \log P(X) \) plus an error term.</li>
+<li> The left hand side has the quantity we want to maximize, namely \( \log p(\boldsymbol{x}) \) plus an error term.</li>
<li> The right hand side is something we can optimize via stochastic gradient descent given the right choice of \( Q \).</li>
</ol>
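As an illustration of what "optimize via stochastic gradient descent" amounts to in point 2 above, here is a minimal NumPy sketch of the per-example negative of the right-hand side (reconstruction term plus KL term). It is not taken from the lecture source; it assumes a diagonal-Gaussian \( Q(\boldsymbol{h}|\boldsymbol{x}) \) and a unit-variance Gaussian decoder, the decoder and all numbers are placeholders, and a real implementation would compute gradients with an automatic-differentiation framework.

```python
import numpy as np

rng = np.random.default_rng(3)

def negative_elbo(x, mu, log_var, decode, n_samples=1):
    """Monte Carlo estimate of -(E_{h~Q(h|x)}[log p(x|h)] - KL[Q(h|x)||p(h)]).

    x           : observed vector
    mu, log_var : encoder outputs defining Q(h|x) = N(mu, diag(exp(log_var)))
    decode      : function h -> mean of a unit-variance Gaussian decoder p(x|h)
    """
    std = np.exp(0.5 * log_var)
    recon = 0.0
    for _ in range(n_samples):
        eps = rng.standard_normal(mu.shape)
        h = mu + std * eps                      # reparameterization trick
        x_hat = decode(h)
        # full log-density of a unit-variance Gaussian decoder, log N(x | x_hat, I)
        recon += -0.5 * np.sum((x - x_hat) ** 2 + np.log(2 * np.pi))
    recon /= n_samples
    # KL[N(mu, diag(exp(log_var))) || N(0, I)] in closed form
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
    return -(recon - kl)

# Tiny usage example with a fixed linear "decoder"
W = rng.standard_normal((4, 2))
x = rng.standard_normal(4)
mu, log_var = np.zeros(2), np.zeros(2)
print(negative_elbo(x, mu, log_var, lambda h: W @ h, n_samples=10))
```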
<!-- !split -->
<h2 id="setting-up-sgd" class="anchor">Setting up SGD </h2>
<p>So how can we perform stochastic gradient descent?</p>

-<p>First we need to be a bit more specific about the form that \( Q(z|X) \)
+<p>First we need to be a bit more specific about the form that \( Q(\boldsymbol{h}|\boldsymbol{x}) \)
will take. The usual choice is to say that
-\( Q(z|X)=\mathcal{N}(z|\mu(X;\vartheta),\Sigma(X;\vartheta)) \), where
+\( Q(\boldsymbol{h}|\boldsymbol{x})=\mathcal{N}(\boldsymbol{h}|\mu(\boldsymbol{x};\vartheta),\Sigma(\boldsymbol{x};\vartheta)) \), where
\( \mu \) and \( \Sigma \) are arbitrary deterministic functions with
parameters \( \vartheta \) that can be learned from data (we will omit
\( \vartheta \) in later equations). In practice, \( \mu \) and \( \Sigma \) are
@@ -848,10 +851,10 @@ <h2 id="more-on-the-sgd" class="anchor">More on the SGD </h2>
<h2 id="more-on-the-sgd" class="anchor">More on the SGD </h2>

<p>The name variational &quot;autoencoder&quot; comes from
-the fact that \( \mu \) and \( \Sigma \) are &quot;encoding&quot; \( X \) into the latent
-space \( z \). The advantages of this choice are computational, as they
+the fact that \( \mu \) and \( \Sigma \) are &quot;encoding&quot; \( \boldsymbol{x} \) into the latent
+space \( \boldsymbol{h} \). The advantages of this choice are computational, as they
make it clear how to compute the right hand side. The last
-term---\( \mathcal{D}\left[Q(z|X)\|P(z)\right] \)---is now a KL-divergence
+term---\( \mathcal{D}\left[Q(\boldsymbol{h}|\boldsymbol{x})\|p(\boldsymbol{h})\right] \)---is now a KL-divergence
between two multivariate Gaussian distributions, which can be computed
in closed form as:
</p>
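For the case used here, \( Q(\boldsymbol{h}|\boldsymbol{x})=\mathcal{N}(\mu,\Sigma) \) against the prior \( p(\boldsymbol{h})=\mathcal{N}(0,I) \), the standard closed form is \( \frac{1}{2}\left(\mathrm{tr}(\Sigma)+\mu^T\mu-k-\log\det\Sigma\right) \), with \( k \) the latent dimension. The sketch below, added for illustration with a randomly generated covariance, evaluates it and cross-checks it against a Monte Carlo estimate of \( E_{\boldsymbol{h}\sim Q}[\log Q(\boldsymbol{h})-\log p(\boldsymbol{h})] \).

```python
import numpy as np

rng = np.random.default_rng(4)

# KL[N(mu, Sigma) || N(0, I)] = 0.5 * ( tr(Sigma) + mu^T mu - k - log det(Sigma) )
def kl_gauss_vs_standard(mu, Sigma):
    k = mu.size
    _, logdet = np.linalg.slogdet(Sigma)
    return 0.5 * (np.trace(Sigma) + mu @ mu - k - logdet)

# Cross-check against a Monte Carlo estimate of E_Q[log Q(h) - log p(h)]
k = 3
A = rng.standard_normal((k, k))
Sigma = A @ A.T + 0.1 * np.eye(k)     # a random symmetric positive-definite covariance
mu = rng.standard_normal(k)

h = rng.multivariate_normal(mu, Sigma, size=200_000)
diff = h - mu
log_q = -0.5 * (np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)
                + np.linalg.slogdet(Sigma)[1] + k * np.log(2 * np.pi))
log_p = -0.5 * (np.sum(h**2, axis=1) + k * np.log(2 * np.pi))

print(kl_gauss_vs_standard(mu, Sigma))   # closed form
print(np.mean(log_q - log_p))            # Monte Carlo, agrees to a couple of decimals
```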
49 changes: 26 additions & 23 deletions doc/pub/week13/html/week13-reveal.html