
Commit

updating slides
mhjensen committed Jan 15, 2024
1 parent b8ab920 commit 6cd2e8f
Showing 8 changed files with 516 additions and 851 deletions.
187 changes: 59 additions & 128 deletions doc/pub/week1/html/week1-bs.html

Large diffs are not rendered by default.

148 changes: 41 additions & 107 deletions doc/pub/week1/html/week1-reveal.html
@@ -1481,24 +1481,21 @@ <h2 id="setting-up-the-equations-for-a-neural-network">Setting up the equations

<p>&nbsp;<br>
$$
{\cal C}(\hat{W}) = \frac{1}{2}\sum_{i=1}^n\left(y_i - t_i\right)^2,
{\cal C}(\boldsymbol{\Theta}) = \frac{1}{2}\sum_{i=1}^n\left(y_i - \tilde{y}_i\right)^2,
$$
<p>&nbsp;<br>

<p>where the $t_i$s are our \( n \) targets (the values we want to
<p>where the $y_i$s are our \( n \) targets (the values we want to
reproduce), while the outputs of the network after having propagated
all inputs \( \hat{x} \) are given by \( y_i \). Below we will demonstrate
how the basic equations arising from the back propagation algorithm
can be modified in order to study classification problems with \( K \)
classes.
all inputs \( \boldsymbol{x} \) are given by \( \tilde{y}_i \).
</p>
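<p>As a small sketch in code (not part of the original slides), the cost function above can be evaluated directly with NumPy; the function name <code>quadratic_cost</code> and the sample numbers are illustrative only:</p>
<pre><code>import numpy as np

def quadratic_cost(y_tilde, y):
    # C(Theta) = 1/2 sum_i (ytilde_i - y_i)^2
    return 0.5 * np.sum((y_tilde - y) ** 2)

y = np.array([0.0, 1.0, 1.0])        # targets y_i
y_tilde = np.array([0.1, 0.8, 0.9])  # network outputs ytilde_i
print(quadratic_cost(y_tilde, y))    # approximately 0.03
</code></pre>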
</section>

<section>
<h2 id="definitions">Definitions </h2>

<p>With our definition of the targets \( \hat{t} \), the outputs of the
network \( \hat{y} \) and the inputs \( \hat{x} \) we
<p>With our definition of the targets \( \boldsymbol{y} \), the outputs of the
network \( \boldsymbol{\tilde{y}} \) and the inputs \( \boldsymbol{x} \) we
define now the activation \( z_j^l \) of node/neuron/unit \( j \) of the
\( l \)-th layer as a function of the bias, the weights which add up from
the previous layer \( l-1 \) and the forward passes/outputs
@@ -1522,9 +1519,13 @@ <h2 id="definitions">Definitions </h2>
\hat{z}^l = \left(\hat{W}^l\right)^T\hat{a}^{l-1}+\hat{b}^l.
$$
<p>&nbsp;<br>
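<p>A minimal sketch of this matrix-vector relation, assuming \( \hat{W}^l \) is stored with shape (nodes in layer \( l-1 \), nodes in layer \( l \)); the function name is ours:</p>
<pre><code>import numpy as np

def layer_activation_input(W_l, b_l, a_prev):
    # z^l = (W^l)^T a^{l-1} + b^l
    return W_l.T @ a_prev + b_l
</code></pre>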
</section>

<section>
<h2 id="inputs-to-tje-activation-function">Inputs to tje activation function </h2>

<p>With the activation values \( \hat{z}^l \) we can in turn define the
output of layer \( l \) as \( \hat{a}^l = f(\hat{z}^l) \) where \( f \) is our
<p>With the activation values \( \boldsymbol{z}^l \) we can in turn define the
output of layer \( l \) as \( \boldsymbol{a}^l = f(\boldsymbol{z}^l) \) where \( f \) is our
activation function. In the examples here we will use the sigmoid
function discussed in our logistic regression lectures. We will also use the same activation function \( f \) for all layers
and their nodes. It means we have
@@ -1570,22 +1571,22 @@ <h2 id="derivative-of-the-cost-function">Derivative of the cost function </h2>
<p>Let us specialize to the output layer \( l=L \). Our cost function is</p>
<p>&nbsp;<br>
$$
{\cal C}(\hat{W^L}) = \frac{1}{2}\sum_{i=1}^n\left(y_i - t_i\right)^2=\frac{1}{2}\sum_{i=1}^n\left(a_i^L - t_i\right)^2,
{\cal C}(\boldsymbol{\Theta}^L) = \frac{1}{2}\sum_{i=1}^n\left(y_i - \tilde{y}_i\right)^2=\frac{1}{2}\sum_{i=1}^n\left(a_i^L - y_i\right)^2,
$$
<p>&nbsp;<br>

<p>The derivative of this function with respect to the weights is</p>

<p>&nbsp;<br>
$$
\frac{\partial{\cal C}(\hat{W^L})}{\partial w_{jk}^L} = \left(a_j^L - t_j\right)\frac{\partial a_j^L}{\partial w_{jk}^{L}},
\frac{\partial{\cal C}(\boldsymbol{\Theta}^L)}{\partial w_{jk}^L} = \left(a_j^L - y_j\right)\frac{\partial a_j^L}{\partial w_{jk}^{L}},
$$
<p>&nbsp;<br>

<p>The last partial derivative can easily be computed and reads (by applying the chain rule)</p>
<p>&nbsp;<br>
$$
\frac{\partial a_j^L}{\partial w_{jk}^{L}} = \frac{\partial a_j^L}{\partial z_{j}^{L}}\frac{\partial z_j^L}{\partial w_{jk}^{L}}=a_j^L(1-a_j^L)a_k^{L-1},
\frac{\partial a_j^L}{\partial w_{jk}^{L}} = \frac{\partial a_j^L}{\partial z_{j}^{L}}\frac{\partial z_j^L}{\partial w_{jk}^{L}}=a_j^L(1-a_j^L)a_k^{L-1}.
$$
<p>&nbsp;<br>
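<p>Since the slides use the sigmoid, the factor \( a_j^L(1-a_j^L) \) is just \( f'(z_j^L) \) expressed through the output itself. A small illustrative sketch (function names are ours):</p>
<pre><code>import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_aL_d_wL(a_L, a_prev):
    # matrix with entries da_j^L/dw_jk^L = a_j^L (1 - a_j^L) a_k^{L-1}
    return np.outer(a_L * (1.0 - a_L), a_prev)
</code></pre>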
</section>
@@ -1596,23 +1597,27 @@ <h2 id="bringing-it-together-first-back-propagation-equation">Bringing it togeth
<p>We have thus</p>
<p>&nbsp;<br>
$$
\frac{\partial{\cal C}(\hat{W^L})}{\partial w_{jk}^L} = \left(a_j^L - t_j\right)a_j^L(1-a_j^L)a_k^{L-1},
\frac{\partial{\cal C}(\boldsymbol{\Theta}^L)}{\partial w_{jk}^L} = \left(a_j^L - y_j\right)a_j^L(1-a_j^L)a_k^{L-1},
$$
<p>&nbsp;<br>

<p>Defining</p>
<p>&nbsp;<br>
$$
\delta_j^L = a_j^L(1-a_j^L)\left(a_j^L - t_j\right) = f'(z_j^L)\frac{\partial {\cal C}}{\partial (a_j^L)},
\delta_j^L = a_j^L(1-a_j^L)\left(a_j^L - y_j\right) = f'(z_j^L)\frac{\partial {\cal C}}{\partial (a_j^L)},
$$
<p>&nbsp;<br>

<p>and using the Hadamard product of two vectors we can write this as</p>
<p>&nbsp;<br>
$$
\hat{\delta}^L = f'(\hat{z}^L)\circ\frac{\partial {\cal C}}{\partial (\hat{a}^L)}.
\boldsymbol{\delta}^L = f'(\boldsymbol{z}^L)\circ\frac{\partial {\cal C}}{\partial (\boldsymbol{a}^L)}.
$$
<p>&nbsp;<br>
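<p>In code the Hadamard product is plain elementwise multiplication; a sketch for the quadratic cost and sigmoid activation used here (the function name is ours):</p>
<pre><code>import numpy as np

def output_error(a_L, y):
    # delta^L = f'(z^L) o dC/da^L, with f'(z^L) = a^L (1 - a^L)
    # and dC/da^L = (a^L - y) for the quadratic cost
    return a_L * (1.0 - a_L) * (a_L - y)
</code></pre>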
</section>

<section>
<h2 id="analyzing-the-last-results">Analyzing the last results </h2>

<p>This is an important expression. The second term on the right-hand side
measures how fast the cost function is changing as a function of the $j$th
@@ -1645,7 +1650,7 @@ <h2 id="more-considerations">More considerations </h2>
<p>With the definition of \( \delta_j^L \) we have a more compact definition of the derivative of the cost function in terms of the weights, namely</p>
<p>&nbsp;<br>
$$
\frac{\partial{\cal C}(\hat{W^L})}{\partial w_{jk}^L} = \delta_j^La_k^{L-1}.
\frac{\partial{\cal C}}{\partial w_{jk}^L} = \delta_j^La_k^{L-1}.
$$
<p>&nbsp;<br>
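<p>Collecting all indices \( j,k \) at once, this derivative is an outer product; a sketch (shapes as assumed above, the function name is ours):</p>
<pre><code>import numpy as np

def weight_gradient(delta_l, a_prev):
    # matrix with entries dC/dw_jk^l = delta_j^l a_k^{l-1}, indexed as [j, k]
    return np.outer(delta_l, a_prev)
</code></pre>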
</section>
@@ -1676,10 +1681,6 @@ <h2 id="bringing-it-together">Bringing it together </h2>

<p>We have now three equations that are essential for the computations of the derivatives of the cost function at the output layer. These equations are needed to start the algorithm and they are</p>

<div class="alert alert-block alert-block alert-text-normal">
<b>The starting equations</b>
<p>

<p>&nbsp;<br>
$$
\begin{equation}
@@ -1709,7 +1710,6 @@ <h2 id="bringing-it-together">Bringing it together </h2>
\end{equation}
$$
<p>&nbsp;<br>
</div>
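<p>A compact sketch of these output-layer quantities for the quadratic cost and sigmoid activation; the bias relation \( \partial{\cal C}/\partial b_j^L = \delta_j^L \) is the standard one and is assumed here, and the function name is ours:</p>
<pre><code>import numpy as np

def output_layer_gradients(a_L, a_prev, y):
    delta_L = a_L * (1.0 - a_L) * (a_L - y)  # delta_j^L = f'(z_j^L) dC/da_j^L
    dC_dW_L = np.outer(delta_L, a_prev)      # dC/dw_jk^L = delta_j^L a_k^{L-1}
    dC_db_L = delta_L                        # dC/db_j^L  = delta_j^L (assumed standard form)
    return delta_L, dC_dW_L, dC_db_L
</code></pre>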
</section>

<section>
@@ -1722,8 +1722,13 @@ <h2 id="final-back-propagating-equation">Final back propagating equation </h2>
$$
<p>&nbsp;<br>

<p>We want to express this in terms of the equations for layer \( l+1 \). Using the chain rule and summing over all \( k \) entries we have</p>
<p>We want to express this in terms of the equations for layer \( l+1 \).</p>
</section>

<section>
<h2 id="using-the-chain-rule-and-summing-over-all-k-entries">Using the chain rule and summing over all \( k \) entries </h2>

<p>We obtain</p>
<p>&nbsp;<br>
$$
\delta_j^l =\sum_k \frac{\partial {\cal C}}{\partial z_k^{l+1}}\frac{\partial z_k^{l+1}}{\partial z_j^{l}}=\sum_k \delta_k^{l+1}\frac{\partial z_k^{l+1}}{\partial z_j^{l}},
@@ -1750,65 +1755,48 @@ <h2 id="final-back-propagating-equation">Final back propagating equation </h2>
</section>
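<p>With the storage convention \( \hat{z}^{l+1} = (\hat{W}^{l+1})^T\hat{a}^{l}+\hat{b}^{l+1} \), the resulting sum over \( k \) (written out in the algorithm below) becomes a matrix-vector product; a sketch with the sigmoid derivative expressed through \( a^l \):</p>
<pre><code>import numpy as np

def backpropagate_error(delta_next, W_next, a_l):
    # delta_j^l = sum_k delta_k^{l+1} w_kj^{l+1} f'(z_j^l), with f'(z^l) = a^l (1 - a^l)
    return (W_next @ delta_next) * a_l * (1.0 - a_l)
</code></pre>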

<section>
<h2 id="setting-up-the-back-propagation-algorithm">Setting up the Back propagation algorithm </h2>
<h2 id="setting-up-the-back-propagation-algorithm">Setting up the back propagation algorithm </h2>

<p>The four equations provide us with a way of computing the gradient of the cost function. Let us write this out in the form of an algorithm.</p>

<div class="alert alert-block alert-block alert-text-normal">
<b></b>
<p>
<p>First, we set up the input data \( \hat{x} \) and the activations
<p><b>First</b>, we set up the input data \( \hat{x} \) and the activations
\( \hat{z}^1 \) of the input layer and compute the activation function and
the pertinent outputs \( \hat{a}^1 \).
</p>
</div>


<div class="alert alert-block alert-block alert-text-normal">
<b></b>
<p>
<p>Secondly, we perform then the feed forward till we reach the output
<p><b>Secondly</b>, we then perform the feed forward until we reach the output
layer: for each layer \( l=2,3,\dots,L \) we compute the activations \( \hat{z}^l \), apply the
activation function and obtain the pertinent outputs \( \hat{a}^l \).
</p>
</div>
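<p>A sketch of these first two steps as a single feed-forward pass; the function names and the list layout are ours, and the weight matrices are assumed stored as above so that \( \hat{z}^l = (\hat{W}^l)^T\hat{a}^{l-1}+\hat{b}^l \):</p>
<pre><code>import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feed_forward(x, weights, biases):
    a = x
    activations = [a]   # a^1 is the input
    zs = []             # z^2, ..., z^L
    for W, b in zip(weights, biases):
        z = W.T @ a + b
        a = sigmoid(z)
        zs.append(z)
        activations.append(a)
    return zs, activations
</code></pre>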
</section>

<section>
<h2 id="setting-up-the-back-propagation-algorithm-part-2">Setting up the Back propagation algorithm, part 2 </h2>
<h2 id="setting-up-the-back-propagation-algorithm-part-2">Setting up the back propagation algorithm, part 2 </h2>

<div class="alert alert-block alert-block alert-text-normal">
<b></b>
<p>
<p>Thereafter we compute the output error \( \hat{\delta}^L \) by computing all</p>
<p>&nbsp;<br>
$$
\delta_j^L = f'(z_j^L)\frac{\partial {\cal C}}{\partial (a_j^L)}.
$$
<p>&nbsp;<br>
</div>


<div class="alert alert-block alert-block alert-text-normal">
<b></b>
<p>
<p>Then we compute the back-propagated error for each \( l=L-1,L-2,\dots,2 \) as</p>
<p>&nbsp;<br>
$$
\delta_j^l = \sum_k \delta_k^{l+1}w_{kj}^{l+1}f'(z_j^l).
$$
<p>&nbsp;<br>
</div>
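<p>These two steps can be combined into one backward sweep over the layers; a sketch that reuses the quantities returned by the feed-forward sketch above (quadratic cost and sigmoid assumed):</p>
<pre><code>def back_propagate(zs, activations, weights, y):
    # one error vector delta^l per weight layer, l = 2, ..., L
    deltas = [None] * len(weights)
    a_L = activations[-1]
    deltas[-1] = a_L * (1.0 - a_L) * (a_L - y)   # output error delta^L
    for l in range(len(weights) - 2, -1, -1):    # back propagate through l = L-1, ..., 2
        a_l = activations[l + 1]                 # output of the layer whose error we need
        deltas[l] = (weights[l + 1] @ deltas[l + 1]) * a_l * (1.0 - a_l)
    return deltas
</code></pre>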
</section>

<section>
<h2 id="setting-up-the-back-propagation-algorithm-part-3">Setting up the Back propagation algorithm, part 3 </h2>

<div class="alert alert-block alert-block alert-text-normal">
<b></b>
<p>
<p>Finally, we update the weights and the biases using gradient descent for each \( l=L-1,L-2,\dots,2 \) and update the weights and biases according to the rules</p>
<p>Finally, we update the weights and the biases using gradient descent,
for each \( l=L-1,L-2,\dots,1 \),
according to the rules
</p>

<p>&nbsp;<br>
$$
w_{jk}^l \leftarrow w_{jk}^l - \eta \delta_j^l a_k^{l-1},
@@ -1820,70 +1808,21 @@ <h2 id="setting-up-the-back-propagation-algorithm-part-3">Setting up the Back pr
b_j^l \leftarrow b_j^l-\eta \frac{\partial {\cal C}}{\partial b_j^l}=b_j^l-\eta \delta_j^l,
$$
<p>&nbsp;<br>
</div>

<p>The parameter \( \eta \) is the learning rate discussed in connection with the gradient descent methods.
Here it is convenient to use stochastic gradient descent (see the examples below) with mini-batches, with an outer loop that steps through multiple epochs of training.
</p>
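<p>A sketch of this update step, together with the outer loop over epochs and mini-batches mentioned above; <code>update_parameters</code> is our name and <code>random_mini_batches</code> is a hypothetical helper, not something defined in the slides:</p>
<pre><code>import numpy as np

def update_parameters(weights, biases, deltas, activations, eta):
    for l in range(len(weights)):
        a_prev = activations[l]                          # a^{l-1} feeding this weight layer
        weights[l] -= eta * np.outer(a_prev, deltas[l])  # w_jk^l <- w_jk^l - eta delta_j^l a_k^{l-1}
        biases[l]  -= eta * deltas[l]                    # b_j^l  <- b_j^l  - eta delta_j^l

# Schematic outer loop for stochastic gradient descent (in practice one would
# average the gradients over each mini-batch before updating):
# for epoch in range(n_epochs):
#     for x, y in random_mini_batches(training_data):
#         zs, activations = feed_forward(x, weights, biases)
#         deltas = back_propagate(zs, activations, weights, y)
#         update_parameters(weights, biases, deltas, activations, eta)
</code></pre>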
</section>

<section>
<h2 id="setting-up-the-back-propagation-algorithm-final-considerations">Setting up the Back propagation algorithm, final considerations </h2>

<p>The four equations above provide us with a way of computing the gradient of the cost function. Let us write this out in the form of an algorithm.</p>

<div class="alert alert-block alert-block alert-text-normal">
<b></b>
<p>
<p>First, we set up the input data \( \boldsymbol{x} \) and the activations
\( \boldsymbol{z}_1 \) of the input layer and compute the activation function and
the pertinent outputs \( \boldsymbol{a}^1 \).
</p>
</div>


<div class="alert alert-block alert-block alert-text-normal">
<b></b>
<p>
<p>Secondly, we perform then the feed forward till we reach the output
layer and compute all \( \boldsymbol{z}_l \) of the input layer and compute the
activation function and the pertinent outputs \( \boldsymbol{a}^l \) for
\( l=2,3,\dots,L \).
</p>
</div>


<div class="alert alert-block alert-block alert-text-normal">
<b></b>
<p>
<p>Thereafter we compute the output error \( \boldsymbol{\delta}^L \) by computing all</p>
<p>&nbsp;<br>
$$
\delta_j^L = f'(z_j^L)\frac{\partial {\cal C}}{\partial (a_j^L)}.
$$
<p>&nbsp;<br>
</div>
<p>with \( \eta \) being the learning rate.</p>
</section>

<section>
<h2 id="updating-the-gradients">Updating the gradients </h2>

<div class="alert alert-block alert-block alert-text-normal">
<b></b>
<p>
<p>Then we compute the back propagate error for each \( l=L-1,L-2,\dots,2 \) as</p>
<p>With the back-propagated error for each \( l=L-1,L-2,\dots,1 \) given by</p>
<p>&nbsp;<br>
$$
\delta_j^l = \sum_k \delta_k^{l+1}w_{kj}^{l+1}f'(z_j^l).
\delta_j^l = \sum_k \delta_k^{l+1}w_{kj}^{l+1}f'(z_j^l),
$$
<p>&nbsp;<br>
</div>


<div class="alert alert-block alert-block alert-text-normal">
<b></b>
<p>
<p>Finally, we update the weights and the biases using gradient descent for each \( l=L-1,L-2,\dots,2 \) and update the weights and biases according to the rules</p>
<p>we update the weights and the biases using gradient descent, for each \( l=L-1,L-2,\dots,1 \), according to the rules</p>
<p>&nbsp;<br>
$$
w_{jk}^l \leftarrow w_{jk}^l - \eta \delta_j^l a_k^{l-1},
@@ -1895,11 +1834,6 @@ <h2 id="updating-the-gradients">Updating the gradients </h2>
b_j^l \leftarrow b_j^l-\eta \frac{\partial {\cal C}}{\partial b_j^l}=b_j^l-\eta \delta_j^l,
$$
<p>&nbsp;<br>
</div>

<p>The parameter \( \eta \) is the learning parameter discussed in connection with the gradient descent methods.
Here it is convenient to use stochastic gradient descent (see the examples below) with mini-batches with an outer loop that steps through multiple epochs of training.
</p>
</section>


