You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Understanding gradient descent and backpropagation. We will go through four different neural network examples and explicitly compute the backpropagation equations.
true
Introduction
When datasets are large and high-dimensional, it is computationally very expensive (sometimes impossible!) to find an analytical solution for the optimal parameters of your network. Instead, we use optimization methods. A vanilla optimization approach would be to sample different combinations of parameters and choose the one with the lowest loss value.
Is this a good idea?
Would it be possible to extract another piece of information to direct our search towards the optimal parameters?
{% include image.html description="The trajectory through a loss landscape for a linear regression model trained with gradient descent. The red dot indicates the value of the loss function corresponding to the initial parameter values." link="https://rawgit.com/danielkunin/Deeplearning-Visualizations/master/optimization/index.html" image="section/3/optimization.png" caption="true"%}
This is exactly what gradient descent does! Apart from the loss value, gradient descent computes the local gradient of the loss when evaluating potential parameters. This information is used to decide which direction the search should go to find better parameter values. This extra piece of information (the local gradient) can be computed relatively easily using backpropagation. This recursive algorithm breaks up complex derivatives into smaller parts through the chain rule.
To help understand gradient descent, let's visualize the setup.
Univariate Regression
Let’s consider a linear regression. You have a data set $$(x,y)$$ with $$m$$ examples. In other words, $$x = (x_1,\dots , x_m)$$ and $$y = (y_1, \dots, y_m)$$ are row vectors of $$m$$ scalar examples . The goal is to find the scalar parameters $$w$$ and $$b$$ such that the line $$y = wx + b$$ optimally fits the data. This can be achieved using gradient descent.
{% include image.html description="Univariate regression." link="" image="section/3/1/model.png" caption="false" %}
Forward Propagation
The first step of gradient descent is to compute the loss. To do this, define your model’s output and loss function. In this regression setting, we use the mean squared error loss.
The next step is to compute the local gradient of the loss with respect to the parameters (i.e. $$w$$ and $$b$$). This means you need to calculate derivatives. Note that values stored during the forward propagation are used in the gradient equations.
Now, consider the case where $$X$$ is a matrix of shape $$(n,m)$$ and $$y$$ is still a row vector of shape $$(1,m)$$. Instead of a single scalar value, the weights will be a vector (one element per feature) of shape $$(1,n)$$. The bias parameter is still a scalar.
{% include image.html description="Multivariate regression." link="" image="section/3/2/model.png" caption="false"%}
Consider stacking two linear layers together. You can introduce a hidden variable $$Z$$ of shape $$(k, m)$$, which is the output of the first linear layer. The first layer is parameterized by a weight matrix $$W_1$$ of shape $$(k,n)$$ and bias $$b_1$$ of shape $$(k, 1)$$ broadcasted to $$(k, m)$$. The second layer will be the same as in the multivariate regression case, but its input will be $$Z$$ instead of $$X$$.
{% include image.html description="" link="" image="section/3/3/model.png" caption="false"%}
Forward Propagation
$$\begin{align}
Z &= W_1 X + b_1\\\
\hat{y} &= w_2Z + b_2\\\
\mathcal{L} &= \frac{1}{m}||\hat{y} - y||^2
\end{align}$$
In this example, before sending $$Z$$ as the input to the second layer, you will pass it through the sigmoid function. The output is denoted $$A$$ and is the input of the second layer.
{% include image.html description="" link="" image="section/3/4/model.png" caption="false"%}
Forward Propagation
$$\begin{align}
Z &= W_1 X + b_1\\\
A &= \sigma(Z)\\\
\hat{y} &= w_2A + b_2\\\
\mathcal{L} &= \frac{1}{m}||\hat{y} - y||^2
\end{align}$$
Detailed visualizations of the backprop calculation for logistic regression and a 2-layer perceptron,
An overview of common optimization techniques: mini-batch GD, momentum, RMSprop, and Adam.
Exercises
We also put together a sheet of exercises (and their solutions) to help you test your understanding of gradient descent and backpropogation, as well as provide useful practice for the exam.