A collection of notes on PyTorch Scholarship Challenge 2018/2019.
Contributions are always welcome!
- AMA
- Lesson 2: Introduction to Neural Networks
- Lectures
- Classification Problems
- Decision Boundary
- Perceptrons
- Why "Neural Networks"?
- Perceptrons as Logical Operators
- Perceptron Trick
- Perceptron Algorithm
- Non-Linear Regions
- Error Functions
- Log-loss Error Function
- Discrete vs Continuous
- Softmax
- One-Hot Encoding
- Maximum Likelihood
- Cross-Entropy
- Multi-Class Cross Entropy
- Logistic Regression
- Gradient Descent
- Feedforward
- Backpropagation
- Overfitting & Underfitting
- Early Stopping
- Regularization
- Dropout
- Local Minima
- Random Restart
- Momentum
- Quizzes
- Notebooks
- Lectures
- Lesson 3: Talking PyTorch with Soumith Chintala
- Lesson 4: Introduction to PyTorch
- Lesson 5: Convolutional Neural Networks
- Lectures
- Applications of CNNs
- Lesson Outline
- MNIST Dataset
- How Computers Interpret Images
- MLP (Multi Layer Perceptron) Structure & Class Scores
- Do Your Research
- Loss & Optimization
- Defining a Network in PyTorch
- Training the Network
- One Solution
- Model Validation
- Validation Loss
- Image Classification Steps
- MLPs vs CNNs
- Local Connectivity
- Filters and the Convolutional Layer
- Filters & Edges
- Frequency in Images
- High-pass Filters
- OpenCV & Creating Custom Filters
- Convolutional Layer
- Convolutional Layers (Part 2)
- Stride and Padding
- Pooling Layers
- Increasing Depth
- CNNs for Image Classification
- Convolutional Layers in PyTorch
- Feature Vector
- CIFAR Classification Example
- Image Augmentation
- Groundbreaking CNN Architectures
- Visualizing CNNs (Part 1)
- Visualizing CNNs (Part 2)
- Summary of CNNs
- Quizzes
- Notebooks
- Lectures
- Lesson 6: Style Transfer
- Lesson 7: Recurrent Neural Networks
- Lesson 8: Sentiment Prediction with RNNs
- Lesson 9: Deploying PyTorch Models
- Challenge Project
- Credits
The problem of identifying to which of a set of categories (sub-populations) a new observation belongs.
The separator between classes learned by a model in a binary or multi-class classification problem. For example, in the following image representing a binary classification problem, the decision boundary is the frontier between the blue class and the red class:
A system (either hardware or software) that takes in one or more input values, runs a function on the weighted sum of the inputs, and computes a single output value. In machine learning, the function is typically nonlinear, such as ReLU, sigmoid, or tanh.
In the following illustration, the perceptron takes n inputs, each of which is itself modified by a weight before entering the perceptron:
A perceptron that takes in n inputs, each multiplied by separate weights. The perceptron outputs a single value.
Perceptrons are the nodes in deep neural networks. That is, a deep neural network consists of multiple connected perceptrons, plus a backpropagation algorithm to introduce feedback.
- AND Perceptron
- OR Perceptron
- NOT Perceptron: unlike the other perceptrons we looked at, the NOT operation only cares about one input. The operation returns a 0 if the input is 1 and a 1 if it's a 0. The other inputs to the perceptron are ignored.
- XOR Perceptron
A function that provides probabilities for each possible class in a multi-class classification model. The probabilities add up to exactly 1.0. For example, softmax might determine that the probability of a particular image being a duck is 0.67, a beaver is 0.33, and a walrus is 0. (Also called full softmax.)
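As a quick illustration, here is a minimal softmax sketch in PyTorch (the duck/beaver/walrus scores are made up):

```python
import torch

def softmax(x):
    # Exponentiate (shifting by the max for numerical stability),
    # then normalize so the outputs sum to 1.0.
    e = torch.exp(x - x.max())
    return e / e.sum()

scores = torch.tensor([2.0, 1.3, -4.0])  # raw scores for duck, beaver, walrus
print(softmax(scores))                   # roughly 0.67, 0.33, 0.00
```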
A sparse vector in which:
- One element is set to 1.
- All other elements are set to 0.
One-hot encoding is commonly used to represent strings or identifiers that have a finite set of possible values. For example, suppose a given botany data set chronicles 15,000 different species, each denoted with a unique string identifier. As part of feature engineering, you'll probably encode those string identifiers as one-hot vectors in which the vector has a size of 15,000.
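A minimal sketch of one-hot encoding in PyTorch (the labels and class count are made up):

```python
import torch

labels = torch.tensor([0, 2, 1])  # three examples, three possible classes
one_hot = torch.eye(3)[labels]    # each row has a single 1 at the label's index
print(one_hot)
# tensor([[1., 0., 0.],
#         [0., 0., 1.],
#         [0., 1., 0.]])
```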
A generalization of Log Loss to multi-class classification problems. Cross-entropy quantifies the difference between two probability distributions.
A model that generates a probability for each possible discrete label value in classification problems by applying a sigmoid function to a linear prediction. Although logistic regression is often used in binary classification problems, it can also be used in multi-class classification problems (where it is called multi-class logistic regression or multinomial regression).
A technique to minimize loss by computing the gradients of loss with respect to the model's parameters, conditioned on training data. Informally, gradient descent iteratively adjusts parameters, gradually finding the best combination of weights and bias to minimize loss.
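A toy sketch of the idea using PyTorch's autograd (the quadratic loss is illustrative, not from the lesson):

```python
import torch

w = torch.tensor(0.0, requires_grad=True)  # a single "weight" to fit
lr = 0.1
for _ in range(50):
    loss = (w - 3) ** 2      # loss is minimized at w = 3
    loss.backward()          # compute d(loss)/dw
    with torch.no_grad():
        w -= lr * w.grad     # step against the gradient
    w.grad.zero_()           # clear the gradient for the next iteration
print(w.item())              # approaches 3.0
```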
The primary algorithm for performing gradient descent on neural networks. First, the output values of each node are calculated (and cached) in a forward pass. Then, the partial derivative of the error with respect to each parameter is calculated in a backward pass through the graph.
Creating a model that matches the training data so closely that the model fails to make correct predictions on new data.
Producing a model with poor predictive ability because the model hasn't captured the complexity of the training data. Many problems can cause underfitting, including:
- Training on the wrong set of features.
- Training for too few epochs or at too low a learning rate.
- Training with too high a regularization rate.
- Providing too few hidden layers in a deep neural network.
A method for regularization that involves ending model training before training loss finishes decreasing. In early stopping, you end model training when the loss on a validation data set starts to increase, that is, when generalization performance worsens.
The penalty on a model's complexity. Regularization helps prevent overfitting. Different kinds of regularization include:
- L1 regularization
- L2 regularization
- dropout regularization
- early stopping (this is not a formal regularization method, but can effectively limit overfitting)
A form of regularization useful in training neural networks. Dropout regularization works by removing a random selection of a fixed number of the units in a network layer for a single gradient step. The more units dropped out, the stronger the regularization. This is analogous to training the network to emulate an exponentially large ensemble of smaller networks.
A sophisticated gradient descent algorithm in which a learning step depends not only on the derivative in the current step, but also on the derivatives of the step(s) that immediately preceded it. Momentum involves computing an exponentially weighted moving average of the gradients over time, analogous to momentum in physics. Momentum sometimes prevents learning from getting stuck in local minima.
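In PyTorch this is just an argument to the SGD optimizer; a minimal sketch (the model and hyperparameter values are placeholders):

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)  # any model's parameters would do
# Each update blends the current gradient with an exponentially decaying
# average of past gradients (the momentum term).
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```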
- Soumith Chintala always wanted to be a visual effects artist, at least when he started his undergrad, but then he interned at a place and they said he was not good enough.
- He had been good at programming since he was a kid.
- He tried to find the next most magical thing, and that was computer vision.
- He had to find a professor in India doing this kind of stuff, which is really hard; there were just one or two, and he spent six months in that professor's lab.
- He started picking up some things, then went to CMU to try his hand at robotics, and finally landed at NYU in Yann LeCun's lab doing deep learning.
- Once he got to NYU, he worked on building up tooling.
- He worked on a project called EBLearn, which was like two generations earlier in terms of deep learning frameworks.
- Then came Torch, which was written by a few people.
- He started getting pretty active in helping people use Torch, and then in developing Torch itself.
- At some point they decided that they needed a new tool, because the field kept moving.
- He went about building PyTorch mostly because they had a really stressful project that was really large and hard to build.
- They started with just three people, and then got other people interested.
- About eight or nine people joined in part-time, just adding features, and then slowly and steadily they started giving access to other people.
- Every week they would give access to about ten people.
- And then in January they released PyTorch to the public.
- If you have a non-contiguous tensor and send it through a linear layer, it will just give you garbage.
- There is a trade-off where readability comes at the cost of being a little bit slow.
- It should be very imperative, very usable, very pythonic, but at the same time as fast as any other framework.
- The consequence of that is that large parts of PyTorch live in C++, except whatever is user-facing.
- You can attach your debugger, you can print; those parts are still very, very hackable.
- They gave it to a bunch of researchers and took rapid feedback from them, improving the product before it became mature, so the core design of PyTorch is very, very researcher friendly.
- PyTorch is designed with users and their feedback in mind.
- PyTorch, especially in its latest version, also adds features that make it easier to deploy models to production.
- PyTorch 1.0 is geared for production: you do research, but when you want a model to be production ready, you just add functional annotations to it, which are like one-liners on top of a function.
- They call the new programming model a hybrid front-end, because you can compile parts of your model, which gives you the best of both worlds.
- One paper, written by a single person, Andy Brock, was called SMASH; in it, one neural network generates the weights of other networks.
- Hierarchical story generation: you would say, "hey, I want a story of a boy swimming in a pond," and it would actually generate an interesting story with that plot.
- Openly available GitHub repositories; it's also very readable work, where you can look at something and clearly see: here are the inputs, here is how they are transformed, and here are the desired outputs.
- What users are wanting, especially being able to put models into production.
- When they're exploring new ideas, they don't want to see a 10x drop in performance.
- For online courses, they want more interactive tutorials, like ones based on Python notebooks.
- Some want widgets; they want first-class integration with Colab.
- He thinks of PyTorch as being a separate entity from Facebook; it definitely has its own life and community.
- Facebook also has a huge set of needs for its products, whether it's camera enhancements, machine translation, accessibility interfaces, or integrity filtering.
- The next thing he mentions is that deep learning itself is becoming a very pervasive and essential component in many other fields.
- His ethos is that as students are trying to get into the field of deep learning, either to apply it to their own work or just to learn the concepts, it's very important to do it hands-on from day one.
- His only advice to people is to do less, but do it hands-on.
- tensor
The primary data structure in PyTorch programs. Tensors are N-dimensional (where N could be very large) data structures, most commonly scalars, vectors, or matrices. The elements of a tensor hold numeric (integer or floating-point) values.
- hyperparameter
The "knobs" that you tweak during successive runs of training a model. For example, learning rate is a hyperparameter.
- neural network
A model that, taking inspiration from the brain, is composed of layers (at least one of which is hidden) consisting of simple connected units or neurons followed by nonlinearities.
- MNIST (Modified National Institute of Standards and Technology database)
A public-domain data set compiled by LeCun, Cortes, and Burges containing 60,000 images, each image showing how a human manually wrote a particular digit from 0–9. Each image is stored as a 28x28 array of integers, where each integer is a grayscale value between 0 and 255, inclusive.
- activation function
A function (for example, ReLU or sigmoid) that takes in the weighted sum of all of the inputs from the previous layer and then generates and passes an output value (typically nonlinear) to the next layer.
- backpropagation
The primary algorithm for performing gradient descent on neural networks. First, the output values of each node are calculated (and cached) in a forward pass. Then, the partial derivative of the error with respect to each parameter is calculated in a backward pass through the graph.
- batch
The set of examples used in one iteration (that is, one gradient update) of model training.
- batch size
The number of examples in a batch. For example, the batch size of SGD is 1, while the batch size of a mini-batch is usually between 10 and 1000. Batch size is usually fixed during training and inference.
- cross-entropy
A generalization of Log Loss to multi-class classification problems. Cross-entropy quantifies the difference between two probability distributions
- epoch
A full training pass over the entire data set such that each example has been seen once. Thus, an epoch represents N/batch size training iterations, where N is the total number of examples.
- hidden layer
A synthetic layer in a neural network between the input layer (that is, the features) and the output layer (the prediction). Hidden layers typically contain an activation function (such as ReLU) for training. A deep neural network contains more than one hidden layer.
- logits
The vector of raw (non-normalized) predictions that a classification model generates, which is ordinarily then passed to a normalization function. If the model is solving a multi-class classification problem, logits typically become an input to the softmax function. The softmax function then generates a vector of (normalized) probabilities with one value for each possible class.
- optimizer
A specific implementation of the gradient descent algorithm.
- step
A forward and backward evaluation of one batch.
- step size
Synonym for learning rate.
- stochastic gradient descent (SGD)
A gradient descent algorithm in which the batch size is one. In other words, SGD relies on a single example chosen uniformly at random from a data set to calculate an estimate of the gradient at each step.
- dropout regularization
A form of regularization useful in training neural networks. Dropout regularization works by removing a random selection of a fixed number of the units in a network layer for a single gradient step. The more units dropped out, the stronger the regularization. This is analogous to training the network to emulate an exponentially large ensemble of smaller networks.
- inference
In machine learning, often refers to the process of making predictions by applying the trained model to unlabeled examples. In statistics, inference refers to the process of fitting the parameters of a distribution conditioned on some observed data. (See the Wikipedia article on statistical inference.)
- overfitting
Creating a model that matches the training data so closely that the model fails to make correct predictions on new data.
- precision
A metric for classification models. Precision identifies the frequency with which a model was correct when predicting the positive class.
- recall
A metric for classification models that answers the following question: Out of all the possible positive labels, how many did the model correctly identify?
- validation set
A subset of the data set, disjoint from the training set, that you use to adjust hyperparameters.
- checkpoint
Data that captures the state of the variables of a model at a particular time. Checkpoints enable exporting model weights, as well as performing training across multiple sessions. Checkpoints also enable training to continue past errors (for example, job preemption). Note that the graph itself is not included in a checkpoint.
- Make use of the `.shape` attribute during debugging and development.
- Make sure you're clearing the gradients in the training loop with `optimizer.zero_grad()`.
- If you're doing a validation loop, be sure to set the network to evaluation mode with `model.eval()`, then back to training mode with `model.train()`.
- If you're trying to run your network on the GPU, check to make sure you've moved the model and all necessary tensors to the GPU with `.to(device)`, where `device` is either `"cuda"` or `"cpu"`.
- Tensors in PyTorch
- Neural networks with PyTorch
- Training Neural Networks
- Classifying Fashion-MNIST
- Inference and Validation
- Saving and Loading Models
- Loading Image Data
- Transfer Learning
- install pytorch
```python
# http://pytorch.org/
from os.path import exists
from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag
platform = '{}{}-{}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag())
cuda_output = !ldconfig -p|grep cudart.so|sed -e 's/.*\.\([0-9]*\)\.\([0-9]*\)$/cu\1\2/'
accelerator = cuda_output[0] if exists('/dev/nvidia0') else 'cpu'
!pip install -q http://download.pytorch.org/whl/{accelerator}/torch-0.4.1-{platform}-linux_x86_64.whl torchvision
import torch
```
- download dataset
```python
!wget -c https://s3.amazonaws.com/content.udacity-data.com/nd089/Cat_Dog_data.zip;
!unzip -qq Cat_Dog_data.zip;
!wget -c https://raw.githubusercontent.com/udacity/deep-learning-v2-pytorch/master/intro-to-pytorch/helper.py
```
- other dependencies
```python
!pip install Pillow==4.0.0
!pip install PIL
!pip install image
import PIL
```
- Click `Runtime` > `Change runtime type`
- Select `GPU`
- WaveNet
- Text Classification
- Language Translation
- Play Atari games
- Play Pictionary
- Play Go
- CNNs powered Drone
- Self-Driving Car
- Predict depth from a single image
- Localize breast cancer
- Save endangered species
- Face App
- What CNNs (Convolutional Neural Networks) are and how they improve our ability to classify images
- How CNNs identify features and how CNNs can be used for image classification
- The various layers that make up a complete CNN
- To understand what a feature is, think about what we are visually drawn to when we first see an object and when we identify different objects. For example, what do we look at to distinguish a cat from a dog? The shape of the eyes, the size, and how they move.
- The most famous database of handwritten digits
- Data normalization is an important pre-processing step. It ensures that each input (each pixel value, in this case) comes from a standard distribution.
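For example, with torchvision this might look like the following sketch (the mean/std values are the commonly quoted MNIST statistics):

```python
from torchvision import datasets, transforms

# ToTensor scales pixels to [0, 1]; Normalize shifts and scales them
# using the dataset's mean and standard deviation.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),
])
train_data = datasets.MNIST(root='data', train=True, download=True, transform=transform)
```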
A set of neurons in a neural network that process a set of input features, or the output of those neurons.
Layers are functions that take tensors and configuration options as input and produce other tensors as output.
- class
One of a set of enumerated target values for a label. For example, in a binary classification model that detects spam, the two classes are spam and not spam. In a multi-class classification model that identifies dog breeds, the classes would be poodle, beagle, pug, and so on.
- scoring
The part of a recommendation system that provides a value or ranking for each item produced by the candidate generation phase.
- More hidden layers generally means more ability to recognize complex patterns
- One or two hidden layers should work fine for small images
- Keep looking for a resource or two that appeals to you
- Try out the models in code
- Rectified Linear Unit (ReLU)
An activation function with the following rules:
- If input is negative or zero, output is 0.
- If input is positive, output is equal to input.
The steps for training/learning from a batch of data are described in the comments below:
- Clear the gradients of all optimized variables
- Forward pass: compute predicted outputs by passing inputs to the model
- Calculate the loss
- Backward pass: compute gradient of the loss with respect to model parameters
- Perform a single optimization step (parameter update)
- Update average training loss
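A sketch of one training epoch following these steps (assumes `model`, `train_loader`, `criterion`, and `optimizer` are already defined):

```python
model.train()
train_loss = 0.0
for data, target in train_loader:
    optimizer.zero_grad()                     # 1. clear the gradients
    output = model(data)                      # 2. forward pass
    loss = criterion(output, target)          # 3. calculate the loss
    loss.backward()                           # 4. backward pass
    optimizer.step()                          # 5. parameter update
    train_loss += loss.item() * data.size(0)  # 6. accumulate training loss
train_loss = train_loss / len(train_loader.dataset)  # average loss per example
```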
- `model.eval()` will set all the layers in your model to evaluation mode.
- This affects layers like dropout layers that turn "off" nodes during training with some probability, but should allow every node to be "on" for evaluation.
- So, you should set your model to evaluation mode before testing or validating it, and set it back to training mode with `model.train()` during the training loop.
- We create a validation set to:
- Measure how well a model generalizes, during training
- Tell us when to stop training a model; when the validation loss stops decreasing (and especially when the validation loss starts increasing and the training loss is still decreasing)
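A common pattern from the notebooks is to checkpoint the model whenever the validation loss improves; a sketch (assumes `valid_loss` has been computed for the epoch and `model` exists):

```python
import torch

valid_loss_min = float('inf')  # best (lowest) validation loss seen so far

# ... inside the epoch loop, after computing valid_loss ...
if valid_loss <= valid_loss_min:
    print('Validation loss decreased ({:.6f} --> {:.6f}). Saving model...'.format(
        valid_loss_min, valid_loss))
    torch.save(model.state_dict(), 'model.pt')  # checkpoint the weights
    valid_loss_min = valid_loss
```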
- MNIST images are already centered; in real images, objects can be in any position.
- Differences between MLPs and CNNs:
- Sparsely connected layers
- A CNN is a special kind of NN that can remember spatial information.
- The key to remembering spatial information is the convolutional layer, which applies a series of different image filters (convolutional kernels) to the input image.
- A CNN should learn to identify spatial patterns, like the curves and lines that make up the number six.
- Intensity is a measure of light and dark, similar to brightness.
- To identify the edges of an object, look at abrupt changes in intensity.
- Filters: to detect changes in intensity in an image, look at groups of pixels and react to alternating patterns of dark/light pixels, producing an output that shows edges of objects and differing textures.
- Edges: areas in images where the intensity changes very quickly.
- Frequency in images is a rate of change.
- On the scarf and striped shirt, we have a high-frequency image pattern.
- Parts of the sky and background change very gradually, which is considered a smooth, low-frequency pattern.
- High-frequency components also correspond to the edges of objects in images, which can help us classify those objects.
- Edge Handling
- Extend: corner pixels are extended in 90° wedges; other edge pixels are extended in lines.
- Padding: the image is padded with a border of 0's (black pixels).
- Crop: any pixel in the output image that would require values from beyond the edge is skipped.
- OpenCV is a computer vision and machine learning software library that includes many common image analysis algorithms that will help us build custom, intelligent computer vision applications.
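A sketch of a custom high-pass filter with OpenCV (the file name is a placeholder; the kernel is a standard Sobel-style edge detector):

```python
import cv2
import numpy as np

image = cv2.imread('image.jpg')  # placeholder path
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# 3x3 kernel that responds to horizontal changes in intensity (vertical edges)
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

# filter2D convolves the kernel with the image; -1 keeps the input's bit depth
filtered = cv2.filter2D(gray, -1, sobel_x)
```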
A layer of a deep neural network in which a convolutional filter passes along an input matrix. For example, consider the following 3x3 convolutional filter:
The following animation shows a convolutional layer consisting of 9 convolutional operations involving the 5x5 input matrix. Notice that each convolutional operation works on a different 3x3 slice of the input matrix. The resulting 3x3 matrix (on the right) consists of the results of the 9 convolutional operations:
- convolutional neural network
A neural network in which at least one layer is a convolutional layer. A typical convolutional neural network consists of some combination of the following layers:
- convolutional layers
- pooling layers
- dense layers
Convolutional neural networks have had great success in certain kinds of problems, such as image recognition.
- Grayscale image -> 2D Matrix
- Color image -> 3 layers of 2D Matrix, one for each channel (Red, Green, Blue)
- Increasing the number of nodes in a convolutional layer -> increasing the number of filters
- Increasing the size of the detected pattern -> increasing the size of the filter
- Stride is the amount by which the filter slides over the image
- The size of a convolutional layer depends on what we do at the edges of our image
- Padding gives the filter more space to move by padding the edges of the image with zeros
- pooling
Reducing a matrix (or matrices) created by an earlier convolutional layer to a smaller matrix. Pooling usually involves taking either the maximum or average value across the pooled area. For example, suppose we have the following 3x3 matrix:
A pooling operation, just like a convolutional operation, divides that matrix into slices and then slides that pooling operation by strides. For example, suppose the pooling operation divides the convolutional matrix into 2x2 slices with a 1x1 stride. As the following diagram illustrates, four pooling operations take place. Imagine that each pooling operation picks the maximum value of the four in that slice:
Pooling helps enforce translational invariance in the input matrix.
Pooling for vision applications is known more formally as spatial pooling. Time-series applications usually refer to pooling as temporal pooling. Less formally, pooling is often called subsampling or downsampling.
- Increasing depth is actually:
- extracting more and more complex patterns and features that help identify the content and the objects in an image
- discarding some spatial information about features, like a smooth background, that doesn't help identify the image
- init
self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0)
- forward
x = F.relu(self.conv1(x))
- arguments
- `in_channels` - the number of inputs (in depth)
- `out_channels` - the number of output channels
- `kernel_size` - the height and width (square) of the convolutional kernel
- `stride` - default `1`
- `padding` - default `0`
- documentation
- pooling layers
- down-sampling factors: `self.pool = nn.MaxPool2d(2, 2)`
- forward

```python
x = F.relu(self.conv1(x))
x = self.pool(x)
```
- example #1
  `self.conv1 = nn.Conv2d(1, 16, 2, stride=2)`
  - grayscale images (1 depth)
  - 16 filters
  - filter size 2x2
  - the filter jumps 2 pixels at a time
- example #2
  `self.conv2 = nn.Conv2d(16, 32, 3, padding=1)`
  - 16 inputs, from the output of example #1
  - 32 filters
  - filter size 3x3
  - the filter jumps 1 pixel at a time
- sequential models

```python
def __init__(self):
    super(ModelName, self).__init__()
    self.features = nn.Sequential(
        nn.Conv2d(1, 16, 2, stride=2),
        nn.MaxPool2d(2, 2),
        nn.ReLU(True),
        nn.Conv2d(16, 32, 3, padding=1),
        nn.MaxPool2d(2, 2),
        nn.ReLU(True)
    )
```
- formula: number of parameters in a convolutional layer
  - `K` - the number of filters
  - `F` - the filter size
  - `D_in` - the last value in the `input shape`
  - number of parameters: `(K * F*F * D_in) + K`
- formula: shape of a convolutional layer
  - `K` - the number of filters
  - `F` - the filter size
  - `S` - the stride
  - `P` - the padding
  - `W_in` - the size of the previous layer
  - width/height of the output: `((W_in - F + 2P) / S) + 1`
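Both formulas are easy to check with a small helper (a sketch; `//` mirrors the flooring PyTorch applies when sizes don't divide evenly):

```python
def conv_output_size(w_in, f, s=1, p=0):
    # ((W_in - F + 2P) / S) + 1
    return (w_in - f + 2 * p) // s + 1

def conv_num_params(k, f, d_in):
    # (K * F*F * D_in) + K  -- weights plus one bias per filter
    return k * f * f * d_in + k

print(conv_output_size(130, 3))   # 128
print(conv_num_params(16, 2, 1))  # 80, for example #1 above
```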
- flattening: to make all of the feature values visible (as a vector) to a linear classification layer
- a representation that encodes only the content of the image
- often called a feature level representation of an image
- CIFAR-10 (Canadian Institute For Advanced Research) is a popular dataset of 60,000 tiny images
- data augmentation
Artificially boosting the range and number of training examples by transforming existing examples to create additional examples. For example, suppose images are one of your features, but your data set doesn't contain enough image examples for the model to learn useful associations. Ideally, you'd add enough labeled images to your data set to enable your model to train properly. If that's not possible, data augmentation can rotate, stretch, and reflect each image to produce many variants of the original picture, possibly yielding enough labeled data to enable excellent training.
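With torchvision this is typically done in the training transform; a sketch (the specific parameter values are illustrative):

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomRotation(10),      # random rotation up to 10 degrees
    transforms.RandomResizedCrop(224),  # random crop, resized to 224x224
    transforms.RandomHorizontalFlip(),  # random left-right flip
    transforms.ToTensor(),
])
```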
- translational invariance
In an image classification problem, an algorithm's ability to successfully classify images even when the position of objects within the image changes. For example, the algorithm can still identify a dog, whether it is in the center of the frame or at the left end of the frame.
- size invariance
In an image classification problem, an algorithm's ability to successfully classify images even when the size of the image changes. For example, the algorithm can still identify a cat whether it consumes 2M pixels or 200K pixels. Note that even the best image classification algorithms still have practical limits on size invariance. For example, an algorithm (or human) is unlikely to correctly classify a cat image consuming only 20 pixels.
- rotational invariance
In an image classification problem, an algorithm's ability to successfully classify images even when the orientation of the image changes. For example, the algorithm can still identify a tennis racket whether it is pointing up, sideways, or down. Note that rotational invariance is not always desirable; for example, an upside-down 9 should not be classified as a 9.
- Since 2010, the ImageNet project has held the ImageNet Large Scale Visual Recognition Competition, an annual competition for the best CNN for object recognition and classification.
- The first breakthrough was in 2012: a network called AlexNet, developed by a team at the University of Toronto, pioneered the use of the ReLU activation function and dropout as a technique for avoiding overfitting.
- The 2014 winner was VGGNet, often referred to as just VGG, from the Visual Geometry Group at Oxford University; it has two versions, VGG 16 and VGG 19.
- The 2015 winner was ResNet from Microsoft Research; its largest, groundbreaking version has 152 layers. It addresses the vanishing gradient problem and achieves superhuman performance in classifying images in the ImageNet database.
- visualizing the activation maps and convolutional layers
- Taking filters from convolutional layers and constructing images that maximize their activations; Google researchers got creative with this and designed a technique called Deep Dream.
- Say we have a picture of a tree and investigate a filter for detecting a building; we end up creating an image that looks like some sort of tree/building hybrid.
- Based on a paper by Zeiler and Fergus; visualization using this toolbox.
- Layer 1 - picks out very simple shapes and patterns, like lines and blobs
- Layer 2 - circles, stripes, and rectangles
- Layer 3 - complex combinations of features from the second layer
- Layer 4 - continues the progression
- Layer 5 - classification
- A CNN takes an input image, then puts it through several convolutional and pooling layers.
- The result is a set of feature maps, reduced in size from the original image.
- Flatten these maps, creating a feature vector that can be passed to a series of fully connected linear layers to produce a probability distribution of class scores.
- From these, a predicted class label can be extracted.
- CNNs are not restricted to the image classification task; they can be applied to any task with a fixed number of outputs, such as regression tasks that look at points on a face or detect human poses.
- Q: In the case of our 28x28 images, how many entries will the corresponding image vector have when this matrix is flattened?
- A: `784`
- E: `28*28*1 values = 784`
- Q: After looking at existing work, how many hidden layers will you use in your MLP for image classification?
- A: 2
- E: There is not one correct answer here, but one or two hidden layers should work fine for this simple task, and it's always good to do your research!
- Q: Of the four kernels pictured above, which would be best for finding and enhancing horizontal edges and lines in an image?
- A: `d`
- E: This kernel finds the difference between the top and bottom edges surrounding a given pixel.
- Q: How might you define a Maxpooling layer, such that it down-samples an input by a factor of 4?
- A: `nn.MaxPool2d(2,4)`, `nn.MaxPool2d(4,4)`
- E: The best choice would be to use a kernel and stride of 4, so that the maxpooling function sees every input pixel once, but any layer with a stride of 4 will down-sample an input by that factor.
For the following quiz questions, consider an input image that is 130x130 (x, y) and 3 in depth (RGB). Say this image goes through the following layers in order:

```python
nn.Conv2d(3, 10, 3)
nn.MaxPool2d(4, 4)
nn.Conv2d(10, 20, 5, padding=2)
nn.MaxPool2d(2, 2)
```
- Q: After going through all four of these layers in sequence, what is the depth of the final output?
- A: `20`
- E: The final depth is determined by the last convolutional layer, which has a `depth` = `out_channels` = 20.
- Q: What is the x-y size of the output of the final maxpooling layer? Be careful to look at how the 130x130 image passes through (and shrinks) as it moves through each convolutional and pooling layer.
- A: 16
- E: The 130x130 image shrinks by two after the first convolutional layer, then is down-sampled by 4, then by 2, after each successive maxpooling layer:

```
((W_in - F + 2P) / S) + 1

((130 - 3 + 2*0) / 1) + 1 = 128
128 / 4 = 32
((32 - 5 + 2*2) / 1) + 1 = 32
32 / 2 = 16
```

- Q: How many parameters, total, will be left after an image passes through all four of the above layers in sequence?
- A: `16*16*20`
- E: It's the x-y size of the final output times the number of final channels/depth = `16*16*20`.
- Multi-Layer Perceptron, MNIST
- Multi-Layer Perceptron, MNIST (With Validation)
- Creating a Filter, Edge Detection
- Convolutional Layer
- Maxpooling Layer
- Convolutional Neural Networks
- Convolutional Neural Networks - Image Augmentation
- apply the style of one image to another image
- A feature space designed to capture texture and color information is used; it essentially looks at spatial correlations within a layer of a network.
- correlation is a measure of the relationship between two or more variables
- Similarities and differences between features in a layer should give some information about the texture and color found in an image, while at the same time leaving out information about the actual arrangement and identity of the different objects in that image.
- VGG19 -> 19 layer VGG network
- When the network sees the content image, it goes through the feed-forward process until it reaches a conv layer deep in the network; the output of that layer is the content representation.
- When it sees the style image, it extracts different features from multiple layers, which represent the style of that image.
- Content loss is a loss that calculates the difference between the content (Cc) and target (Tc) image representations.
- Correlations at each layer of the network are given by a Gram matrix.
- The first step in calculating the Gram matrix is to vectorize the values of the feature maps.
- By flattening the XY dimensions of the feature maps, we're converting a 3D conv layer into a 2D matrix of values.
- The next step is to multiply the vectorized feature map by its transpose to get the Gram matrix (see the sketch after this list).
- Style loss is a loss that calculates the difference between the style image (Ss) and target (Ts) style representations; `a` is a constant that accounts for the number of values in each layer, and `w` are the style weights.
- Add together the content loss and style loss to get the total loss, then use typical backpropagation and optimization to reduce the total loss.
- The alpha beta ratio is the ratio between alpha (the content weight) and beta (the style weight).
- Different alpha beta ratios can result in different generated images.
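The vectorize-then-multiply steps above can be written in a few lines; a sketch (the random feature map matches the quiz dimensions below):

```python
import torch

def gram_matrix(tensor):
    d, h, w = tensor.size()          # feature map of shape (depth, height, width)
    flat = tensor.view(d, h * w)     # vectorize: flatten the spatial dimensions
    return torch.mm(flat, flat.t())  # multiply by the transpose

features = torch.randn(20, 8, 8)
print(gram_matrix(features).shape)   # torch.Size([20, 20])
```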
- Q: Given a convolutional layer with dimensions `d x h x w = (20*8*8)`, what length will one row of the vectorized convolutional layer have? (Vectorized means that the spatial dimensions are flattened.)
- A: `64`
- E: When the height and width (8 x 8) are flattened, the resultant 2D matrix will have as many columns as the height and width, multiplied: `8*8 = 64`.
- Q: Given a convolutional layer with dimensions `d x h x w = (20*8*8)`, what dimensions (h x w) will the resultant Gram matrix have?
- A: `(20 x 20)`
- E: The Gram matrix will be a square matrix, with a width and height equal to the depth of the convolutional layer in question.
- RNN (Recurrent Neural Networks)
A neural network that is intentionally run multiple times, where parts of each run feed into the next run. Specifically, hidden layers from the previous run provide part of the input to the same hidden layer in the next run. Recurrent neural networks are particularly useful for evaluating sequences, so that the hidden layers can learn from previous runs of the neural network on earlier parts of the sequence.
For example, the following figure shows a recurrent neural network that runs four times. Notice that the values learned in the hidden layers from the first run become part of the input to the same hidden layers in the second run. Similarly, the values learned in the hidden layer on the second run become part of the input to the same hidden layer in the third run. In this way, the recurrent neural network gradually trains and predicts the meaning of the entire sequence rather than just the meaning of individual words.
- LSTM (Long Short-Term Memory)
LSTMs are an improvement over RNNs, and are quite useful when one needs to switch between remembering recent things and things from a long time ago.
- RNNs work as follows:
- memory comes in and merges with the current event
- the output comes out as a prediction of what the input is
- and as part of the input for the next iteration of the neural network
- RNNs have a problem: their memory is short-term memory.
- LSTMs work as follows:
- they keep track of long term memory, which comes in and comes out
- and short term memory, which also comes in and comes out
- From there, we get a new long term memory, a new short term memory, and a prediction. Here, we protect old information more.
- Architecture of LSTM
- forget gate: long term memory (LTM) goes here, where it forgets everything that it doesn't consider useful
- learn gate: short term memory and the event are joined together, keeping the recently learned information and removing anything unnecessary
- remember gate: the long term memory that hasn't been forgotten yet, plus the new information that has been learned, get joined together to update the long term memory
- use gate: decides what information to use, from what we previously knew plus what we just learned, to make a prediction. The output becomes both the prediction and the new short term memory (STM).
- RNN Architecture
- LSTM Architecture
- The learn gate works as follows:
- take the STM and the event and join them (using a tanh activation function)
- then ignore a bit of it (the ignore factor) to keep only the important part (using a sigmoid activation function)
- The forget gate works as follows:
- take the LTM and decide what parts to keep and what to forget (the forget factor, using a sigmoid activation function)
- The remember gate works as follows:
- take the LTM coming out of the forget gate and the STM coming out of the learn gate and combine them together
- The use gate works as follows:
- take the LTM coming out of the forget gate (apply tanh) and the STM coming out of the learn gate (apply sigmoid) and multiply them together to come up with a new STM and an output
- GRU (Gated Recurrent Unit)
- combines the forget and learn gates into an update gate
- runs this through a combine gate
- only returns one working memory
- Peephole Connections
- the forget gate also connects the LTM into the neural network that calculates the forget factor
- LSTM with Peephole Connections
- do a peephole connection for every one of the forget-type nodes
- The network will learn about some text one character at a time, and then generate new text one character at a time.
- The architecture is as follows:
- the input layer will pass the characters as one-hot encoded vectors
- these vectors go to the hidden layer, which is built with LSTM cells
- the output layer is used to predict the next character (using a softmax activation)
- use cross-entropy loss for training with gradient descent
- Use matrix operations to make training more efficient.
- RNNs train on multiple sequences in parallel; for example:
- take a sequence of numbers from 1 to 12
- split it in half and pass in the two sequences
- the batch size corresponds to the number of sequences we're using; here we'd say the batch size is 2
- we can retain the hidden state from one batch and use it at the start of the next batch
- Q: Say you've defined a GRU layer with `input_size = 100`, `hidden_size = 20`, and `num_layers=1`. What will the dimensions of the hidden state be if you're passing in data, batch first, in batches of 3 sequences at a time?
- A: `(1, 3, 20)`
- E: The hidden state should have dimensions: `(num_layers, batch_size, hidden_dim)`.
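This is easy to verify with a quick sketch:

```python
import torch
import torch.nn as nn

gru = nn.GRU(input_size=100, hidden_size=20, num_layers=1, batch_first=True)
x = torch.randn(3, 5, 100)  # batch of 3 sequences, each 5 steps of 100 features
out, hidden = gru(x)
print(hidden.shape)         # torch.Size([1, 3, 20]) = (num_layers, batch_size, hidden_dim)
```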
- PyTorch popular in research settings due to:
- flexibility
- expressiveness
- ease of development
- Adoption has been slow in industry because PyTorch wasn't as useful in production environments, which typically require models to run in C++.
- Go to the Getting Started page, then configure and run the install command.
- The minimum requirement is PyTorch 1.0 to use the TorchScript and tracing features.
- PyTorch 1.0 has been specifically built for making the transition between developing a model in Python and converting it into a module that can be loaded into a C++ environment.
- tracing
- map out the structure of a model by passing an example tensor through it
- behind the scenes, PyTorch keeps track of all the operations being performed on the inputs
- this way, it can actually build a static graph that can then be exported and loaded into C++
- to do this, we use the JIT (Just-In-Time) compiler
- See Torch Script
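A sketch of tracing, following the pattern in the PyTorch docs (the ResNet model and file name are illustrative):

```python
import torch
import torchvision

model = torchvision.models.resnet18(pretrained=True)
model.eval()

# Pass an example tensor through the model; PyTorch records the
# operations and builds a static graph.
example = torch.rand(1, 3, 224, 224)
traced_script_module = torch.jit.trace(model, example)
traced_script_module.save("traced_resnet18.pt")  # loadable from C++ with torch::jit::load
```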
- torch script
- an intermediate representation that can be compiled and serialized by the Torch Script Compiler
- the workflow is as follows: develop the model, set the hyperparameters, train, test, convert the PyTorch model into Torch Script, and compile to a C++ representation
- two ways of converting PyTorch model to Torch Script
- tracing
- annotations
- used for control flow that doesn't work with the tracing method, for example if statements in the forward method that depend on the input
- Use a `torch.jit.ScriptModule` subclass and add the `@torch.jit.script_method` decorator to convert it to a script module.
- We can use the `save` method to serialize a script module to a file, which can then be loaded into C++.
- See PyTorch C++ API
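A sketch of the annotation approach in the PyTorch 1.0 style described above (the module itself is illustrative):

```python
import torch

class MyModule(torch.jit.ScriptModule):
    @torch.jit.script_method
    def forward(self, x):
        # A data-dependent branch: this can't be traced, but it can be scripted.
        if bool(x.sum() > 0):
            return x
        return -x

module = MyModule()
module.save("my_module.pt")  # serialize for loading from C++
```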
- General workflow:
- building and defining model in Python with PyTorch
- training it there, and then once it's all trained
- convert it to a Script Module either with tracing or annotations
- then serialize it with the save method
- from there we can use the C++ API to load it into a C++ application
- Mhendri's Tips, Model Performance, Submission Troubleshooting
- Gabriele Picco's Deep Learning Flower Identifier
- 102 Category Flower Dataset
- Jose Nieto's Implementing an Image Classifier with PyTorch
- How to move our model from Google Colab to Udacity’s Workspace
- Tips and tricks for a successful Udacity project checkpoint load
- Images taken from lectures videos at Intro to Deep Learning with PyTorch
- Machine Learning Glossary