Acknowledgement:
The course website: MACHINE LEARNING 2022 SPRING
ML is about finding a function.
Besides regression and classification tasks, ML also focuses on Structured Learning -- create something with structure (image, document).
Three main ML tasks:
- Regression
- Classification
- Structured Learning
Guess a function (model)
Loss is itself a function of the parameters, written as $L(b, w)$.
An example loss function is:
$$ L = \frac{1}{N} \sum_{n} e_n $$
where $e_n$ measures the difference between the prediction $y_n$ and the label $\hat{y}_n$:
- Mean Absolute Error (MAE): $e_n = |y_n - \hat{y}_n|$
- Mean Square Error (MSE): $e_n = (y_n - \hat{y}_n)^2$
If $y$ and $\hat{y}$ are both probability distributions, we may use cross-entropy instead.
The goal is to find the best parameter:
$$ w^*, b^* = \arg\min_{w,b} L $$
Solution: Gradient Descent
- (Randomly) pick an initial value $w^0$.
- Compute $\frac{\partial L}{\partial w} \big|_{w=w^0}$ (if it is negative, we should increase $w$; if it is positive, we should decrease $w$).
- Perform the update step $w^1 \leftarrow w^0 - \eta \frac{\partial L}{\partial w} \big|_{w=w^0}$ iteratively.
- Stop either when we have performed a maximum number of update steps or when the update stops ($\eta \frac{\partial L}{\partial w} = 0$).
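As a minimal sketch (not from the lecture), the whole loop for a single parameter $w$ of a toy model $y = wx$ with MSE loss could look like this; the data, learning rate and stopping threshold are made up for illustration:

```python
# Toy gradient descent on one parameter w of the model y = w * x, with MSE loss.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.1]            # roughly y = 2x (made-up data)

def dL_dw(w):
    # Derivative of L = (1/N) * sum((w*x - y)^2) with respect to w.
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w, eta = 0.0, 0.01                    # initial value w^0 and learning rate
for step in range(1000):              # maximum number of update steps
    g = dL_dw(w)
    if abs(eta * g) < 1e-8:           # the update (almost) stops
        break
    w = w - eta * g                   # w^{t+1} <- w^t - eta * dL/dw
print(w)                              # close to 2
```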
If we have two parameters:
- (Randomly) pick initial values $w^0, b^0$.
- Compute the partial derivatives and perform the update:
$$ w^1 \leftarrow w^0 - \eta \frac{\partial L}{\partial w} \bigg|_{w=w^0, b=b^0} \qquad b^1 \leftarrow b^0 - \eta \frac{\partial L}{\partial b} \bigg|_{w=w^0, b=b^0} $$
However, linear models are too simple; this severe limitation is called model bias.
All piecewise linear curves can be composed of a constant plus a set of "blue curves" (Hard Sigmoid functions). More pieces require more blue curves.
How do we represent a blue curve (Hard Sigmoid function)? We can approximate it with a Sigmoid function:
$$ y = c \cdot \frac{1}{1 + e^{-(b + wx)}} = c \cdot \operatorname{sigmoid}(b + wx) $$
We can change $w$, $b$ and $c$ to obtain Sigmoid curves with different slopes, shifts and heights.
Different Sigmoid curves -> combine to approximate different piecewise linear functions -> approximate different continuous functions.
From a model with high bias, $y = b + wx$, we move to a more flexible model:
$$ y = b + \sum_i c_i \operatorname{sigmoid}\left(b_i + w_i x\right) $$
Also, if we consider multiple features $x_j$:
$$ y = b + \sum_i c_i \operatorname{sigmoid}\left(b_i + \sum_j w_{ij} x_j\right) $$
Here, $i$ indexes the Sigmoid functions and $j$ indexes the features. Collecting all the unknown parameters into one long vector $\boldsymbol{\theta}$, our loss function is now expressed as $L(\boldsymbol{\theta})$.
Optimization is still the same.
- (Randomly) pick initial values $\boldsymbol{\theta}^0$.
- Calculate the gradient
$$ \boldsymbol{g} = \begin{bmatrix} \frac{\partial L}{\partial \theta_1}\big|_{\boldsymbol{\theta}=\boldsymbol{\theta}^0} \\ \frac{\partial L}{\partial \theta_2}\big|_{\boldsymbol{\theta}=\boldsymbol{\theta}^0} \\ \vdots \end{bmatrix} = \nabla L(\boldsymbol{\theta}^0), \qquad \boldsymbol{\theta}, \boldsymbol{g} \in \mathbb{R}^n $$
- Perform the update step
$$ \begin{bmatrix} \theta_1^1 \\ \theta_2^1 \\ \vdots \end{bmatrix} \leftarrow \begin{bmatrix} \theta_1^0 \\ \theta_2^0 \\ \vdots \end{bmatrix} - \begin{bmatrix} \eta \frac{\partial L}{\partial \theta_1}\big|_{\boldsymbol{\theta}=\boldsymbol{\theta}^0} \\ \eta \frac{\partial L}{\partial \theta_2}\big|_{\boldsymbol{\theta}=\boldsymbol{\theta}^0} \\ \vdots \end{bmatrix}, $$
namely
$\boldsymbol{\theta}^1 \leftarrow \boldsymbol{\theta}^0 - \eta \boldsymbol{g}$
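A compact numpy sketch of this vector update, using a numerically estimated gradient on an arbitrary toy loss (the loss and all constants here are made up for illustration):

```python
import numpy as np

def loss(theta):
    # Arbitrary toy loss with its minimum at theta = [1, -2, 3].
    return np.sum((theta - np.array([1.0, -2.0, 3.0])) ** 2)

def gradient(f, theta, eps=1e-6):
    # Finite-difference estimate of g = grad L(theta).
    g = np.zeros_like(theta)
    for i in range(theta.size):
        d = np.zeros_like(theta)
        d[i] = eps
        g[i] = (f(theta + d) - f(theta - d)) / (2 * eps)
    return g

theta = np.random.randn(3)            # theta^0
eta = 0.1
for _ in range(200):
    g = gradient(loss, theta)
    theta = theta - eta * g           # theta^{t+1} <- theta^t - eta * g
print(theta)                          # close to [1, -2, 3]
```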
The terms batch and epoch are different: each update uses one batch of examples, while one epoch means all the batches (the whole training set) have been seen once; we usually shuffle the data between epochs. Batch size is the number of examples used for one update.
The ReLU (Rectified Linear Unit) activation function looks kind of like half of the Hard Sigmoid function we saw earlier. As a result, we can combine two ReLU curves to obtain the same shape as one Hard Sigmoid curve.
But why do we want a "deep" network rather than a "fat" one? The answer will be revealed in a later lecture.
Training data is a set of input-label pairs $\{(\boldsymbol{x}^1, \hat{y}^1), ..., (\boldsymbol{x}^N, \hat{y}^N)\}$.
Training steps:
- Write a function with unknown parameters, namely $y = f_{\boldsymbol{\theta}}(\boldsymbol{x})$.
- Define the loss from the training data: $L(\boldsymbol{\theta})$.
- Optimization: $\boldsymbol{\theta}^* = \arg \min_{\boldsymbol{\theta}} L$.
- Use $y = f_{\boldsymbol{\theta}^*}(\boldsymbol{x})$ to label the testing data.
Model bias: the potential function set of our model does not even include the desired optimal function/model.
Large loss does not always imply model bias. There may be issues with optimization: gradient descent does not always find the global minimum, and we may get stuck at a local minimum. In the language of function sets, the set theoretically contains the optimal function, but we fail to find it.
- gain insights from comparison to identify whether the failure of the model is due to optimization issues, overfitting or model bias
- start from shallower networks (or other models like SVM, which are much easier to optimize)
- if deeper networks do not achieve smaller loss on training data, then there are optimization issues (as seen from the graph below)
For example, here the 5-layer network should be able to do at least as well as the 4-layer network; if it does worse on training data, this is clearly due to optimization problems.
Solutions:
- more training data
- data augmentation (in image recognition, this means flipping or zooming into images)
- use a more constrained model:
- fewer parameters
- sharing parameters
- fewer features
- early stopping
- regularization
- dropout
For example, CNN is a more constrained version of the fully-connected vanilla neural network.
Mismatch occurs when the training dataset and the testing dataset come from different distributions. Mismatch cannot be prevented by simply increasing the training dataset, as we did for overfitting. More information on mismatch will be provided in Homework 11.
Optimization does not always fail because we are stuck at a local minimum. We may also encounter saddle points, which are not local minima but still have a gradient of $0$.
All the points with a gradient of $0$ are collectively called critical points.
If we are stuck at a local minimum, there is no way to further decrease the loss (all nearby points are higher); if we are stuck at a saddle point, we can escape it. But how can we distinguish a saddle point from a local minimum?
Around a point $\boldsymbol{\theta}'$, the loss can be approximated by its second-order Taylor expansion:
$$ L(\boldsymbol{\theta}) \approx L(\boldsymbol{\theta}') + (\boldsymbol{\theta} - \boldsymbol{\theta}')^T \boldsymbol{g} + \frac{1}{2} (\boldsymbol{\theta} - \boldsymbol{\theta}')^T H (\boldsymbol{\theta} - \boldsymbol{\theta}') $$
Gradient $\boldsymbol{g}$ is a vector: $\boldsymbol{g} = \nabla L(\boldsymbol{\theta}')$, with $g_i = \frac{\partial L(\boldsymbol{\theta}')}{\partial \theta_i}$.
Hessian $H$ is a matrix: $H_{ij} = \frac{\partial^2}{\partial \theta_i \partial \theta_j} L(\boldsymbol{\theta}')$.
The green part (in the lecture's figure) is the Gradient term and the red part is the Hessian term.
When we are at a critical point, the gradient term is zero, so the approximation is "dominated" by the Hessian term. Namely, writing $\boldsymbol{v} = \boldsymbol{\theta} - \boldsymbol{\theta}'$, our approximation formula becomes:
$$ L(\boldsymbol{\theta}) \approx L(\boldsymbol{\theta}') + \frac{1}{2} \boldsymbol{v}^T H \boldsymbol{v} $$
Local minima:
- For all $\boldsymbol{v}$: $\boldsymbol{v}^T H \boldsymbol{v} > 0$ ($H$ is positive definite, so all eigenvalues are positive). Around $\boldsymbol{\theta}'$: $L(\boldsymbol{\theta}) > L(\boldsymbol{\theta}')$.
Local maxima:
- For all $\boldsymbol{v}$: $\boldsymbol{v}^T H \boldsymbol{v} < 0$ ($H$ is negative definite, so all eigenvalues are negative). Around $\boldsymbol{\theta}'$: $L(\boldsymbol{\theta}) < L(\boldsymbol{\theta}')$.
Saddle point:
- Sometimes $\boldsymbol{v}^T H \boldsymbol{v} < 0$ and sometimes $\boldsymbol{v}^T H \boldsymbol{v} > 0$. Namely, $H$ is indefinite -- some eigenvalues are positive and some are negative.
Example:
If by analyzing $H$'s eigenvalues we find both positive and negative ones, we know the critical point is a saddle point.
Suppose $\boldsymbol{u}$ is an eigenvector of $H$ and $\lambda$ is its corresponding eigenvalue.
If the eigenvalue $\lambda$ is negative, then taking $\boldsymbol{v} = \boldsymbol{u}$ gives $\boldsymbol{v}^T H \boldsymbol{v} = \lambda \|\boldsymbol{u}\|^2 < 0$, so $L(\boldsymbol{\theta}) < L(\boldsymbol{\theta}')$. Updating $\boldsymbol{\theta}$ along the direction of $\boldsymbol{u}$ therefore decreases the loss and lets us escape the saddle point.
However, this method is seldom used in practice because of the huge computation need to compute the Hessian matrix and the eigenvectors/eigenvalues.
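Purely as an illustration (and exactly the kind of computation the note above says is rarely done in practice), a small numpy sketch that classifies a critical point of a toy two-parameter loss by the signs of the Hessian's eigenvalues:

```python
import numpy as np

def toy_loss(t):
    return t[0] ** 2 - t[1] ** 2            # critical point at (0, 0)

def hessian(f, theta, eps=1e-4):
    n = theta.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            # Central finite difference for d^2 L / (d theta_i d theta_j).
            H[i, j] = (f(theta + ei + ej) - f(theta + ei - ej)
                       - f(theta - ei + ej) + f(theta - ei - ej)) / (4 * eps ** 2)
    return H

eig = np.linalg.eigvalsh(hessian(toy_loss, np.zeros(2)))
if np.all(eig > 0):
    print("local minimum")
elif np.all(eig < 0):
    print("local maximum")
else:
    print("saddle point", eig)               # here: one positive and one negative eigenvalue
```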
A local minimum in a lower-dimensional space may be a saddle point in a higher-dimensional space. Empirically, when we have lots of parameters, local minima are very rare.
Note that here, "time for cooldown" does not always determine the time it takes to complete an epoch.
Empirically, a large batch size does not require a longer time to compute the gradient for one update (unless the batch size is extremely large), thanks to parallel computing on GPUs.
Smaller batch requires longer time for one epoch (longer time for seeing all data once).
However, large batches are not always better than small batches: the noise brought by small batches leads to better optimization.
Small batches are also better on testing data (they help against overfitting).
This may be because a large batch is more likely to get us stuck at a sharp minimum, which is bad for testing loss. Because of the noise, a small batch is more likely to help us escape sharp minima; at convergence, we will more likely end up in a flat minimum.
Batch size is another hyperparameter.
Vanilla Gradient Descent:
Gradient Descent with Momentum:
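A minimal sketch of the momentum update on a toy quadratic loss (the coefficients are illustrative, not the lecture's):

```python
import numpy as np

def grad(theta):
    return theta                       # gradient of the toy loss L = 0.5 * ||theta||^2

theta = np.array([3.0, -2.0])
eta, lam = 0.1, 0.9                    # learning rate and momentum coefficient
m = np.zeros_like(theta)               # m^0 = 0: the previous movement

for _ in range(200):
    g = grad(theta)
    m = lam * m - eta * g              # movement = weighted previous movement - eta * gradient
    theta = theta + m                  # vanilla gradient descent would be: theta -= eta * g
print(theta)                           # close to the minimum at [0, 0]
```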
The fact that the training process is stuck does not always mean the gradient is small.
Training can be difficult even without critical points. Gradient descent can fail to reach the global minimum even on a convex error surface, and a single fixed learning rate cannot fix this.
The learning rate cannot be one-size-fits-all: where the gradient is large (a steep surface), we expect a smaller learning rate; where the surface is flat, we expect a larger one.
Formulation for one parameter $\theta_i$:
$$ \theta_i^{t+1} \leftarrow \theta_i^t - \eta g_i^t, \qquad g_i^t = \frac{\partial L}{\partial \theta_i} \bigg|_{\boldsymbol{\theta} = \boldsymbol{\theta}^t} $$
The new formulation introduces a parameter-dependent term $\sigma_i^t$ (in Adagrad, the root mean square of all past gradients of $\theta_i$):
$$ \theta_i^{t+1} \leftarrow \theta_i^t - \frac{\eta}{\sigma_i^t} g_i^t $$
Why does this formulation work? If the gradients of a parameter are consistently small (a flat direction), $\sigma_i^t$ is small and the effective learning rate $\eta / \sigma_i^t$ becomes large; if the gradients are consistently large (a steep direction), the effective learning rate becomes small.
However, this formulation still has some problems. We assumed that the gradient for one parameter will stay relatively the same. However, it's not always the case. For example, there may be places where the gradient becomes large and places where the gradient becomes small (as seen from the graph below). The reaction of this formulation to a new gradient change is very slow.
The new formulation is now:
$$
\sigma_i^t = \sqrt{\alpha(\sigma_i^{t-1})^2 + (1-\alpha)(g_i^t)^2}
$$
The Adam optimizer is basically the combination of RMSProp and Momentum.
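A small sketch of this family of adaptive updates, using the RMSProp form above (Adam additionally keeps a momentum term and bias corrections; the toy loss and constants are illustrative):

```python
import numpy as np

def grad(theta):
    # A toy loss whose two dimensions have very different curvature (an elongated valley).
    return np.array([100.0 * theta[0], 0.01 * theta[1]])

theta = np.array([1.0, 1.0])
eta, alpha, eps = 0.01, 0.9, 1e-8
sigma = np.zeros_like(theta)                    # running RMS of past gradients, per parameter

for t in range(5000):
    g = grad(theta)
    sigma = np.sqrt(alpha * sigma ** 2 + (1 - alpha) * g ** 2)
    theta = theta - eta / (sigma + eps) * g     # parameter-dependent learning rate eta / sigma
print(theta)                                    # both dimensions shrink toward 0 at a similar pace
```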
This is the optimization process with Adagrad:
To prevent the oscillations at the final stage, we can use two methods:
As training goes on, we are closer to the destination, so we reduce the learning rate over time (Learning Rate Decay).
This improves the previous result:
We first increase the learning rate and then decrease it. This strategy is called Warm Up.
We can't directly apply regression to classification problems because regression tends to penalize the examples that are "too correct."
It's also problematic to directly represent Class 1 as numeric value $1$, Class 2 as $2$ and Class 3 as $3$, because that implicitly assumes Class 1 and Class 2 are more related than Class 1 and Class 3, which may not be true.
One possible model is:
The loss function counts the number of times $f$ gives an incorrect prediction on the training data.
We can represent classes as one-hot vectors. For example, we can represent Class 1 as $[1, 0, 0]^T$, Class 2 as $[0, 1, 0]^T$ and Class 3 as $[0, 0, 1]^T$.
We know that
Minimizing cross-entropy is equivalent to maximizing likelihood.
Cross-entropy is more frequently used for classification than MSE. At the region with higher loss, the gradient of MSE is close to $0$ (the surface is very flat), so gradient descent can get stuck there; cross-entropy gives a larger gradient in that region.
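A tiny numeric sketch of softmax followed by cross-entropy against a one-hot label (the logits are arbitrary):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([3.0, 1.0, -2.0])      # raw network outputs for 3 classes
y = softmax(logits)                      # approximately [0.876, 0.118, 0.006]
y_hat = np.array([1.0, 0.0, 0.0])        # one-hot ground truth: Class 1

cross_entropy = -np.sum(y_hat * np.log(y))
mse = np.mean((y - y_hat) ** 2)
print(cross_entropy, mse)
```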
We can therefore predict the distribution of
Input: vector
We assume the data points of each class are sampled from a Gaussian distribution with mean $\mu$ and covariance matrix $\Sigma$. We then find the $\mu$ and $\Sigma$ that maximize the likelihood of the observed examples:
$$ \mu^*, \Sigma^* = \arg \max_{\mu,\Sigma} L(\mu, \Sigma) $$
The solution is as follows:
$$ \mu^* = \frac{1}{79} \sum_{n=1}^{79} x^n, \qquad \Sigma^* = \frac{1}{79} \sum_{n=1}^{79} (x^n - \mu^*)(x^n - \mu^*)^T $$
But the above generative model fails to give a high-accuracy result. Why? In that formulation, every class has its unique mean vector and covariance matrix. The size of the covariance matrix tends to increase as the feature size of the input increases. This increases the number of trainable parameters, which tends to result in overfitting. Therefore, we can force different distributions to share the same covariance matrix.
Intuitively, the new covariance matrix is the sum of the original covariance matrices weighted by the frequencies of samples in each distribution.
We can always use whatever distribution we like (we used Gaussian in the previous example).
If we assume all the dimensions are independent, then we are using a Naive Bayes Classifier.
Each dimension $x_i$ is then modeled by its own one-dimensional distribution, and $P(\boldsymbol{x}|C) = \prod_i P(x_i|C)$.
But if the assumption does not hold, the Naive Bayes Classifier may have a very high bias.
Furthermore:
Further simplification goes:
Since we assume the distributions share the covariance matrix, we can further simplify the formula:
This is why the decision boundary is a straight line.
In generative models, we estimate the class priors and the distribution parameters ($\mu^1$, $\mu^2$, $\Sigma$), and $\boldsymbol{w}$ and $b$ are then determined by them. Can we find $\boldsymbol{w}$ and $b$ directly?
We want to find $\boldsymbol{w}$ and $b$ such that the posterior probability $P_{w,b}(C_1|\boldsymbol{x}) = \sigma(\boldsymbol{w} \cdot \boldsymbol{x} + b)$ fits the training data.
The function set is therefore (including all different $\boldsymbol{w}$ and $b$): $f_{w,b}(\boldsymbol{x}) = P_{w,b}(C_1|\boldsymbol{x})$. This is logistic regression.
Given the training data, we define the likelihood $L(w,b)$ as the probability of the model generating the observed labels, and find
$$ w^*, b^* = \arg \max_{w,b} L(w,b) $$
We can rewrite the formulation by introducing labels $\hat{y}^n$ ($\hat{y}^n = 1$ for Class 1 and $\hat{y}^n = 0$ for Class 2).
Therefore, maximizing the likelihood is equivalent to minimizing the sum of cross-entropy between the predictions $f_{w,b}(x^n)$ and the labels $\hat{y}^n$.
Here, the larger the difference ($\hat{y}^n - f_{w,b}(x^n)$) is, the larger the update.
Therefore, the update step for logistic regression is:
This looks the same as the update step for linear regression. However, in logistic regression the outputs $f_{w,b}(x^n)$ are always between $0$ and $1$ and the labels $\hat{y}^n$ are $0$ or $1$, while in linear regression they can be any real values.
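A compact numpy sketch of this update rule on made-up 2-D data (data, labels and learning rate are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])   # toy features
y_hat = np.concatenate([np.ones(50), np.zeros(50)])                      # class labels (1 or 0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b, eta = np.zeros(2), 0.0, 0.1
for _ in range(500):
    y = sigmoid(X @ w + b)                  # f_{w,b}(x^n) for every example
    err = y_hat - y                         # larger difference -> larger update
    w = w + eta * X.T @ err / len(X)        # same form as the linear regression update
    b = b + eta * err.mean()
print(w, b)
```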
Comparison of the two algorithms:
Why is using square error instead of cross entropy a bad idea for logistic regression?
In either case, this algorithm fails to produce effective optimization. A visualization of the loss functions for both cross entropy and square error is illustrated below:
Logistic regression is an example of a discriminative model, while the Gaussian posterior probability method is an example of a generative model, though their function sets are the same.
We will not obtain the same set of $\boldsymbol{w}$ and $b$ from the two approaches, because the generative model makes extra assumptions about the data distribution. The discriminative model usually performs better.
A toy example shows why the generative model tends to perform less well. We assume Naive Bayes here, namely $P(\boldsymbol{x}|C_i) = P(x_1|C_i) P(x_2|C_i)$.
Softmax will further enhance the maximum value, making it even closer to $1$.
Like the binary classification case earlier, the multiclass classification aims to maximize likelihood, which is the same as minimizing cross entropy.
Solution: feature transformation
However, it is not always easy to find a good transformation. We can cascade logistic regression models.
In a linear model, when the values in different input dimensions have very different ranges, we may see an error surface that is steep in one direction and smooth in the other, as in the left graph below. If we restrict the values in every dimension to the same range, the error surface becomes more "trainable," as in the right graph below.
Recall standardization:
$$
\tilde{x}^{r}_{i} \leftarrow \frac{x^{r}_{i} - m_{i}}{\sigma_{i}}
$$
For the Sigmoid activation function, we can apply feature normalization on $\boldsymbol{z}$ (the values before the activation); in general, normalizing either $\boldsymbol{z}$ or $\boldsymbol{a}$ is acceptable.
When doing feature normalization, we can use element-wise operations on the vectors $\boldsymbol{z}^i$ within a batch: compute the mean $\boldsymbol{\mu}$ and standard deviation $\boldsymbol{\sigma}$ over the batch and let $\boldsymbol{\tilde{z}}^i = (\boldsymbol{z}^i - \boldsymbol{\mu}) / \boldsymbol{\sigma}$.
Notice that now, if one example's $\boldsymbol{z}$ changes, $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ change, and so do the outputs for all other examples in the batch; the examples are no longer processed independently. This is why the method is called Batch Normalization, and it is only suitable when the batch size is large enough.
We can also make a small improvement:
$$ \boldsymbol{\hat{z}}^{i} = \boldsymbol{\gamma} \odot \boldsymbol{\tilde{z}}^{i} + \boldsymbol{\beta} $$
We set $\boldsymbol{\gamma}$ to $[1,1,...]^T$ and $\boldsymbol{\beta}$ to $[0,0,...]^T$ at the start of training (they are *parameters* of the network). This means the ranges are still the same across dimensions at the beginning. As training goes on, we may want to lift the constraint that each dimension has a mean of 0.
We do not always have a batch at testing stage.
At testing time we therefore use the moving averages of $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ computed during training instead of the batch statistics.
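A small numpy sketch of the batch-normalization forward pass described above ($\gamma$ starts at ones, $\beta$ at zeros; the running averages stand in for the test-time statistics; shapes and the momentum value are illustrative):

```python
import numpy as np

def batch_norm(Z, gamma, beta, run_mu, run_sigma, momentum=0.9, training=True, eps=1e-5):
    """Z has shape (batch, features). Returns normalized activations and updated running stats."""
    if training:
        mu = Z.mean(axis=0)                         # per-dimension mean over the batch
        sigma = Z.std(axis=0)                       # per-dimension std over the batch
        run_mu = momentum * run_mu + (1 - momentum) * mu
        run_sigma = momentum * run_sigma + (1 - momentum) * sigma
    else:
        mu, sigma = run_mu, run_sigma               # no batch at test time: use moving averages
    Z_tilde = (Z - mu) / (sigma + eps)
    return gamma * Z_tilde + beta, run_mu, run_sigma

d = 4
Z = np.random.randn(8, d) * 5 + 3                   # a toy batch with shifted, scaled features
out, mu_run, sigma_run = batch_norm(Z, np.ones(d), np.zeros(d), np.zeros(d), np.ones(d))
print(out.mean(axis=0), out.std(axis=0))            # roughly 0 and 1 in every dimension
```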
A neuron does not have to see the whole image. Every receptive field has a set of neurons (e.g. 64 neurons).
The same patterns appear in different regions. Every receptive field has the neurons with the same set of parameters.
The convolutional layer produces a feature map.
The feature map becomes a "new image." Each filter convolves over the input image.
A filter of size 3x3 will not cause the problem of missing "large patterns." This is because, as we go deeper into the network, each filter effectively sees a broader region of the original image (as seen from the illustration below).
Subsampling the pixels will not change the object. Therefore, we always apply Max Pooling (or other methods) after convolution to reduce the computation cost.
CNN is not invariant to scaling and rotation (we need data augmentation).
We often use word embedding for sentences. Each word is a vector, and therefore the whole sentence is a vector set.
Audio can also be represented as a vector set. We often use a vector to represent a small window (frame) of the audio signal and slide the window across the signal to obtain the sequence.
Graph is also a set of vectors (consider each node as a vector of various feature dimensions).
Each vector has a label (e.g. POS tagging). This is also called sequence labeling.
The whole sequence has a label (e.g. sentiment analysis).
It is also possible that the model has to decide the number of labels itself (e.g. translation). This is also called seq2seq.
The self-attention module will try to consider the whole sequence and find the relevant vectors within the sequence, based on the attention score $\alpha$ between every pair of vectors.
There are many ways to calculate $\alpha$:
Additive:
Dot-product: (the most popular method)
The attention scores will then pass through softmax (not necessarily softmax; ReLU is also possible).
We will then extract information based on attention scores (after applying softmax).
If a vector obtains a high attention score, its value $\boldsymbol{v}$ will dominate the extracted result $\boldsymbol{b}$.
The creation of each $\boldsymbol{b}^i$ does not depend on the others, so all of them can be computed in parallel.
We can also view self-attention using matrix algebra.
Since every $\boldsymbol{q}^i$ is produced by multiplying the same matrix $W^q$ with $\boldsymbol{a}^i$ (and similarly for $\boldsymbol{k}^i$ and $\boldsymbol{v}^i$), we can stack the input vectors as the columns of a matrix $I$.
Remember that:
As a result, we can write $Q = W^q I$, $K = W^k I$ and $V = W^v I$.
In addition, we can use the same method for calculating attention scores:
Here, since every attention score is an inner product, $\alpha_{i,j} = (\boldsymbol{k}^j)^T \boldsymbol{q}^i$, the whole score matrix can be written as $A = K^T Q$.
By applying softmax column-wise, we make sure that every column of $A'$ sums to $1$.
We use the same method to write the final output: $O = V A'$.
This is based on matrix-vector rules:
Summary of self-attention: the process from the input matrix $I$ to the output matrix $O$ is just a sequence of matrix multiplications; the only parameters to be learned are $W^q$, $W^k$ and $W^v$.
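A minimal numpy sketch of these matrix operations, following the column-vector convention used above ($Q = W^q I$, $A = K^T Q$, $O = V A'$); the dimensions are arbitrary:

```python
import numpy as np

def softmax_cols(A):
    A = A - A.max(axis=0, keepdims=True)
    e = np.exp(A)
    return e / e.sum(axis=0, keepdims=True)        # every column sums to 1

d, n, dk = 8, 5, 4                                  # input dim, sequence length, query/key dim
I = np.random.randn(d, n)                           # input vectors a^1 ... a^n as columns
W_q, W_k = np.random.randn(dk, d), np.random.randn(dk, d)   # the only parameters to learn
W_v = np.random.randn(d, d)

Q, K, V = W_q @ I, W_k @ I, W_v @ I                 # queries, keys, values
A = K.T @ Q                                         # attention scores alpha_{i,j} = k^j . q^i
A_prime = softmax_cols(A)                           # normalized attention scores
O = V @ A_prime                                     # output vectors b^1 ... b^n as columns
print(O.shape)                                      # (d, n): one output per input vector
```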
We may have different metrics of relevance. As a result, we may consider multi-head self-attention, a variant of self-attention.
We can then combine the outputs of the different heads (for example, concatenate them and multiply by another matrix) to get the final output $\boldsymbol{b}^i$.
Self-attention does not care about position information. For example, it does not care whether a vector is at the beginning or the end of the sequence. We can add a positional encoding $\boldsymbol{e}^i$ to each input to provide position information.
If the input sequence is of length $L$, the attention score matrix is $L \times L$, which can be very expensive to compute for long sequences; for speech we may use Truncated Self-attention, which only attends within a window.
What's its difference with CNN?
- CNN is self-attention that can only attend within a receptive field. Self-attention is a CNN with a learnable receptive field.
- CNN is simplified self-attention. Self-attention is the complex version of CNN.
Self-attention is more flexible and therefore more prone to overfitting if the dataset is not large enough. We can also use Conformer, a combination of the two.
Self-attention is a more complex version of RNN. RNN can be bi-directional, so it is possible to consider the whole input sequence like self-attention does. However, RNN struggles to keep a vector from the start of the sequence in memory, and it is computationally expensive because of its sequential (non-parallel) nature.
This is one type of GNN.
How to represent a word?
The vector has the length of the lexicon (1-of-N encoding). Each dimension corresponds to a word in the lexicon: the dimension for that word is 1 and the others are 0. For example, with lexicon = {apple, bag, cat, dog, elephant}, apple = [1 0 0 0 0].
The memory units must have an initial value. Changing the order of the sequence will change the output.
The same network is used again and again. If the words before a particular word are different, the values stored in the memory will be different, causing the output probability for the same word to be different.
We can also make the RNN deeper:
The Jordan Network tends to have better performance, possibly because we know exactly what is stored in the memory unit (the output itself).
We can also train two networks at once. This way, the network can consider the entire input sequence.
Why LSTM? The RNN wipes out its memory at every new timestamp and therefore has a really short short-term memory. The LSTM, however, can hold on to the memory as long as the Forget Gate decides to keep it.
This is the structure of a LSTM cell/neuron:
When the input gate signal is close to $1$, the input is allowed into the cell; when it is close to $0$, the input is blocked.
When the forget gate signal is close to $1$, the previous memory is kept; when it is close to $0$, the memory is erased.
The same story applies to the output gate signal, which controls whether the stored memory can be read out.
An LSTM neuron has four times more parameters than a vanilla neural network neuron. In a vanilla neural network, every neuron maps an input vector to an output scalar. In an LSTM, the neuron needs four inputs: the cell input plus the three gate-control signals.
Assume the total number of cells is $n$. At each timestamp we collect the memories and the gate signals of all cells into vectors.
We then apply the activation functions to these vectors element-wise.
Using element-wise operations on these vectors, we can conduct the computations for all $n$ cells at the same time (a small sketch of one such time step follows below).
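A compact numpy sketch of one LSTM time step in this element-wise form (weights are random, peephole connections are omitted, and the wiring is a generic formulation rather than the lecture's exact diagram):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, d = 3, 4                                    # number of cells, input dimension
rng = np.random.default_rng(0)
W_z, W_i, W_f, W_o = (rng.normal(size=(n, d)) for _ in range(4))   # four sets of weights

def lstm_step(x, c_prev):
    z  = np.tanh(W_z @ x)                      # cell input g(z)
    zi = sigmoid(W_i @ x)                      # input gate: lets the input in when close to 1
    zf = sigmoid(W_f @ x)                      # forget gate: keeps the old memory when close to 1
    zo = sigmoid(W_o @ x)                      # output gate: lets the memory out when close to 1
    c = zf * c_prev + zi * z                   # element-wise update of all n memories at once
    h = zo * np.tanh(c)                        # outputs of all n cells
    return c, h

c = np.zeros(n)
for t in range(5):                             # the same cell is reused across the sequence
    c, h = lstm_step(rng.normal(size=d), c)
print(c, h)
```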
RNN training relies on Backpropagation Through Time (BPTT), a variant of backpropagation. RNN-based network is not always easy to learn.
This is why it is difficult to train RNN. When we are on the flat plane, we often have a large learning rate. If we then happen to step onto the edge of the steep cliff, we may jump really far (because of the high learning rate and the high gradient). This can make the optimization blow up or even crash the program.
This can be solved using Clipping. We can set a maximum threshold for the gradient, so that it does not become really high.
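As an aside, a minimal PyTorch sketch of such clipping (the toy model, data, and threshold are illustrative; `clip_grad_norm_` rescales gradients whose overall norm exceeds the threshold):

```python
import torch

model = torch.nn.Linear(4, 1)                              # toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(8, 4), torch.randn(8, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()

# Clip the gradient norm to a maximum threshold before the update step.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```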
Why do we observe this kind of behavior for RNN's error surface?
We can notice that the same weight $w$ is multiplied again and again across timestamps, so a small change in $w$ may have either an enormous effect or almost no effect on the final output. This is the source of gradient explosion and gradient vanishing.
One solution is LSTM. It can deal with gradient vanishing, but not gradient explosion. As a result, most points on the LSTM error surface still have a high gradient, so when training LSTM we can set the learning rate to a relatively small value.
Why does LSTM solve gradient vanishing? Memory and input are added, so the influence of an input never disappears unless the Forget Gate is closed.
An alternative option is the Gated Recurrent Unit (GRU). It is simpler than LSTM: it has only 2 gates, combining the Forget Gate and the Input Gate into one.
RNN can also be applied on many-to-one tasks (input is a vector sequence, but output is only one output), such as sentiment analysis. We can set the output of RNN at the last timestamp as our final output.
RNN can be used on many-to-many tasks (both input and output are vector sequences, but the output is shorter). For example, when doing speech recognition task, we can use Connectionist Temporal Classification (CTC).
RNN can also be applied on sequence-to-sequence learning (both input and output are both sequences with different lengths), such as machine translation.
How do we utilize the structures and relationship to help train our model?
- Aggregation: use neighbor features to update hidden states in the next layer
- Readout: use the features of all the nodes to represent the whole graph $h_G$
Node features:
We input a sequence and the model output a sequence. The output length is determined by the model. We use it for speech recognition and translation.
Seq2seq is widely used for QA tasks. Most NLP applications can be viewed as QA tasks. However, we oftentimes use specialized models for different NLP applications.
Seq2seq can also be used on multi-label classification (an object can belong to multiple classes). This is different from multi-class classification, in which we need to classify an object into one class out of many classes.
The basic components of seq2seq are an encoder and a decoder:
We need to output a sequence that has the same length as the input sequence. We can technically use CNN or RNN to accomplish this. In Transformer, they use self-attention.
The state-of-the-art encoder architecture looks like this:
The Transformer architecture looks like this:
Residual connection is a very popular technique in deep learning:
The decoder receives as input its own output from the previous timestamp. If the decoder made a mistake in the previous timestamp, it will continue from that mistake. This may cause error propagation.
The encoder and the decoder of the Transformer is actually quite similar if we hide one part of the decoder.
Masked self-attention: when computing the output at position $i$, we only attend to positions $1$ through $i$; the decoder cannot look at future tokens, because during generation they have not been produced yet.
We also want to add a stop token (along with the vocabulary and the start token) to give the decoder a mechanism that it can control the length of the output sequence.
How to decide the output length for an NAT decoder?
- Use another predictor for the output length.
- Determine a maximum possible sequence length $n$, feed the decoder with $n$ START tokens, output a very long sequence, and ignore the tokens after END.
Advantage: parallel (relying on self-attention), more stable generation (e.g., TTS) -- we can control the output-length classifier to manage the length of output sequence
NAT is usually worse than AT because of multi-modality.
Cross Attention:
This is very similar to how we train a classification model: every time the model creates an output, it makes a classification.
Our goal is to minimize the sum of cross entropy of all the outputs.
Teacher forcing: using the ground truth as input.
Sometimes we may want the model to just copy the input token. For example, consider a chatbot, when the user inputs "Hi, I'm ___," we don't expect the model to generate an output of the user's name because it's not likely in our training set. Instead, we want the model to learn the pattern: when it sees input "I'm [some name]," it can output "Hi [some name]. Nice to meet you!"
In some tasks, input and output are monotonically aligned, for example speech recognition and TTS (text-to-speech). We don't want the model to miss important portions of the input sequence in those tasks.
We want to force the model to learn a particular order of attention.
- monotonic attention
- location-aware attention
The red path is greedy decoding. However, if we give up a little bit at the start, we may get a better global optimal path. In this case, the green path is the best one.
We can use beam search as a heuristic to approximate the best path.
However, sometimes randomness is needed for decoder when generating sequence in some tasks (e.g. sentence completion, TTS). In those tasks, finding a "good path" may not be the best thing because there's no correct answer and we want the model to be "creative." In contrast, beam search may be more beneficial to tasks like speech recognition.
At training, the model always sees the "ground truth" -- correct input sequence. At testing, the model is fed with its own output in the previous round. This may cause the model to underperform because it may never see a "wrong input sequence" before (exposure bias). Therefore, we may want to train the model with some wrong input sequences. This is called Scheduled Sampling. But this may hurt the parallel capability of the Transformer.
Generator is a network that can output a distribution.
Why do we bother to add a distribution to our network?
In this video game frame-prediction example, the model is trained on a dataset containing two coexisting possibilities -- the character turning left and turning right. A vanilla network will seek to balance the two, so it may create a frame in which the character splits into two: one turning left and one turning right.
This causes problems. Therefore, we want to add a distribution into the network. By doing so, the output of the network will also become a distribution itself. We especially prefer this type of network when our tasks need "creativity." The same input has different correct outputs.
For unconditional generation, we don't need the input $x$; the generator only takes $z$ sampled from a simple (e.g., normal) distribution.
GAN architecture also has a discriminator. This is just a vanilla neural network (CNN, transformer ...) we've seen before.
The adversarial process of GAN looks like this:
The algorithm is:
- (Randomly) initialize generator and discriminator's parameters
- In each training iteration:
- Fix generator $G$ and update discriminator $D$. This task can be seen as either a classification problem (labeling real images as $1$ and generator-generated images as $0$) or a regression problem. We want the discriminator to learn to assign high scores to real objects and low scores to generated objects.
- Fix discriminator $D$ and update generator $G$. The generator learns to "fool" the discriminator. We can use gradient ascent to train the generator while freezing the parameters of the discriminator.
The GAN can also learn different angles of a face. For example, we can interpolate between a vector that produces a face looking to the left and a vector that produces a face looking to the right; feeding the interpolated vector into the generator gives a face looking to the middle.
where $\text{Div}(P_G, P_{\text{data}})$ is the divergence between the generated distribution $P_G$ and the real data distribution $P_{\text{data}}$.
The hardest part of GAN training is how to compute the divergence. But sampling is good enough: although we do not know the actual distributions $P_G$ and $P_{\text{data}}$, we can sample from them.
For the discriminator, we want to maximize the following objective (this is equivalent to training a binary classifier):
$$ V(G, D) = \mathbb{E}_{y \sim P_{\text{data}}} [\log D(y)] + \mathbb{E}_{y \sim P_G} [\log (1 - D(y))] $$
Since we want to maximize $V(G, D)$, the discriminator wants $D(y)$ to be large for real data and small for generated data.
Recall that cross-entropy is what we minimize when training a binary classifier.
Since we often minimize cross-entropy, we can find similarities here as well: maximizing $V(G, D)$ is the same as minimizing the classifier's cross-entropy.
In addition, the maximum value of $V(G, D)$ over $D$ is related to the JS divergence between $P_{\text{data}}$ and $P_G$.
Therefore, we can replace the divergence in the generator's objective with $\max_D V(D, G)$, giving $G^* = \arg\min_G \max_D V(G, D)$.
This is how the GAN algorithm was designed (to solve the optimization problem above).
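A minimal PyTorch sketch of this alternating procedure on toy 2-D data (architectures, data and hyperparameters are all illustrative, not the lecture's setup):

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 2))    # generator: z -> fake sample
D = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))    # discriminator: sample -> score
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(200):
    real = torch.randn(64, 2) + torch.tensor([3.0, 3.0])            # toy "real" distribution
    z = torch.randn(64, 2)

    # 1) Fix G, update D: real samples -> label 1, generated samples -> label 0.
    fake = G(z).detach()                                            # freeze the generator here
    loss_D = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # 2) Fix D, update G: the generator tries to make D label its samples as real.
    loss_G = bce(D(G(z)), torch.ones(64, 1))                        # non-saturating generator loss
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```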
GAN is known for its difficulty to be trained.
In most cases, $P_G$ and $P_{\text{data}}$ do not overlap, for two reasons:
- The nature of the data: both $P_G$ and $P_{\text{data}}$ are low-dimensional manifolds in a high-dimensional space. That is, most points in the high-dimensional space are not valid pictures, let alone human faces, so any overlap can be ignored.
- Even when $P_G$ and $P_{\text{data}}$ do overlap, the discriminator can still separate them if we do not have enough samples.
The problem with JS divergence is that it always outputs $\log 2$ whenever the two distributions do not overlap, no matter how close or far apart they are.
In addition, when two distributions do not overlap, a binary classifier can always achieve $100\%$ accuracy.
The accuracy (or loss) means nothing during GAN training.
Considering one distribution $P$ as a pile of earth and another distribution $Q$ as the target, the Wasserstein Distance is the average distance the earth mover has to move the earth. In the simple case above, where each distribution is concentrated at a single point, the Wasserstein distance is just the distance between the two points.
However, when we consider two distributions, the distance can be difficult to calculate.
Since there are many possible "moving plans," we use the “moving plan” with the smallest average distance to define the Wasserstein distance.
$$ W(P_{\text{data}}, P_G) = \max_{D \in \text{1-Lipschitz}} \left\{ \mathbb{E}_{y \sim P_{\text{data}}} [D(y)] - \mathbb{E}_{y \sim P_{G}} [D(y)] \right\} $$
When the two distributions are very close, the two expectations cannot be too far apart, because $D$ must be 1-Lipschitz (smooth enough); this keeps the resulting value small. When the distributions are far apart, the value can be large. This is why the 1-Lipschitz constraint is essential.
GAN is still challenging to train because if either the generator or the discriminator fails to improve, the other will fail too.
More tips on how to train GAN:
GAN can also be applied on sequence generation:
In this case, the seq2seq model becomes our generator. However, this can be very hard to train. Since a tiny change in the parameter of generator will likely not affect the output of the generator (since the output is the most likely one), the score stays unchanged.
As a result, you can not use gradient descent on this. Therefore, we usually apply RL on Sequence Generation GAN.
Usually, the generator is fine-tuned from a model learned by other approaches. However, with enough hyperparameter tuning and tips, ScratchGAN can be trained from scratch.
There are also other types of generative models: VAE and Flow-Based Model.
There are some other possible solutions. We can assign a vector to every image in the training set. We can then train the model using the vector and image pair.
Early in the start of GAN, we rely on human evaluation to judge the performance of generation. However, human evaluation is expensive (and sometimes unfair/unstable). How to evaluate the quality of the generated images automatically?
We can use an image classifier to solve this problem:
If the generated image does not look like a real image, the classifier will output a more evenly spread distribution.
However, the generator could still be subject to a problem called Mode Collapse (a diversity problem). After generating more and more images, you could observe that the generated images look mostly the same.
This could be because the generator can learn from the discriminator's weakness and focus on that particular weakness.
Mode Dropping is more difficult to detect: the distribution of generated data is diverse, but it is still only a portion of the real distribution.
Good quality and large diversity will produce a larger IS.
All the points are the direct results (before being passed into softmax) produced by CNN. We also feed real images into the CNN to produce the red points. We assume both blue and red points come from Gaussian distributions. This may sometimes be problematic. In addition, to accurately obtain the distributions, we may need a lot of samples, which may lead to a huge computation cost.
However, we don't want a "memory GAN." If the GAN just outputs training samples (or alters them a tiny amount, e.g., by flipping), it may obtain a very good (low) FID score, but this is not what we want.
Conditional generation is useful for text-to-image tasks.
The discriminator will look at two things:
- whether the output image $y$ is realistic or not
- whether the text input $x$ and the image $y$ are matched or not
It's important that we also add realistic images that do not match the text input as (negative) training data.
It's also common to apply Conditional GAN on image translation, i.e. pix2pix.
We can technically use either supervised learning or GAN to design the model. However, with supervised learning, since there are many correct outputs for a given input, the model may try to average over all of them. The best results come from using GAN and supervised learning together.
Conditional GAN can also be applied on sound-to-image generation.
GAN can also be applied on unsupervised learning. In some cases like Image Style Transfer, we may not be able to obtain any paired data. We could still learn mapping from unpaired data using Unsupervised Conditional Generation.
Instead of sampling from a Gaussian distribution as in vanilla GAN, we sample from a particular domain $\mathcal{X}$ (e.g., real faces) and want the generator to produce images in the target domain $\mathcal{Y}$ (e.g., anime faces).
However, the model may try to ignore the input: as long as it generates something that looks like domain $\mathcal{Y}$, it can satisfy the discriminator even if the output has nothing to do with the input. Cycle GAN addresses this by adding a second generator that maps the output back to domain $\mathcal{X}$ and requiring the reconstruction to be close to the original input (cycle consistency).
In theory, the two generators may learn some strange mappings that prevent the model from actually output a related image. However, in practice, even with vanilla GAN, image style transfer can be learned by the model (the model prefers simple mapping, i.e. output something that looks like the input image).
Cycle GAN can also be in both ways:
It can also be applied on text-style transfer. This idea is also applied on Unsupervised Abstractive Summarization, Unsupervised Translation, Unsupervised ASR.
What does the denoise module look like specifically? It's easier to train a noise predictor than to train a model that can "draw the cat" from scratch.
How can we obtain the training dataset for training a noise predictor? This involves the Forward Process, a.k.a. Diffusion Process.
We can then use our dataset to train the model.
How do we use it for text-to-image tasks?
Self-supervised learning is a form of unsupervised learning.
BERT is a transformer encoder, outputting a sequence of the same length as the input sequence.
We randomly mask some tokens with either a special token or some random tokens. We then aim to train the model to minimize the cross entropy between the output and the ground truth. During training, we aim to update parameters of both the BERT and the linear model.
This task tries to train the model to output a "yes" for thinking that sentence 2 does follow sentence 1 and a "no" for thinking that sentence 2 does not follow sentence 1. However, this approach is not very useful probably because this task is relatively easy and the model does not learn very much. An alternative way is to use SOP, which trains the model to learn whether sentence 1 is before sentence 2 or after sentence 2.
With self-supervised learning, we pre-train the model, which prepares the model for downstream tasks (tasks we care and tasks that we've prepared a little bit labeled data). We can then fine-tune the model for different tasks.
With a combination of unlabelled and labelled dataset, this training is called semi-supervised learning.
GLUE (General Language Understanding Evaluation) has 9 tasks.
Example: sentiment analysis
Note that here we only randomly initialize the parameters of the linear model, not BERT. Also, it's important that we train the model with a labelled dataset.
Fine-tuning the model can accelerate the optimization process and improve the performance (lower the loss), as shown in the graph below.
Example: POS tagging
Example: Natural Language Inference (NLI)
In this case, as usual, both the linear model and BERT are updated, with BERT initialized from the pre-trained weights.
The orange and blue vectors are the only two set of parameters we need to train.
It's also possible to pre-train a seq2seq model. We can corrupt the input to the Encoder and let the Decoder reconstruct the original input.
There're many ways you can use to corrupt the input.
The tokens with similar meaning have similar embedding. BERT can output token vectors that represent similar words with high cosine similarity.
There was a similar earlier work called CBOW word embedding. Using deep learning, BERT appears to perform better: it can learn contextualized word embeddings.
However, learning context of a word may not be the whole story. You can even use BERT to classify meaningless sentences. For example, you can replace parts of a DNA sequence with a random word so that a whole DNA sequence becomes a sentence. You can even use BERT to classify the DNA sequences represented with sentences. Similar works have been done on protein, DNA and music classification.
BERT can also be trained on many languages.
Cross-Lingual Alignment: assign similar embeddings to the same words from different languages.
Mean Reciprocal Rank (MRR): Higher MRR, better alignment
GPT uses a different pre-training method -- next token prediction.
Auto-encoder is often used for dimension reduction (the output vector of the NN encoder). This idea is very similar to Cycle GAN.
The variation of real pictures is very limited. There are many possible combinations of pixel values, but only a small subset corresponds to real pictures, which is why the encoder can compress an image into a much lower-dimensional code.
BERT can actually be seen as a de-noising auto-encoder. Note that the decoder does not have to be a linear model.
The embedding includes information of different aspects.
However, we don't need to know exactly which dimensions of an embedding contain a particular aspect of information.
Related papers:
Application: Voice Conversion
In the last example, by constraining the embedding to be one-hot, the encoder is able to learn a classification problem (unsupervised learning).
The most famous example is: Vector Quantized Variational Auto-encoder (VQVAE).
The parameters we learn in VQVAE are: the Encoder, the Decoder, and the Codebook (a set of code vectors).
This process is very similar to attention. We can think of the output of the Encoder as the Query, the Codebook as Values and Keys.
This method constrains the input of the Decoder to be one of the inputs in the Codebook.
Using this method on speech, the model can learn phonetic information which is represented by the Codebook.
We can achieve unsupervised summarization using the architecture above (actually it's just a CycleGAN model). We only need crawled documents to train the model.
Both the Encoder and the Decoder should use seq2seq models because they accept sequence as input and output. The Discriminator makes sure the summary is readable. This is called a seq2seq2seq auto-encoder.
We can also gain a generator by using the trained Decoder in auto-encoder.
With some modification, we have variational auto-encoder (VAE).
The image reconstruction is not
Anomaly Detection: given a set of training data $\{x^1, x^2, ..., x^N\}$, detect whether a new input $x$ is similar to the training data or not.
Use cases:
Anomaly Detection is better than training a binary classifier because we often don't have a dataset of anomalies (oftentimes we humans can't even detect the anomalies ourselves); our dataset is primarily composed of "normal" data. Anomaly Detection is often called one-class classification.
This is how we implement an anomaly detector:
There are many other ways to implement anomaly detection besides this kind of detector.
Some models are intrinsically interpretable.
- For example, linear model (from weights, you know the importance of features)
- But not very powerful
- Deep networks are difficult to interpret. They are black boxes, but more powerful than linear models.
For a cat classifier:
Local Explanation: why do you think this image is a cat?
Global Explanation: what does a cat look like? (not referring to a specific image)
For an object $x$ made of components $\{x_1, ..., x_N\}$ (e.g., pixels of an image or tokens of a document):
We ask the question: "which components are critical for making the decision?" We can remove or modify the components; the important components are those whose change causes a large change in the decision.
Sometimes, our Saliency Map may appear noisy (noisy gradient). We can use SmoothGrad to solve that -- Randomly add noises to the input image, get saliency maps of the noisy images, and average them.
However, there may be problems with gradient saturation. Gradients can not always reflect importance. In this case, a longer trunk size does not make the object more elephant. We can use Integrated Gradient to solve this problem.
We can learn that the network knows that the same sentence spoken by different speakers is similar (it erases the identity of the speaker).
Using this method, we can tell whether the embeddings contain POS information by evaluating the accuracy of a POS (Part of Speech) or NER (Named-Entity Recognition) classifier trained on them. If the accuracy is high, the embeddings do contain that information.
If the classifier itself has very high bias, however, we cannot really tell from its accuracy whether the embeddings contain that information.
We can also train a TTS model that takes the output of a particular hidden layer as input. If the TTS model outputs an audio sequence that has no speaker information, we can conclude that the model has learned to erase speaker information when doing audio-to-text tasks.
We can know from the feature maps (the outputs of a filter) what patterns a particular filter is responsible for detecting. As a result, we can search for the image $X^*$ that maximizes the activation of that filter.
$X^*$ is the image that contains the patterns filter 1 can detect. We can find $X^*$ using gradient ascent. An example of the resulting images is shown below.
However, if we try to find the best image (that maximizes the classification probability),
we fail to see any meaningful pattern:
The images should look like a digit. As a result, we need to add a constraint $R(X)$ to the objective, $X^* = \arg\max_X \big( y_i + R(X) \big)$, where $R(X)$ measures how much $X$ looks like a digit (for example, penalizing the total pixel intensity).
To make the images clearer, we may need several regularization terms, and hyperparameter tuning...
For example, we can append an image generator $G$ in front of the classifier and search in the generator's latent space instead.
We can then get the image we want by calculating $z^* = \arg \max_{z} y_i$ and showing the image $X^* = G(z^*)$.
We can use an interpretable model (linear model) to mimic the behavior of an uninterpretable model, e.g. NN (not the complete behavior, only "local range").
This idea is called Local Interpretable Model-Agnostic Explanations (LIME)
Assume we know the parameters $\boldsymbol{\theta}$ of the network (a white-box attack).
Our goal is to find a new picture $\boldsymbol{x}$, close to the original $\boldsymbol{x^0}$, that makes the network produce a wrong output.
Since we also want the noise to be as small as possible, we add an additional constraint: $d(\boldsymbol{x^0}, \boldsymbol{x}) \leq \epsilon$, where $\epsilon$ is a small threshold, ideally below human perception.
We also define $\boldsymbol{\Delta x}$ :
$$
\boldsymbol{\Delta x} = \boldsymbol{x} - \boldsymbol{x^0}
$$
This can be seen as:
$$
\begin{bmatrix}
\Delta x_1 \\
\Delta x_2 \\
\Delta x_3 \\
\vdots
\end{bmatrix}
=
\begin{bmatrix}
x_1 \\
x_2 \\
x_3 \\
\vdots
\end{bmatrix}
-
\begin{bmatrix}
x_1^0 \\
x_2^0 \\
x_3^0 \\
\vdots
\end{bmatrix}
$$
There are many ways to calculate the distance $d(\boldsymbol{x^0}, \boldsymbol{x})$ between the two images:
L2-norm:
$$ d(\boldsymbol{x^0}, \boldsymbol{x}) = \|\boldsymbol{\Delta x}\|_2 = \sqrt{(\Delta x_1)^2 + (\Delta x_2)^2 + \dots} $$
L-infinity:
$$ d(\boldsymbol{x^0}, \boldsymbol{x}) = \|\boldsymbol{\Delta x}\|_{\infty} = \max\{|\Delta x_1|, |\Delta x_2|, \dots\} $$
L-infinity is a better metric for images because it fits human perception of image differences (as seen from the example below). We may need other metrics depending on our domain knowledge.
The Loss function can be defined depending on the specific attack:
Non-Targeted Attack:
The loss function is the negative cross-entropy between the network output and the ground-truth label, $L(\boldsymbol{x}) = -e(\boldsymbol{y}, \boldsymbol{\hat{y}})$, because we want the output to be as far from the correct answer as possible.
Targeted Attack:
Since we want to minimize the loss function, we add the cross entropy between the output and the target label (since we want these two to be similar, this term should be small): $L(\boldsymbol{x}) = -e(\boldsymbol{y}, \boldsymbol{\hat{y}}) + e(\boldsymbol{y}, \boldsymbol{y}^{\text{target}})$.
How can we do optimization?
If we assume we don't have the constraint on $d(\boldsymbol{x^0}, \boldsymbol{x})$, the optimization is just like training a network, except that the network parameters are frozen and we update the input $\boldsymbol{x}$ instead:
- Start from the original image $\boldsymbol{x^0}$
- For $t=1$ to $T$: $\boldsymbol{x^t} \leftarrow \boldsymbol{x^{t-1}} - \eta \boldsymbol{g}$
If we consider the constraint, the process is very similar:
- Start from the original image $\boldsymbol{x^0}$
- For $t=1$ to $T$: $\boldsymbol{x^t} \leftarrow \boldsymbol{x^{t-1}} - \eta \boldsymbol{g}$
- If $d(\boldsymbol{x^0}, \boldsymbol{x^t}) > \epsilon$, then $\boldsymbol{x^t} \leftarrow fix(\boldsymbol{x^t})$
If we are using L-infinity, we can fix $\boldsymbol{x^t}$ by clipping each dimension back to within $\epsilon$ of the corresponding dimension of $\boldsymbol{x^0}$ (i.e., project it back into the allowed box).
We can use Fast Gradient Sign Method (FGSM, https://arxiv.org/abs/1412.6572):
Redefine $\boldsymbol{g}$ so that each of its components is just the sign of the corresponding partial derivative ($+1$ or $-1$):
- Start from the original image $\boldsymbol{x^0}$
- Do only one shot: $\boldsymbol{x^1} \leftarrow \boldsymbol{x^{0}} - \epsilon \boldsymbol{g}$
In the case of L-infinity, this single step moves $\boldsymbol{x}$ to one of the corners of the $\epsilon$-box around $\boldsymbol{x^0}$.
There's also an iterative version of FGSM, which is basically doing Step 2 repeatedly for $T$ iterations (projecting back into the box when the constraint is violated).
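A minimal PyTorch sketch of the one-shot idea on an untrained toy classifier (the model, $\epsilon$, and the "image" are illustrative; a real attack would target a trained network):

```python
import torch
import torch.nn.functional as F

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))  # toy classifier
x0 = torch.rand(1, 1, 28, 28)                   # the "original image"
y_true = torch.tensor([3])                      # its ground-truth label
epsilon = 0.05

x = x0.clone().requires_grad_(True)
loss = -F.cross_entropy(model(x), y_true)       # non-targeted attack: push the output away from the truth
loss.backward()

# One shot: move every pixel by exactly epsilon in the direction that increases the error.
x_adv = (x0 - epsilon * x.grad.sign()).clamp(0.0, 1.0)
print((x_adv - x0).abs().max())                 # L-infinity distance stays within epsilon
```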
What if we don't have parameters of a private network?
If you have the training data of the target network:
- Train a proxy network yourself
- Using the proxy network to generate attacked objects
If we don't have training data, we can simply query the target network with many inputs to generate input-output pairs and use them to train the proxy network.
It's also possible to do a One Pixel Attack (add noise to only one pixel of an image) and a Universal Adversarial Attack (use one noise signal to attack all the image inputs for a model).
- An attacker would need to find perturbations that generalize beyond a single image.
- Extreme differences between adjacent pixels in the perturbation are unlikely to be accurately captured by cameras.
- It is desirable to craft perturbations that are comprised mostly of colors reproducible by the printer
- Attack happens in the training phase
- We need to be careful of open public dataset
Passive Defense means that we do not change the parameters of the model.
The overall objective is to apply a filter to make the attack signal less powerful.
One way to implement this is that we can apply Smoothing on the input image. This will alter the attack signal, making it much less harmful. However, we need to pay attention to its side effect -- it may make the classification confidence lower, though it rarely affects the accuracy.
We can also use image compression and generator to implement this.
However, these methods have some drawbacks. For example, if the attackers know these methods are used as defenses, they can obtain a signal that has taken into account these filters (just treat them as the starting layers of the network). Therefore, we can use randomization.
We can use Adversarial Training (training a model that is robust to adversarial attacks). This method is an example of Data Augmentation.
Given a training set $\mathcal{X} = \{(\boldsymbol{x^1}, \hat{y}^1), ..., (\boldsymbol{x^N}, \hat{y}^N)\}$:
- Use $\mathcal{X}$ to train your model.
- For $n = 1$ to $N$: find an adversarial input $\boldsymbol{\tilde{x}^n}$ given $\boldsymbol{x^n}$ by an attack algorithm.
- We then have new training data $\mathcal{X}' = \{(\boldsymbol{\tilde{x}^1}, \hat{y}^1), ..., (\boldsymbol{\tilde{x}^N}, \hat{y}^N)\}$.
- Use both $\mathcal{X}$ and $\mathcal{X}'$ to update your model.
- Repeat the steps above.
However, if we did not consider an attack algorithm in Adversarial Training, this method will not prevent the attack from that algorithm.
However, Adversarial Training is very computationally expensive.
Domain shift: Training and testing data have different distributions
There are also many different aspects of domain shift:
If we have some labeled data points from the target domain, we can train a model by source data, then fine-tune the model by target data. However, since there's only limited target data, we need to be careful about overfitting.
However, the most common scenario is that we have a large amount of unlabeled target data.
We can train a Feature Extractor (a NN) to output the features of the input.
The Domain Classifier is a binary classifier. The goal of the Label Predictor is to be as accurate as possible on source-domain images. The goal of the Domain Classifier is to tell, for any image, which domain it comes from; the Feature Extractor, in turn, learns to fool the Domain Classifier so that source and target features look alike.
If we do not know anything about the target domain, then the problem we are dealing with is Domain Generalization.
Just like ML, RL is also looking for a function, the Actor.
The Actor architecture can be anything: FC NN, CNN, transformer...
Input of NN: the observation of machine represented as a vector or a matrix
Output of NN: each action corresponds to a neuron in the output layer. The outputs represent a probability distribution over actions, and the Actor samples from it rather than always taking the action with the maximum probability.
An episode is the whole process from the first observation $s_1$ until the game ends.
The total reward, a.k.a. return, is what we want to maximize: $R = \sum_{t} r_t$.
A trajectory $\tau$ is the sequence of all the observations and actions: $\tau = \{s_1, a_1, s_2, a_2, ...\}$.
Reward is a function of the action and the observation: $r_t = r(s_t, a_t)$.
This idea is similar to GAN. The Actor is like a Generator; the Reward and the Environment together are like the Discriminator. However, in RL, the Reward and the Environment are not NN. This makes the problem not able to be solved by gradient descent.
To control the Actor, we make it take (or not take) a specific action $\hat{a}$ given a specific observation $s$.
By defining the loss and the labels, we can achieve the following goal:
This is almost like supervised learning. We can train the Actor by getting a dataset:
We can also redefine the loss by introducing a weight $A_n$ for each {observation, action} pair, expressing how strongly we want the Actor to take (or avoid) that action.
The difficulty is how to determine the values of $A_n$.
We can start with a randomly-initialized Actor and run it for many episodes. This helps us collect some data. If we see an observation-action pair that produces a positive reward, we assign it a positive $A$.
However, generally speaking, this is not a very good strategy since actions are not independent.
- An action affects the subsequent observations and thus subsequent rewards.
- Reward delay: Actor has to sacrifice immediate reward to gain more long-term reward.
- In Space Invaders, only "fire" yields positive reward, so Version 0 will learn an actor that always fires.
We need to take into account all the rewards gained after performing one action at timestamp $t$: the cumulated reward $G_t = \sum_{n=t}^N r_n$.
However, if the sequence is long, the rewards received much later should not be fully credited to the action at timestamp $t$. We therefore discount future rewards by a factor $\gamma < 1$:
Discounted Cumulated Reward: $$ A_t = G_t' = \sum_{n=t}^N \gamma^{n-t} r_n $$
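A small sketch of computing these discounted cumulated rewards for every timestamp of an episode ($\gamma$ and the reward list are illustrative):

```python
def discounted_returns(rewards, gamma=0.99):
    """A_t = G'_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..."""
    G = 0.0
    returns = []
    for r in reversed(rewards):        # accumulate from the end of the episode backwards
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))

print(discounted_returns([0.0, 0.0, 1.0, 0.0, 5.0]))
```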
But good or bad rewards are "relative": if all the rewards are positive, even a poor action receives a positive weight. We therefore subtract a baseline $b$ so that $A_t$ can be both positive and negative.
- Initialize the Actor NN parameters $\theta^0$
- For training iteration $i=1$ to $T$:
  - Use Actor $\theta^{i-1}$ to interact with the environment.
  - We then obtain the data/training set $\{s_1, a_1\}, \{s_2, a_2\}, ..., \{s_N, a_N\}$. Note that the data collection is inside the `for` loop of training iterations.
  - Compute $A_1, A_2, ..., A_N$.
  - Compute the loss $L=\sum_n A_n e_n$.
  - Update the Actor parameters using gradient descent: $\theta^i \leftarrow \theta^{i-1} - \eta \nabla L$ (a minimal sketch of these steps follows this list).
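A minimal PyTorch sketch of the loss and update steps above, assuming the collected observations, chosen actions, and weights $A_n$ are already available as tensors (everything here is a toy stand-in for real environment interaction):

```python
import torch
import torch.nn.functional as F

actor = torch.nn.Sequential(torch.nn.Linear(4, 32), torch.nn.ReLU(), torch.nn.Linear(32, 2))
optimizer = torch.optim.Adam(actor.parameters(), lr=1e-3)

# Pretend these came from interacting with the environment using the current actor.
states = torch.randn(10, 4)                  # s_1 ... s_N
actions = torch.randint(0, 2, (10,))         # a_1 ... a_N actually taken
A = torch.randn(10)                          # A_1 ... A_N (e.g. discounted returns minus a baseline)

logits = actor(states)
e = F.cross_entropy(logits, actions, reduction="none")   # e_n: cross-entropy w.r.t. the taken action
loss = (A * e).sum()                         # L = sum_n A_n * e_n
optimizer.zero_grad()
loss.backward()
optimizer.step()                             # theta^i <- theta^{i-1} - eta * grad L
```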
This process (from 2.1 to 2.3) is very expensive: each time you update the model parameters, you need to collect the whole training set again. However, this is necessary because the collected data reflects the experience of Actor $\theta^{i-1}$; after the update, Actor $\theta^i$ may behave differently and needs its own experience.
On-Policy Learning: the actor to train and the actor for interacting are the same.
Off-Policy Learning: the actor to train and the actor for interacting are different. In this way, we don't have to collect data after each update
- One example is Proximal Policy Optimization (PPO). The actor to train has to know its difference with the actor to interact.
Exploration: The actor needs to have some randomness during data collection (remember that we sample our action). If the initial setting of the actor is always performing one action, then we will never know whether other actions are good or bad. If we don't have randomness, we can't really train the actor. With exploration, we can collect more diverse data. We sometimes want to even deliberately add randomness. We can enlarge output entropy or add noises onto parameters of the actor.
Critic: given an actor $\theta$, the critic evaluates how good the actor is when it sees a particular observation $s$ (and possibly takes a particular action $a$).
- An example is value function
$V^{\theta}(s)$ : When using actor$\theta$ , the discounted cumulated reward (see Version 2) expected to be obtained after seeing$s$ . Note that since the function depends on$\theta$ , the same observation$s$ will have a different associated value if the actors are different. The output values of a critic depend on the actor evaluated.
How to train a critic?
One way is the Monte-Carlo (MC) based approach: the critic watches actor $\theta$ play whole episodes and learns to predict, for each observation, the discounted cumulated reward that was eventually obtained.
Another way is the Temporal-Difference (TD) approach. We don't need to wait for the entire episode, because consecutive values are related by $V^{\theta}(s_t) = r_t + \gamma V^{\theta}(s_{t+1})$.
MC and TD may produce different value estimates from exactly the same observed data.
That is because MC and TD make different assumptions about how the observations relate to each other, so they can reach different conclusions from the same data.
How can we use critic on training actor?
Recall that in Version 3 we introduced a baseline $b$. A good choice for the baseline is the value $V^{\theta}(s_t)$, giving $A_t = G_t' - V^{\theta}(s_t)$.
Why does this method work? Remember that $V^{\theta}(s_t)$ is the expected discounted cumulated reward, averaged over the actions the actor might take at $s_t$; if a sampled $G_t'$ exceeds this average, the action taken was better than average.
However, Version 3 is still problematic because $G_t'$ comes from a single sampled trajectory, which may have a large variance.
Therefore, for a pair $(s_t, a_t)$ we replace the single-sample $G_t'$ with its expected value $r_t + V^{\theta}(s_{t+1})$, giving $A_t = r_t + V^{\theta}(s_{t+1}) - V^{\theta}(s_t)$ (the Advantage Actor-Critic approach).
The parameters of actor and critic can be shared. In practice, the first several layers are shared in common. If the observation is an image, then the convolutional layers will be shared.
In tasks like a robot arm bolting on screws, the reward is almost always $0$ (sparse reward), so it is hard for the actor to learn. We can define extra rewards to guide the agent; this is called reward shaping.
For example, we can incentivize the actor to do different things by defining extra rewards based on our domain knowledge.
We can also add curiosity as an extra reward: obtaining extra reward when the agent sees something new (but meaningful).
Defining a reward can be challenging in some tasks. Hand-crafted rewards can lead to uncontrolled behavior.
We can use Imitation Learning. We assume that actor can interact with the environment, but reward function is not available.
We can record demonstrations of an expert (for example, recordings of human drivers when training self-driving cars). We define each demonstration as a trajectory $\hat{\tau}$ of the expert. The most straightforward usage is to train the actor to copy the expert (Behavior Cloning), but this has problems:
- The experts only sample limited observations; the actor will not know what to do in other edge cases.
- The agent will copy every behavior, even irrelevant actions.
This is vanilla RL:
Inverse RL has the exact opposite direction of process. It first learns the reward function and then uses that reward function to find the optimal actor.
The principle is that the teacher is always the best. As a result, the basic algorithm is:
-
Initialize an actor.
-
In each iteration:
- The actor interacts with the environments to obtain some trajectories.
- Define a reward function, which makes the trajectories of the teacher better than the actor.
- The actor learns to maximize the reward based on the new reward function.
- Repeat.
-
Output the reward function and the actor learned from the reward function.
IRL is very similar to GAN:
We may need to deploy ML models in resource-constrained environments because of lower-latency and privacy concerns.
NN are typically over-parameterized -- there's significant redundant weights or neurons.
How to evaluate the importance of a weight? We can either look at its absolute value or at a learned importance score (like the $b_i$ used in life-long learning).
How to evaluate the importance of a neuron? We can record the number of times it outputs a non-zero value on a given dataset.
After pruning, the accuracy will drop (hopefully not too much). We can then fine-tune on training data for recover. Remember not to prune too much at once, or the NN won’t recover. We can do the whole process iteratively.
Weight pruning is not always a very effective method. The network architecture becomes irregular (no longer fully-connected), which makes it harder to implement and harder to speed up (if you simply set the pruned weights to $0$, you are not actually making the network smaller).
Neuron pruning is a better method as it helps the NN architecture stay regular. It's also easy to implement and easy to speedup.
How about simply train a smaller network?
It is widely known that smaller network is more difficult to learn successfully. It is easier to first train a large NN and then gradually reduce its size.
The Lottery Ticket Hypothesis states that a large network is like a collection of sub-networks, so we have a higher chance of training the network successfully.
There are two NNs: one large Teacher Net, one small Student Net.
The Student Net can hopefully learn the information that "1" is similar to "7" from the output of the Teacher Net. This is better than simply learning from the dataset labels. In practice, we may also use an ensemble NN to represent the Teacher Net.
We may add a temperature $T$ to the softmax when producing the teacher's outputs: $y_i' = \frac{\exp(y_i / T)}{\sum_j \exp(y_j / T)}$.
The temperature can help make sharp distribution a little bit smoother. This is critical because vanilla softmax can produce results very similar to the one-hot label in the dataset.
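A tiny numeric sketch of the effect of the temperature on softmax (the values are arbitrary):

```python
import numpy as np

def softmax_T(y, T=1.0):
    z = y / T
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

y = np.array([100.0, 10.0, 1.0])
print(softmax_T(y, T=1))      # ~[1, 0, 0]: practically identical to a one-hot label
print(softmax_T(y, T=100))    # ~[0.56, 0.23, 0.21]: smoother, keeps the similarity information
```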
We can use less bits to represent a value (lowering the numerical precision).
We can represent frequent clusters by less bits, represent rare clusters by more bits.
It's even possible to make your weights always $+1$ or $-1$ (binary weights).
Depthwise convolution considers the interaction within a specific channel.
The number of filters must be equal to the number of input channels (each filter only considers one channel), and the number of input channels equals the number of output channels. The filters are $k \times k \times 1$.
However, one problem is that there is no interaction between channels.
Pointwise convolution considers the interaction amongst channels.
The number of filters does not have to equal the number of input channels. However, we force the filter size to be $1 \times 1$.
Since the two smaller matrices have $K \times (N + M)$ parameters in total, which is less than the $M \times N$ parameters of $W$ when $K$ is small, we can approximate the matrix $W$ by the product of two low-rank matrices, at the cost of restricting the rank of $W$.
Depthwise Separable Convolution is very similar to this idea of using two layers to replace one layer in order to reduce the number of parameters.
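A small numpy sketch of the low-rank idea: approximating a weight matrix by the product of two thinner matrices and comparing parameter counts (the sizes and the use of a truncated SVD are illustrative):

```python
import numpy as np

M, N, K = 256, 512, 16
W = np.random.randn(M, N)                        # original layer: M*N parameters

# Best rank-K approximation via truncated SVD: W ~ U @ V.
U_full, s, Vt = np.linalg.svd(W, full_matrices=False)
U = U_full[:, :K] * s[:K]                        # M x K
V = Vt[:K, :]                                    # K x N

print("original parameters:", M * N)             # 131072
print("factorized parameters:", M * K + K * N)   # 12288
print("relative error:", np.linalg.norm(W - U @ V) / np.linalg.norm(W))
```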
Life-Long Learning (LLL) is also known as Continuous Learning, Never Ending Learning or Incremental Learning.
This problem is also evident in real-world applications:
One example of the problem LLL tries to tackle is:
The network has enough capacity to learn both tasks. However, it still failed.
It's not that the machine is not able to do it; it just didn't do it.
Even though multi-task training can solve this problem, this may lead to issues concerning computation and storage. For example, if we want to train the model on the dataset of the 1000-th task, we have to store the previous 999 tasks. However, multi-task training can be considered as the upper bound of LLL.
Training a model for each task is also a possible solution. However, we cannot store all the models obviously for storage reasons. More importantly, this way, knowledge cannot transfer across different tasks.
LLL is different from transfer learning. In transfer learning, we don't care about whether the model can still do the old task.
Here, $R_{i,j}$ denotes the accuracy on task $j$ after the model has been trained on tasks $1$ through $i$:
- If $i>j$: after training on task $i$, has the model forgotten task $j$?
- If $i<j$: after training on task $i$, can the model transfer some skill to task $j$?
We can define the overall accuracy to be $\frac{1}{T}\sum_{i=1}^{T} R_{T,i}$.
Alternatively, we can define the accuracy using a metric called Backward Transfer:
Since $R_{i,i}$ is usually larger than $R_{T,i}$ (the model forgets), Backward Transfer is usually negative.
We are also interested in the knowledge transfer. Therefore, there's another score metric called Forward Transfer:
This is a regularization-based approach.
Here is a clear visualization explaining why Catastrophic Forgetting would occur.
The basic idea of Selective Synaptic Plasticity is that some parameters in the model are important to the previous tasks. We shall only change the unimportant parameters.
We can revise the loss function that we want to minimize.
Each parameter $\theta_i^b$ learned from the previous tasks has a "guard" $b_i$ representing how important the parameter is to those tasks; the new loss is $L'(\boldsymbol{\theta}) = L(\boldsymbol{\theta}) + \lambda \sum_i b_i (\theta_i - \theta_i^b)^2$ (a small sketch of this loss follows the list below).
- If $b_i=0$, there is no constraint on $\theta_i$. This leads to Catastrophic Forgetting.
- If $b_i = \infty$, $\theta_i$ would always be equal to $\theta_i^b$. This prevents the model from learning anything from the new task, leading to Intransigence.
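A minimal PyTorch sketch of this regularized loss, assuming the guards $b_i$ and the previous-task parameters $\theta_i^b$ have already been obtained somehow (how $b_i$ is computed is what distinguishes methods such as EWC, SI, or MAS):

```python
import torch

def lll_loss(task_loss, params, old_params, guards, lam=1.0):
    """L'(theta) = L(theta) + lambda * sum_i b_i * (theta_i - theta_i^b)^2"""
    penalty = sum((b * (p - p_old) ** 2).sum()
                  for p, p_old, b in zip(params, old_params, guards))
    return task_loss + lam * penalty

# Toy usage with a single weight tensor and made-up guards.
theta = torch.nn.Parameter(torch.randn(5))
theta_b = torch.randn(5)                 # parameters learned from the previous tasks
b = torch.rand(5)                        # importance guard for each parameter
task_loss = (theta ** 2).sum()           # stand-in for the current task's loss
loss = lll_loss(task_loss, [theta], [theta_b], [b])
loss.backward()
print(theta.grad)
```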
We humans set the values of $b_i$ before training on the new task; they are not learned by gradient descent together with $\theta$. Different methods (e.g., EWC, SI, MAS) propose different ways of computing $b_i$.
There's also an older method called Gradient Episodic Memory (GEM):
The learning order also plays a role in whether Catastrophic Forgetting occurs. Thus we have Curriculum Learning:
We can define $\phi$, the learnable components of the learning algorithm itself (for example, the initialization, the architecture, or the update rule).
We also need to define a loss function $L(\phi)$ for the learning algorithm $F_{\phi}$; the loss is measured on a set of training tasks.
How do we evaluate the loss? We can evaluate the classifier on the testing set. Note that training examples (often called support set) and testing examples (often called query set) here are both in the training task.
The loss $l^n$ of training task $n$ is evaluated on that task's testing (query) examples, and the total loss of the learning algorithm is $L(\phi) = \sum_{n=1}^{N} l^n$.
Here, the loss of meta learning is different from ML:
Using the optimization approach, if you know how to compute the gradient $\frac{\partial L(\phi)}{\partial \phi}$, you can use gradient descent; if the loss is not differentiable with respect to $\phi$, you can use reinforcement learning or an evolutionary algorithm.
In typical ML, you compute the loss based on training examples. In meta learning, you compute the loss based on testing examples.
The general framework of meta learning looks like this:
Note that in the testing time, meta learning is different from vanilla ML as it uses Across-Task Testing. The process of one Within-Task Training and one Within-Task Testing is called one episode.
Across-Task Training is often called the outer loop and Within-Task Training is often called the inner loop.
Meta learning is often used to achieve one-shot learning -- we only need a few examples to train a network if the learning algorithm is good enough.
Machine learning is about finding a function $f$; meta learning is about finding a function $F$ that finds the function $f$.
What you know about ML can usually apply to meta learning:
- Overfitting on training tasks
- Get more training tasks to improve performance
- Task augmentation
- There are also hyperparameters when learning a learning algorithm (But hopefully, we can obtain an optimal learning algorithm that can be applied on future tasks)
- Development tasks, alongside training tasks and testing tasks (just like development set used in ML)
Recall the gradient descent algorithm:
We can learn the initialization parameters $\boldsymbol{\theta}^0$ (as in MAML).