-
Notifications
You must be signed in to change notification settings - Fork 0
/
Neural_Network_Notes.txt
204 lines (159 loc) · 9.86 KB
/
Neural_Network_Notes.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
Basic definitions:
***********************************************************************************************************
Neurons: each neuron receives input from other neurons
-Effect of input line on neuron is controlled by synaptic weight (+ or -)
-Weights adapt so network can learn computations (recognizing objects, language, controlling body)
(Goal in training neural network is to find weights and biases which minimize cost function)
Sigmoid Neurons: Similar to perceptrons, but small changes in weight and biases result in small changes
in output (more fine-tuned)
-Inputs can take any values between 0 and 1, as opposed to 0 and 1 for perceptron
-Stop learning when they become saturated (close to 0 or 1)
Feed-forward neural networks: most common type of neural network
-First layer is input; last layer is output
-If there is more than one hidden layer, called "deep" neural networks
-Compute series of transformations that change similarities between classes
-Activities in each layer are non-linear function of activities in layer below
Statistical pattern recognition:
1. Convert raw input vector into vector of feature activations
-Hand written programs based on common sense used to define features
2. Learn how to weight each of the feature activations to get single scalar quantity
3. If quantity is above some threshold, decide that input vector is positive example of target
Linear neurons:
-Neurons have real-valued output which is weighted sum of its inputs
-Aim of learning is to minimiz error summed over all training cases
-Error is squared difference between desired output and actual output
Logistic neurons:
-Give real-values output that is smooth and bounded function of total input
Learning rate: how easily weights change among the neurons
-If learning is going efficiently, higher learning rate is better
-If learning is going slowly, lower learning rate is better
Mini-batch training: use small samples of training data to slowly reduce error
Sampling error: incorrect patterns may be identified if the training set is too small
Replicated feature detectors: result in equivariant neural activities, invariance in weights
Weight initialization: limit possible starting weights for hidden units, helping to speed up learning process
Softmax: extra output layer that outputs probablility distribution among input units
Validation data: type of training data that helps us learn good hyper-parameters
Weights: coefficient that represents strength between neurons; ultimately what the number stands for is not completely interpretable
-Also represent importance of certain outputs to certain inputs
Bias: How easy it is to get unit to output 1 (threshold)
-Very positive, easy to get 1; very negative, hard to get 1
Gradient of cost function represents vector of derivatives of (cost) / (bias & weight values)
Gradient descent: technique used to minimize quadratic cost function
-Repeatedly compute gradient of cost function, then move in opposite direction, down into slope of valley towards
global minimum in cost (error)
Stochastic gradient descent: estimates gradient of descent for small sample of randomly chosen training examples (works quite well)
Activation: output of specified neuron
-Each neuron has multiple input activations, but only one output activation
-Computed by multiplying input value and weight, adding the bias (this results in the weighted sum of inputs), then
the weighted sum is put into function of unit
One of best ways to reduce overfitting is to increase size of training data
-Tanh neurons:
-Similar to sigmoid neurons, but output ranges from -1 to 1 instead of 0 to 1
-Possibly better than sigmoid??
Activation functions: the specific functions within the neurons that produce output of the neurons
Reinforcement learning: type of learning meant to maximize a specific reward, based on outside interactions
***********************************************************************************************************
Backpropagation algorithm:
***************************
-Fast algorithm for computing gradient of cost function
-OTHER DEF: a clever way of keeping track of small perturbations to the weights (and biases) as they
propogate through the network, reach the output, and then affect the cost (error)
-Starts at output layer, works backwords through hidden layers
-Calculates the error derivative by comparing error derivative of neuron in next layer with and without
output of current neuron
-After calculating error for each desired output on each unit, process is repeated until network is
considered trained
-Hadamard product: multiplicatiton between same positioned elements in two vectors of same dimension
-Calculates error separately from output activation (helps algebraically)
-Weight in the final layer will learn slowly if the output neuron is close to 0 or 1
***************************
Dropout:
********
1. Randomly deletes half of the hidden neurons
2. Forward/back-propogate x through modified network, update weights and biases
3. Repeat process, first restoring dropout neurons, then choosing new random subset of neurons to drop
-To compensate for less neurons, halve weights outgoing from hidden neurons
-Simulates averaging effects of large number of different networks
-Good to use in training deep networks where overfitting in common
********
Regularization:
***************
-Techniques that can reduce overfitting with a fixed network and fixed training data
-Method of compromising between finding small weights and minimizing original cost function
-L2 regularization: add extra term (regularization term) to the cost function
-Keeps weights low so that noise won't affect weights as much; in order for weights to change
significantly, the pattern must be relevant across large amounts of data
-Can change value of extra term using lmda in example code
-No need to regularize bias, as large biases are sometimes desirable
***************
Cross-entropy:
**************
-Measure of cost that tends toward zero as neuron gets better at computing desired output
-Unlike quadratic cost, avoids problem of learning slowing down
-The larger the cross-entropy error, the faster the neuron will learn
-Better than quadratic cost in most cases, as initial weights and bias are usually random
**************
Hyper-Parameter optimization:
*****************************
-Strip down number of expected outputs (greatly reduces amount of training data, increases speed for testing params)
-Decrease number of layers (increases speed)
-As you figure out better params, slowly re-introduce all elements of network
-Learning rate:
-Estimate threshold for LR at which cost on training data immediately begins decreasing (start with .01, try larger
values until one oscillates)
-Choose initial LR smaller than threshold by about factor of two
-Learning schedule: similar to early stopping, but instead decrease by factor of two or ten when learning worsens
-Repeat until learning rate is factor 1000ish times lower than initial value
-Early experimentation use constant learning rate
-Regularization parameter (lmda):
-Start with no regularization, determine value for LR
-Using that value of LR, use validation data to select good value for lmda
-Start w/ 1, then increase or decrease by factors of 10
-Early stopping:
-Compute classification acc. on validation data at end of each epoch
-When acc. stops improving, terminate
-Early stages of experimentation might be better to turn off early stopping to see overfitting
-Use no-improvement-in-ten rule initially, slowly become more lenient
-Mini-batch size:
-Too small, don't get to take full advantage of benefits of good matrix libraries
-Too large, not updating weights often enough
-Use acceptable values for other hyper-parameters, then trial different mini-batch sizes
-Choose which size gives most rapid improvement in performance
-Automatically choosing parameters:
-Random search for hyper-parameter optimization by Bergstra and Bengio
-Manual parameter selection is not worse than grid search
-Always carefully monitor validation accuracy when tuning parameters
*****************************
Universality:
*************
-A neural network can compute any function
-Not exactly compute, but can come close (the more hidden neurons the closer we get)
-Only continuous functions can be truely computed
-Empirical evidence suggests that deep networks are the networks best adapted to learn the functions in
solving many real-world problems
*************
Deep neural networks:
*********************
-In at least some deep neural networks, gradient tends to get smaller as we move backwards thru hidden layers
-Vanishing gradient problem
-Gradient in deep neural networks is unstable (tends to explode or vanish in earlier layers)
*********************
Convolutional neural networks:
******************************
-Networks that use special architecture that specializes in classifying images
-Local receptive field: small window on input pixels of picture
-Each hidden neuron learns to analyze its particular receptive field
-Stride length: how far apart each receptive field is
-Shared weights/bias: all neurons in first hidden layer detect same feature, but in different locations in image
-Feature map: map from input layer to hidden layer
-Pooling layers: simplify information in the output from the convolutional layer
******************************
Recurrent neural networks: (https://deeplearning4j.org/lstm.html#recurrent)
**************************
-Networks in which there is some notion of dynamic change over time
-Neurons take into account what they learned one step back in time
-Neurons have two inputs: present and recent past, combined to determine how they respond to new data
-Long short-term memory units (LSTMs): help address unstable gradient in RNNs
-Useful for language recognition because language is presented in sequence, requiring network to remember
previous aspects of sentence
**************************