<!doctype html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">
<meta name="author" content="Emre Neftci">
<title>Neural Networks and Machine Learning</title>
<link rel="stylesheet" href="css/reset.css">
<link rel="stylesheet" href="css/reveal.css">
<link rel="stylesheet" href="css/theme/nmilab.css">
<link rel="stylesheet" type="text/css" href="https://cdn.rawgit.com/dreampulse/computer-modern-web-font/master/fonts.css">
<link rel="stylesheet" type="text/css" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.5.0/css/font-awesome.min.css">
<link rel="stylesheet" type="text/css" href="lib/css/monokai.css">
<!-- Printing and PDF exports -->
<script>
var link = document.createElement( 'link' );
link.rel = 'stylesheet';
link.type = 'text/css';
link.href = window.location.search.match( /print-pdf/gi ) ? 'css/print/pdf.css' : 'css/print/paper.css';
document.getElementsByTagName( 'head' )[0].appendChild( link );
</script>
<script async defer src="https://buttons.github.io/buttons.js"></script>
</head>
<body>
<div class="reveal">
<div class="slides">
<!--
<section data-markdown><textarea data-template>
<h2> </h2>
<ul>
<li/>
</ul>
</textarea></section>
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/surrogate-gradient-learning/pytorch-lif-autograd/blob/master/tutorial01_dcll_localerrors.ipynb)
-->
<section data-markdown data-vertical-align-top data-background-color=#B2BA67><textarea data-template>
<h1> Neural Networks and Machine Learning </h1>
<h2> Week 9: Attention and Language Processing</h2>
### Instructors: Emre Neftci and Sameer Singh
<center>https://canvas.eee.uci.edu/courses/21750</center>
<center>http://tinyurl.com/nmi-lab-appointments</center>
[![Print](pli/printer.svg)](?print-pdf)
</textarea>
</section>
<section data-markdown><textarea data-template>
<h2> Last Goody of the Quarter</h2>
<ul>
<li/> Load and save models
<pre><code class="Python" data-trim data-noescape>
# Save network parameters
torch.save(network.state_dict(), 'network.torch')
# Load network parameters back into an existing model instance
network.load_state_dict(torch.load('network.torch'))
</code></pre>
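<li class=fragment /> Note: <code>load_state_dict</code> expects an already constructed model with the same architecture; only the parameters (and buffers) are stored in the file.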
</ul>
</textarea></section>
<section data-markdown><textarea data-template>
<h2> Recurrent Neural Networks in Deep Learning </h2>
<img src="images/RNN-unrolled.png" />
<ul>
<li/> RNNs can be unfolded to form a deep neural network
<li/> The depth along the unfolded dimension is equal to the number of time steps.
<li/> An output can be produced at some or all time steps.
<li/> Depending on the output structure, different problems can be solved (a minimal unrolling sketch follows below)
</ul>
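<p class=fragment>A minimal sketch of this unfolding for a plain tanh RNN (dimensions and weights below are illustrative, not from a trained model):</p>
<pre><code class="Python" data-trim data-noescape>
import torch
# toy dimensions: T time steps, batch size, input and hidden sizes (illustrative)
T, batch, n_in, n_hid = 5, 4, 3, 8
Wx = torch.randn(n_in, n_hid) * 0.1    # input-to-hidden weights
Wh = torch.randn(n_hid, n_hid) * 0.1   # hidden-to-hidden weights
x = torch.randn(T, batch, n_in)        # input sequence
h = torch.zeros(batch, n_hid)          # initial hidden state
outputs = []
for t in range(T):                     # unfolding: one "layer" per time step
    h = torch.tanh(x[t] @ Wx + h @ Wh)
    outputs.append(h)                  # an output can be read out at every step
</code></pre>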
</textarea></section>
<section data-markdown><textarea data-template>
<h2> Example Tasks</h2>
<img src="images/rnn_tasks.jpeg" />
<ul>
<li class="fragment"/> Today: Sequence-to-sequence models for text-like data
</ul>
</textarea></section>
<section data-markdown><textarea data-template>
<h2> Word embeddings</h2>
<ul>
<li /> Words can be represented as <i>integers</i>, one for each word in a vocabulary
<pre>
Let's represent this sequence
23 450 27 124
</pre>
<li /> The vocabulary size is the number of words we can represent. A typical size is 30,000.
<li class=fragment /> A word is then one category among 30,000 possibilities, i.e. a very sparse space that is impractical to work in directly.
<li class=fragment /> Word embeddings map each word into a real vector space of smaller dimension (typically $\mathbb{R}^{256}$).
<pre>
Let's represent this sequence
0.3 1.3 0.1 1.3
-0.5 2.3 5.5 -2.5
... ... ... ...
1.6 0.0 -3.2 8.2
</pre>
<li class=fragment /> Various techniques exist to learn this mapping (see the second part of this class); for now, let's assume it exists (a minimal sketch follows below)
</ul>
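<p class=fragment>A minimal sketch with <code>torch.nn.Embedding</code>, using the illustrative vocabulary size and word indices from this slide:</p>
<pre><code class="Python" data-trim data-noescape>
import torch
import torch.nn as nn
vocab_size, embed_dim = 30000, 256
embedding = nn.Embedding(vocab_size, embed_dim)   # the (learnable) mapping
tokens = torch.tensor([23, 450, 27, 124])         # "Let's represent this sequence"
vectors = embedding(tokens)                       # shape (4, 256): one real vector per word
</code></pre>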
</textarea></section>
<section data-markdown><textarea data-template>
<h2> Sequence-to-sequence Task</h2>
<ul>
<li/> Embed words:
</ul>
<img src="images/seq2seq_step0.svg" class=large />
<p class=ref><a href=https://arxiv.org/pdf/1406.1078v3.pdf> Cho et al. 2014 </a></p>
</textarea></section>
<section data-markdown><textarea data-template>
<h2> Sequence-to-sequence Task</h2>
<ul>
<li/> Translate English to German, for example:
</ul>
<img src="images/seq2seq_step1.svg" class=large />
</textarea></section>
<section data-markdown><textarea data-template>
<h2> Sequence-to-sequence Task</h2>
<ul>
<li/> Encode the sequence, <i>e.g.</i> using an LSTM (see the sketch below):
</ul>
<img src="images/seq2seq_step2.svg" class=large />
<img src="images/lstm.svg" class=small />
</textarea></section>
<section data-markdown><textarea data-template>
<h2> Sequence-to-sequence Task</h2>
<ul>
<li/> Decode the sentence (see the sketch below)
</ul>
<img src="images/seq2seq.svg" class=large />
<img src="images/lstm.svg" class=small />
</textarea></section>
<section data-markdown><textarea data-template>
<h2> Sequence-to-sequence Task</h2>
<ul>
<li/> There is a problem in this network
</ul>
<img src="images/seq2seq.svg" class=large />
</textarea></section>
<section data-markdown><textarea data-template>
<h2> The Vanishing Gradients Problem</h2>
<p>Remember the temporal credit assignment problem</p>
<ul>
<li /> Short-term dependencies
<img src="images/RNN-shorttermdepdencies.png" class=small />
<li /> Long-term dependencies
<img src="images/RNN-longtermdependencies.png" class=small />
<p class=ref>https://colah.github.io/posts/2015-08-Understanding-LSTMs/</p>
</ul>
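<p class=fragment>A small sketch of the effect on a plain tanh RNN (random, untrained weights): the gradient reaching distant inputs is typically orders of magnitude smaller than the gradient reaching recent ones.</p>
<pre><code class="Python" data-trim data-noescape>
import torch
T, n = 50, 16                              # sequence length and hidden size (illustrative)
Wx = torch.randn(n, n) * 0.1
Wh = torch.randn(n, n) * 0.1
x = torch.randn(T, n, requires_grad=True)
h = torch.zeros(n)
for t in range(T):                         # plain tanh RNN
    h = torch.tanh(x[t] @ Wx + h @ Wh)
h.sum().backward()
print(x.grad[-1].norm(), x.grad[0].norm()) # the second is typically vanishingly small
</code></pre>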
</textarea></section>
<section data-markdown><textarea data-template>
<h2> Sequence-to-sequence Task</h2>
<ul>
<li/> In addition to the vanishing gradient problem, the final encoded vector must encode the entire sentence
<li /> While this is OK for classification (and even desirable), it is not appropriate for generating sequences.
</ul>
<img src="images/seq2seq_bottleneck.svg" class=large />
</textarea></section>
<section data-markdown><textarea data-template>
<h2> Long-Short Term Memory</h2>
<img src="images/LSTM3-chain.png" />
<p class=ref>https://colah.github.io/posts/2015-08-Understanding-LSTMs/</p>
<ul>
<li />The top horizontal line is the memory state, $C_t$
<li/> Let's go over the steps one by one (a one-step sketch follows below)
</ul>
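<p class=fragment>A minimal one-step sketch with <code>torch.nn.LSTMCell</code>, which implements these gates and carries the memory state $C_t$ alongside the hidden state $h_t$ (dimensions illustrative):</p>
<pre><code class="Python" data-trim data-noescape>
import torch
import torch.nn as nn
n_in, n_hid = 256, 512
cell = nn.LSTMCell(n_in, n_hid)
x_t = torch.randn(1, n_in)        # input at time t (batch of 1)
h_t = torch.zeros(1, n_hid)       # hidden state
C_t = torch.zeros(1, n_hid)       # memory state: the top horizontal line
h_t, C_t = cell(x_t, (h_t, C_t))  # one step: forget, input and output gates applied inside
</code></pre>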
</textarea></section>
<section data-markdown><textarea data-template>
<h1> Sequence-to-Sequence Models with Attention</h1>
</textarea></section>
<section data-markdown><textarea data-template>
<h2>Attention: Motivation </h2>
<ul>
<li /> Provides a solution to the bottleneck problem.
<li class=fragment /> Allows the decoder to have direct access to all the hidden states of the encoder.
<li class=fragment /> At each step, the decoder can focus on the part of the input sentence that is most relevant.
<li class=fragment /> Attention mechanisms are internal transformations with trainable parameters: <em> they can be learned jointly with the rest of the model </em>
</ul>
</textarea></section>
<section data-markdown data-transition="fade-in"><textarea data-template>
<h2> Sequence-to-sequence Task With Attention</h2>
<ul>
<li/> Generate 1st decoder hidden vector as before.
<img src="images/attention_singh_00.svg" class=large />
</ul>
</textarea></section>
<section data-markdown data-transition="fade-in"><textarea data-template>
<h2> Sequence-to-sequence Task</h2>
<ul>
<li/> Calculate a similarity score between the decoder hidden vector and all the encoder hidden vectors.
<img src="images/attention_singh_01.svg" class=large />
<li /> Encoder hidden states $\mathbf{h} = h_1, ..., h_N \in \mathbb{R}^h$
</ul>
</textarea></section>
<section data-markdown data-transition="fade-in"><textarea data-template>
<h2> Sequence-to-sequence Task</h2>
<ul>
<li/> Calculate a similarity score between the decoder hidden vector and all the encoder hidden vectors.
<img src="images/attention_singh_02.svg" class=large />
<li /> Decoder states $s_t \in \mathbb{R}^h$
<li /> Similarity scores (alignment) $e_t = [s_t^\top h_1, ..., s_t^\top h_N] \in \mathbb{R}^N$
</ul>
</textarea></section>
<section data-markdown data-transition="fade-in"><textarea data-template>
<h2> Sequence-to-sequence Task</h2>
<ul>
<li/> Pass similarity score vector through a softmax to generate attention weights.
<img src="images/attention_singh_03.svg" class=large />
<li /> Attention weights $\alpha_t = softmax(e_t) \in \mathbb{R}^N$
</ul>
</textarea></section>
<section data-markdown data-transition="fade-in"><textarea data-template>
<h2> Sequence-to-sequence Task</h2>
<ul>
<li/> Use the attention weights to generate the context vector by taking a weighted sum of the encoder hidden vectors.
<img src="images/attention_singh_04.svg" class=large />
<li /> The context vector is a weighted sum of the encoder states: $a_t = \sum_{i=1}^N \alpha_{t,i} h_i \in \mathbb{R}^h$ (sketched below)
</ul>
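<p class=fragment>A minimal sketch of these three steps for one decoder step, using dot-product attention; the encoder states $\mathbf{h}$ and the decoder state $s_t$ are random placeholders here:</p>
<pre><code class="Python" data-trim data-noescape>
import torch
import torch.nn.functional as F
N, h_dim = 6, 512                       # N source words, hidden size h
H = torch.randn(N, h_dim)               # encoder hidden states h_1 ... h_N
s_t = torch.randn(h_dim)                # current decoder state
e_t = H @ s_t                           # similarity scores, shape (N,)
alpha_t = F.softmax(e_t, dim=0)         # attention weights, sum to 1
a_t = alpha_t @ H                       # context vector, shape (h,)
</code></pre>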
</textarea></section>
<section data-markdown data-transition="fade-in"><textarea data-template>
<h2> Sequence-to-sequence Task</h2>
<ul>
<li />Concatenate the context vector to the decoder hidden vector and generate the first word.
<img src="images/attention_singh_06.svg" class=large />
</ul>
</textarea></section>
<section data-markdown data-transition="fade-in"><textarea data-template>
<h2> Sequence-to-sequence Task</h2>
<ul>
<li />And proceed as before
<img src="images/attention_singh_07.svg" class=large />
</ul>
</textarea></section>
<section data-markdown data-transition="fade-in"><textarea data-template>
<h2> Sequence-to-sequence Task</h2>
<ul>
<li />Translation is complete when the END token is generated
<img src="images/attention_singh_08.svg" class=large />
<li /> Concatenate $a_t$ and decoder state $s_t$ together and train as in the non-attention model
</ul>
</textarea></section>
<section data-markdown data-transition="fade-in"><textarea data-template>
<h2> Attention: Interpretation </h2>
<ul>
<li /> Learned alignments: each pixel shows the weight $\alpha_{ij}$ of the $i$-th source word for the $j$-th target word
<img src="images/alignment.png" class=large />
<p class=ref>Bahdanau et al. 2015</p>
</ul>
</textarea></section>
<section data-markdown><textarea data-template>
<h2> Attention: A more general definition</h2>
<ul>
<li/>Definition: Given a set of <b>value vectors</b> (encoded state $h_t$) and a <b>query vector</b> (decoded state, $s_t$), attention is a technique to compute a weighted sum of the values, dependent on the query, using an attention function.
<li/> The weighted sum is a selective summary of the information contained in the values, where the query determines which values to focus on.
<li />There are a number of attention functions that are commonly used (see the sketch after this list).
<ul>
<li /> Dot product attention: $$e_{ij} = \mathbf{s}^\top_j \mathbf{h}_i \in \mathbb{R}$$
<li /> Luong et al. 2015: $$e_{ij} = \mathbf{s}^\top_j W \mathbf{h}_i \in \mathbb{R}$$
<li /> Bahdanau et al. 2015: $$e_{ij} = V^\top \tanh(W_s \mathbf{s}_j + W_h \mathbf{h}_i) \in \mathbb{R}$$
</ul>
</ul>
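<p class=fragment>A minimal sketch of the three score functions for a single pair $(\mathbf{s}_j, \mathbf{h}_i)$; the parameters $W$, $W_s$, $W_h$ and $V$ are random here but would be trained with the model:</p>
<pre><code class="Python" data-trim data-noescape>
import torch
d = 512
s_j, h_i = torch.randn(d), torch.randn(d)
W = torch.randn(d, d)                               # Luong et al. 2015
W_s, W_h, V = torch.randn(d, d), torch.randn(d, d), torch.randn(d)  # Bahdanau et al. 2015
e_dot = s_j @ h_i                                   # dot-product attention
e_luong = s_j @ W @ h_i                             # bilinear attention
e_bahdanau = V @ torch.tanh(W_s @ s_j + W_h @ h_i)  # additive attention
</code></pre>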
</textarea></section>
<section data-markdown><textarea data-template>
<h2> Stacked Attention </h2>
<img src="images/stacked_attention.svg" class=large />
</textarea></section>
<section data-markdown><textarea data-template>
<h2> Seq2Seq with Attention Tutorial (PyTorch) </h2>
<img src="https://pytorch.org/tutorials/_images/seq2seq.png" class=large />
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://drive.google.com/open?id=1tThcNewnf1ChK36luba1IeTgxXnLhxgL)
</textarea></section>
</div>
</div>
<!-- End of slides -->
<script src="../reveal.js/lib/js/head.min.js"></script>
<script src="js/reveal.js"></script>
<script>
Reveal.configure({ pdfMaxPagesPerSlide: 1, hash: true, slideNumber: true})
Reveal.initialize({
mouseWheel: false,
width: 1280,
height: 720,
margin: 0.0,
navigationMode: 'grid',
transition: 'slide',
menu: { // Menu works best with font-awesome installed: sudo apt-get install fonts-font-awesome
themes: false,
transitions: false,
markers: true,
hideMissingTitles: true,
custom: [
{ title: 'Plugins', icon: '<i class="fa fa-external-link-alt"></i>', src: 'toc.html' },
{ title: 'About', icon: '<i class="fa fa-info"></i>', src: 'about.html' }
]
},
math: {
mathjax: 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js',
config: 'TeX-AMS_HTML-full', // See http://docs.mathjax.org/en/latest/config-files.html
// pass other options into `MathJax.Hub.Config()`
TeX: { Macros: { Dp: ["\\frac{\\partial #1}{\\partial #2}",2] }}
//TeX: { Macros: { Dp#1#2: },
},
chalkboard: {
src:'slides_2-chalkboard.json',
penWidth : 1.0,
chalkWidth : 1.5,
chalkEffect : .5,
readOnly: false,
toggleChalkboardButton: { left: "80px" },
toggleNotesButton: { left: "130px" },
transition: 100,
theme: "whiteboard",
},
menu : { titleSelector: 'h1', hideMissingTitles: true,},
keyboard: {
67: function() { RevealChalkboard.toggleNotesCanvas() }, // toggle notes canvas when 'c' is pressed
66: function() { RevealChalkboard.toggleChalkboard() }, // toggle chalkboard when 'b' is pressed
46: function() { RevealChalkboard.reset() }, // reset chalkboard data on current slide when 'BACKSPACE' is pressed
68: function() { RevealChalkboard.download() }, // downlad recorded chalkboard drawing when 'd' is pressed
88: function() { RevealChalkboard.colorNext() }, // cycle colors forward when 'x' is pressed
89: function() { RevealChalkboard.colorPrev() }, // cycle colors backward when 'y' is pressed
},
dependencies: [
{ src: 'node_modules/reveald3/reveald3.js' },
{ src: '../reveal.js/lib/js/classList.js', condition: function() { return !document.body.classList; } },
{ src: 'plugin/markdown/marked.js' },
{ src: 'plugin/markdown/markdown.js' },
{ src: 'plugin/notes/notes.js', async: true },
{ src: 'plugin/highlight/highlight.js', async: true, languages: ["Python"] },
{ src: 'plugin/math/math.js', async: true },
{ src: 'external-plugins/chalkboard/chalkboard.js' },
//{ src: 'external-plugins/menu/menu.js'},
{ src: 'node_modules/reveal.js-menu/menu.js' }
]
});
</script>
</body>
</html>
<!-- Instructor notes:
Three types of recurrent modules, by task:
- Classification: an encoder module whose single final state encodes the entire sentence, which is well suited for classification.
Word embeddings: the "tezguino" example (a word's meaning can be inferred from the contexts it appears in).
Negative sampling is used to overcome the cost of the softmax normalization; no negative sampling is needed for the sequence model.
Explain in #3 why we use log probabilities.
-->