<!doctype html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">
<meta name="author" content="Emre Neftci">
<title>Neural Networks and Machine Learning</title>
<link rel="stylesheet" href="css/reset.css">
<link rel="stylesheet" href="css/reveal.css">
<link rel="stylesheet" href="css/theme/nmilab.css">
<link rel="stylesheet" type="text/css" href="https://cdn.rawgit.com/dreampulse/computer-modern-web-font/master/fonts.css">
<link rel="stylesheet" type="text/css" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.5.0/css/font-awesome.min.css">
<link rel="stylesheet" type="text/css" href="lib/css/monokai.css">
<!-- Printing and PDF exports -->
<script>
var link = document.createElement( 'link' );
link.rel = 'stylesheet';
link.type = 'text/css';
link.href = window.location.search.match( /print-pdf/gi ) ? 'css/print/pdf.css' : 'css/print/paper.css';
document.getElementsByTagName( 'head' )[0].appendChild( link );
</script>
<script async defer src="https://buttons.github.io/buttons.js"></script>
</head>
<body>
<div class="reveal">
<div class="slides">
<!--
<section data-markdown><textarea data-template>
<h2> </h2>
<ul>
<li/>
</ul>
</textarea></section>
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/surrogate-gradient-learning/pytorch-lif-autograd/blob/master/tutorial01_dcll_localerrors.ipynb)
-->
<section data-markdown data-vertical-align-top data-background-color=#B2BA67><textarea data-template>
<h1> Neural Networks and Machine Learning </h1>
<h2> Week 9: Attention and Language Processing</h2>
### Instructors: Emre Neftci and Sameer Singh
<center>https://canvas.eee.uci.edu/courses/21750</center>
<center>http://tinyurl.com/nmi-lab-appointments</center>
[![Print](pli/printer.svg)](?print-pdf)
</textarea>
</section>
<section data-markdown><textarea data-template>
<h2> Last Goody of the Quarter</h2>
<ul>
<li/> Load and save models
<pre><code class="Python" data-trim data-noescape>
# Save network parameters
torch.save(network.state_dict(), 'network.torch')
# Load network parameters back into an existing model instance
network.load_state_dict(torch.load('network.torch'))
</code></pre>
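<li class=fragment /> Note: <code>load_state_dict</code> expects an already constructed model with the same architecture; only the parameters (and buffers) are stored in the file.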
</ul>
</textarea></section>
<section data-markdown><textarea data-template>
<h2> Recurrent Neural Networks in Deep Learning </h2>
<img src="images/RNN-unrolled.png" />
<ul>
<li/> RNNs can be unfolded to form a deep neural network
<li/> The depth along the unfolded dimension is equal to the number of time steps.
<li/> An output can be produced at some or all time steps.
<li/> Depending on the output structure, different problems can be solved (a minimal unrolling sketch follows below)
</ul>
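<p class=fragment>A minimal sketch of this unfolding for a plain tanh RNN (dimensions and weights below are illustrative, not from a trained model):</p>
<pre><code class="Python" data-trim data-noescape>
import torch
# toy dimensions: T time steps, batch size, input and hidden sizes (illustrative)
T, batch, n_in, n_hid = 5, 4, 3, 8
Wx = torch.randn(n_in, n_hid) * 0.1    # input-to-hidden weights
Wh = torch.randn(n_hid, n_hid) * 0.1   # hidden-to-hidden weights
x = torch.randn(T, batch, n_in)        # input sequence
h = torch.zeros(batch, n_hid)          # initial hidden state
outputs = []
for t in range(T):                     # unfolding: one "layer" per time step
    h = torch.tanh(x[t] @ Wx + h @ Wh)
    outputs.append(h)                  # an output can be read out at every step
</code></pre>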
</textarea></section>
<section data-markdown><textarea data-template>
<h2> Example Tasks</h2>
<img src="images/rnn_tasks.jpeg" />
<ul>
<li class="fragment"/> Today: Sequence-to-sequence models for text-like data
</ul>
</textarea></section>
<section data-markdown><textarea data-template>
<h2> Word embeddings</h2>
<ul>
<li /> Words can be represented as <i>integers</i>, one for each word in a vocabulary
<pre>
Let's represent this sequence
23 450 27 124
</pre>
<li /> The vocabulary size is the number of words we can represent. A typical size is 30,000.
<li class=fragment /> A word is then one category among 30,000 possibilities, i.e. a very sparse space that is impractical to work in directly.
<li class=fragment /> Word embeddings map each word into a real vector space of smaller dimension (typically $\mathbb{R}^{256}$).
<pre>
Let's represent this sequence
0.3 1.3 0.1 1.3
-0.5 2.3 5.5 -2.5
... ... ... ...
1.6 0.0 -3.2 8.2
</pre>
<li class=fragment /> Various techniques exist to learn this mapping (see the second part of this class); for now, let's assume it exists (a minimal sketch follows below)
</ul>
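<p class=fragment>A minimal sketch with <code>torch.nn.Embedding</code>, using the illustrative vocabulary size and word indices from this slide:</p>
<pre><code class="Python" data-trim data-noescape>
import torch
import torch.nn as nn
vocab_size, embed_dim = 30000, 256
embedding = nn.Embedding(vocab_size, embed_dim)   # the (learnable) mapping
tokens = torch.tensor([23, 450, 27, 124])         # "Let's represent this sequence"
vectors = embedding(tokens)                       # shape (4, 256): one real vector per word
</code></pre>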
</textarea></section>
<section data-markdown><textarea data-template>
<h2> Sequence-to-sequence Task</h2>
<ul>
<li/> Embed words:
</ul>
<img src="images/seq2seq_step0.svg" class=large />
<p class=ref><a href=https://arxiv.org/pdf/1406.1078v3.pdf> Cho et al. 2014 </a></p>
</textarea></section>
<section data-markdown><textarea data-template>
<h2> Sequence-to-sequence Task</h2>
<ul>
<li/> Translate English to German, for example:
</ul>
<img src="images/seq2seq_step1.svg" class=large />
</textarea></section>
<section data-markdown><textarea data-template>
<h2> Sequence-to-sequence Task</h2>
<ul>
<li/> Encode the sequence, <i>e.g.</i> using an LSTM (see the sketch below):
</ul>
<img src="images/seq2seq_step2.svg" class=large />
<img src="images/lstm.svg" class=small />
</textarea></section>
<section data-markdown><textarea data-template>
<h2> Sequence-to-sequence Task</h2>
<ul>
<li/> Decode the sentence (see the sketch below)
</ul>
<img src="images/seq2seq.svg" class=large />
<img src="images/lstm.svg" class=small />
</textarea></section>
<section data-markdown><textarea data-template>
<h2> Sequence-to-sequence Task</h2>
<ul>
<li/> There is a problem in this network
</ul>
<img src="images/seq2seq.svg" class=large />
</textarea></section>
<section data-markdown><textarea data-template>
<h2> The Vanishing Gradients Problem</h2>
<p>Remember the temporal credit assignment problem</p>
<ul>
<li /> Short-term dependencies
<img src="images/RNN-shorttermdepdencies.png" class=small />
<li /> Long-term dependencies
<img src="images/RNN-longtermdependencies.png" class=small />
<p class=ref>https://colah.github.io/posts/2015-08-Understanding-LSTMs/</p>
</ul>
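<p class=fragment>A small sketch of the effect on a plain tanh RNN (random, untrained weights): the gradient reaching distant inputs is typically orders of magnitude smaller than the gradient reaching recent ones.</p>
<pre><code class="Python" data-trim data-noescape>
import torch
T, n = 50, 16                              # sequence length and hidden size (illustrative)
Wx = torch.randn(n, n) * 0.1
Wh = torch.randn(n, n) * 0.1
x = torch.randn(T, n, requires_grad=True)
h = torch.zeros(n)
for t in range(T):                         # plain tanh RNN
    h = torch.tanh(x[t] @ Wx + h @ Wh)
h.sum().backward()
print(x.grad[-1].norm(), x.grad[0].norm()) # the second is typically vanishingly small
</code></pre>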
</textarea></section>
<section data-markdown><textarea data-template>
<h2> Sequence-to-sequence Task</h2>
<ul>
<li/> In addition to the vanishing gradient problem, the final encoded vector must encode the entire sentence
<li /> While this is OK for classification (and even desirable), it is not appropriate for generating sequences.
</ul>
<img src="images/seq2seq_bottleneck.svg" class=large />
</textarea></section>
<section data-markdown><textarea data-template>
<h2> Long-Short Term Memory</h2>
<img src="images/LSTM3-chain.png" />
<p class=ref>https://colah.github.io/posts/2015-08-Understanding-LSTMs/</p>
<ul>
<li />The top horizontal line is the memory state, $C_t$
<li/> Let's go over the steps one by one (a one-step sketch follows below)
</ul>
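<p class=fragment>A minimal one-step sketch with <code>torch.nn.LSTMCell</code>, which implements these gates and carries the memory state $C_t$ alongside the hidden state $h_t$ (dimensions illustrative):</p>
<pre><code class="Python" data-trim data-noescape>
import torch
import torch.nn as nn
n_in, n_hid = 256, 512
cell = nn.LSTMCell(n_in, n_hid)
x_t = torch.randn(1, n_in)        # input at time t (batch of 1)
h_t = torch.zeros(1, n_hid)       # hidden state
C_t = torch.zeros(1, n_hid)       # memory state: the top horizontal line
h_t, C_t = cell(x_t, (h_t, C_t))  # one step: forget, input and output gates applied inside
</code></pre>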
</textarea></section>
<section data-markdown><textarea data-template>
<h1> Sequence-to-Sequence Models with Attention</h1>
</textarea></section>
<section data-markdown><textarea data-template>
<h2>Attention: Motivation </h2>
<ul>
<li /> Provides a solution to the bottleneck problem.
<li class=fragment /> Allows the decoder to have direct access to all the hidden states of the encoder.
<li class=fragment /> At each step, the decoder can focus on the part of the input sentence that is most relevant.
<li class=fragment /> Attention mechanisms are internal transformations with trainable parameters: <em> they can be learned jointly with the rest of the model </em>
</ul>
</textarea></section>
<section data-markdown data-transition="fade-in"><textarea data-template>
<h2> Sequence-to-sequence Task With Attention</h2>
<ul>
<li/> Generate 1st decoder hidden vector as before.
<img src="images/attention_singh_00.svg" class=large />
</ul>
</textarea></section>
<section data-markdown data-transition="fade-in"><textarea data-template>
<h2> Sequence-to-sequence Task</h2>
<ul>
<li/> Calculate a similarity score between the decoder hidden vector and all the encoder hidden vectors.
<img src="images/attention_singh_01.svg" class=large />
<li /> Encoder hidden states $\mathbf{h} = h_1, ..., h_N \in \mathbb{R}^h$
</ul>
</textarea></section>
<section data-markdown data-transition="fade-in"><textarea data-template>
<h2> Sequence-to-sequence Task</h2>
<ul>
<li/> Calculate a similarity score between the decoder hidden vector and all the encoder hidden vectors.
<img src="images/attention_singh_02.svg" class=large />
<li /> Decoder states $s_t \in \mathbb{R}^h$
<li /> Similarity scores (alignment) $e_t = [s_t^\top h_1, ..., s_t^\top h_N] \in \mathbb{R}^N$
</ul>
</textarea></section>
<section data-markdown data-transition="fade-in"><textarea data-template>
<h2> Sequence-to-sequence Task</h2>
<ul>
<li/> Pass similarity score vector through a softmax to generate attention weights.
<img src="images/attention_singh_03.svg" class=large />
<li /> Attention weights $\alpha_t = softmax(e_t) \in \mathbb{R}^N$
</ul>
</textarea></section>
<section data-markdown data-transition="fade-in"><textarea data-template>
<h2> Sequence-to-sequence Task</h2>
<ul>
<li/> Use the attention weights to generate the context vector by taking a weighted sum of the encoder hidden vectors.
<img src="images/attention_singh_04.svg" class=large />
<li /> The context vector is a weighted sum of the encoder states: $a_t = \sum_{i=1}^N \alpha_{t,i} h_i \in \mathbb{R}^h$ (sketched below)
</ul>
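<p class=fragment>A minimal sketch of these three steps for one decoder step, using dot-product attention; the encoder states $\mathbf{h}$ and the decoder state $s_t$ are random placeholders here:</p>
<pre><code class="Python" data-trim data-noescape>
import torch
import torch.nn.functional as F
N, h_dim = 6, 512                       # N source words, hidden size h
H = torch.randn(N, h_dim)               # encoder hidden states h_1 ... h_N
s_t = torch.randn(h_dim)                # current decoder state
e_t = H @ s_t                           # similarity scores, shape (N,)
alpha_t = F.softmax(e_t, dim=0)         # attention weights, sum to 1
a_t = alpha_t @ H                       # context vector, shape (h,)
</code></pre>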
</textarea></section>
<section data-markdown data-transition="fade-in"><textarea data-template>
<h2> Sequence-to-sequence Task</h2>
<ul>
<li />Concatenate the context vector to the decoder hidden vector and generate the first word.
<img src="images/attention_singh_06.svg" class=large />
</ul>
</textarea></section>
<section data-markdown data-transition="fade-in"><textarea data-template>
<h2> Sequence-to-sequence Task</h2>
<ul>
<li />And proceed as before
<img src="images/attention_singh_07.svg" class=large />
</ul>
</textarea></section>
<section data-markdown data-transition="fade-in"><textarea data-template>
<h2> Sequence-to-sequence Task</h2>
<ul>
<li />Translation is complete when the END token is generated
<img src="images/attention_singh_08.svg" class=large />
<li /> Concatenate $a_t$ and decoder state $s_t$ together and train as in the non-attention model
</ul>
</textarea></section>
<section data-markdown data-transition="fade-in"><textarea data-template>
<h2> Attention: Interpretation </h2>
<ul>
<li /> Learned alignments: each pixel shows the weight $\alpha_{ij}$ of the $i$-th source word for the $j$-th target word
<img src="images/alignment.png" class=large />
<p class=ref>Bahdanau et al. 2015</p>
</ul>
</textarea></section>
<section data-markdown><textarea data-template>
<h2> Attention: A more general definition</h2>
<ul>
<li/>Definition: Given a set of <b>value vectors</b> (encoded state $h_t$) and a <b>query vector</b> (decoded state, $s_t$), attention is a technique to compute a weighted sum of the values, dependent on the query, using an attention function.
<li/> The weighted sum is a selective summary of the information contained in the values, where the query determines which values to focus on.
<li />There are a number of attention functions that are commonly used (see the sketch after this list).
<ul>
<li /> Dot product attention: $$e_{ij} = \mathbf{s}^\top_j \mathbf{h}_i \in \mathbb{R}$$
<li /> Luong et al. 2015: $$e_{ij} = \mathbf{s}^\top_j W \mathbf{h}_i \in \mathbb{R}$$
<li /> Bahdanau et al. 2015: $$e_{ij} = V^\top \tanh(W_s \mathbf{s}_j + W_h \mathbf{h}_i) \in \mathbb{R}$$
</ul>
</ul>
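<p class=fragment>A minimal sketch of the three score functions for a single pair $(\mathbf{s}_j, \mathbf{h}_i)$; the parameters $W$, $W_s$, $W_h$ and $V$ are random here but would be trained with the model:</p>
<pre><code class="Python" data-trim data-noescape>
import torch
d = 512
s_j, h_i = torch.randn(d), torch.randn(d)
W = torch.randn(d, d)                               # Luong et al. 2015
W_s, W_h, V = torch.randn(d, d), torch.randn(d, d), torch.randn(d)  # Bahdanau et al. 2015
e_dot = s_j @ h_i                                   # dot-product attention
e_luong = s_j @ W @ h_i                             # bilinear attention
e_bahdanau = V @ torch.tanh(W_s @ s_j + W_h @ h_i)  # additive attention
</code></pre>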
</textarea></section>
<section data-markdown><textarea data-template>
<h2> Stacked Attention </h2>
<img src="images/stacked_attention.svg" class=large />
</textarea></section>
<section data-markdown><textarea data-template>
<h2> Seq2Seq with Attention Tutorial (PyTorch) </h2>
<img src="https://pytorch.org/tutorials/_images/seq2seq.png" class=large />
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://drive.google.com/open?id=1tThcNewnf1ChK36luba1IeTgxXnLhxgL)
</textarea></section>
</div>
</div>
<!-- End of slides -->
<script src="../reveal.js/lib/js/head.min.js"></script>
<script src="js/reveal.js"></script>
<script>
Reveal.configure({ pdfMaxPagesPerSlide: 1, hash: true, slideNumber: true})
Reveal.initialize({
mouseWheel: false,
width: 1280,
height: 720,
margin: 0.0,
navigationMode: 'grid',
transition: 'slide',
menu: { // Menu works best with font-awesome installed: sudo apt-get install fonts-font-awesome
themes: false,
transitions: false,
markers: true,
hideMissingTitles: true,
custom: [
{ title: 'Plugins', icon: '<i class="fa fa-external-link-alt"></i>', src: 'toc.html' },
{ title: 'About', icon: '<i class="fa fa-info"></i>', src: 'about.html' }
]
},
math: {
mathjax: 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js',
config: 'TeX-AMS_HTML-full', // See http://docs.mathjax.org/en/latest/config-files.html
// pass other options into `MathJax.Hub.Config()`
TeX: { Macros: { Dp: ["\\frac{\\partial #1}{\\partial #2}",2] }}
//TeX: { Macros: { Dp#1#2: },
},
chalkboard: {
src:'slides_2-chalkboard.json',
penWidth : 1.0,
chalkWidth : 1.5,
chalkEffect : .5,
readOnly: false,
toggleChalkboardButton: { left: "80px" },
toggleNotesButton: { left: "130px" },
transition: 100,
theme: "whiteboard",
},
menu : { titleSelector: 'h1', hideMissingTitles: true,},
keyboard: {
67: function() { RevealChalkboard.toggleNotesCanvas() }, // toggle notes canvas when 'c' is pressed
66: function() { RevealChalkboard.toggleChalkboard() }, // toggle chalkboard when 'b' is pressed
46: function() { RevealChalkboard.reset() }, // reset chalkboard data on current slide when 'BACKSPACE' is pressed
68: function() { RevealChalkboard.download() }, // downlad recorded chalkboard drawing when 'd' is pressed
88: function() { RevealChalkboard.colorNext() }, // cycle colors forward when 'x' is pressed
89: function() { RevealChalkboard.colorPrev() }, // cycle colors backward when 'y' is pressed
},
dependencies: [
{ src: 'node_modules/reveald3/reveald3.js' },
{ src: '../reveal.js/lib/js/classList.js', condition: function() { return !document.body.classList; } },
{ src: 'plugin/markdown/marked.js' },
{ src: 'plugin/markdown/markdown.js' },
{ src: 'plugin/notes/notes.js', async: true },
{ src: 'plugin/highlight/highlight.js', async: true, languages: ["Python"] },
{ src: 'plugin/math/math.js', async: true },
{ src: 'external-plugins/chalkboard/chalkboard.js' },
//{ src: 'external-plugins/menu/menu.js'},
{ src: 'node_modules/reveal.js-menu/menu.js' }
]
});
</script>
</body>
</html>
<!-- Instructor notes:
Three types of recurrent modules, by task:
- Classification: an encoder module whose single final state encodes the entire sentence, which is well suited for classification.
Word embeddings: the "tezguino" example (a word's meaning can be inferred from the contexts it appears in).
Negative sampling is used to overcome the cost of the softmax normalization; no negative sampling is needed for the sequence model.
Explain in #3 why we use log probabilities.
-->