Articles

1992

Simple Statistical Gradient-Following for Connectionist Reinforcement Learning

Authors: Ronald J. Williams

Abstract: This article presents a general class of associative reinforcement learning algorithms for connectionist networks containing stochastic units. These algorithms, called REINFORCE algorithms, are shown to make weight adjustments in a direction that lies along the gradient of expected reinforcement in both immediate-reinforcement tasks and certain limited forms of delayed-reinforcement tasks, and they do this without explicitly computing gradient estimates or even storing information from which such estimates could be computed. Specific examples of such algorithms are presented, some of which bear a close relationship to certain existing algorithms while others are novel but potentially interesting in their own right. Also given are results that show how such algorithms can be naturally integrated with backpropagation. We close with a brief discussion of a number of additional issues surrounding the use of such algorithms, including what is known about their limiting behaviors as well as further considerations that might be used to help develop similar but potentially more powerful reinforcement learning algorithms.

URL: http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf

Notes: The famous REINFORCE algorithm is presented here.

2009

A Scalable Hierarchical Distributed Language Model

Authors: Andriy Mnih, Geoffrey Hinton

Abstract: Neural probabilistic language models (NPLMs) have been shown to be competitive with and occasionally superior to the widely-used n-gram language models. The main drawback of NPLMs is their extremely long training and testing times. Morin and Bengio have proposed a hierarchical language model built around a binary tree of words, which was two orders of magnitude faster than the nonhierarchical model it was based on. However, it performed considerably worse than its non-hierarchical counterpart in spite of using a word tree created using expert knowledge. We introduce a fast hierarchical language model along with a simple feature-based algorithm for automatic construction of word trees from the data. We then show that the resulting models can outperform non-hierarchical neural models as well as the best n-gram models.

URL: http://papers.nips.cc/paper/3583-a-scalable-hierarchical-distributed-language-model.pdf

Notes: one of the first tries to optimize softmax, they build a tree for vocab with EM algorithm

2011

On optimization methods for deep learning

Authors: Quoc V. Le, Jiquan Ngiam, Adam Coates, Abhik Lahiri, Bobby Prochnow, Andrew Y. Ng

Abstract: The predominant methodology in training deep learning advocates the use of stochastic gradient descent methods (SGDs). Despite its ease of implementation, SGDs are difficult to tune and parallelize. These problems make it challenging to develop, debug and scale up deep learning algorithms with SGDs. In this paper, we show that more sophisticated off-the-shelf optimization methods such as Limited memory BFGS (L-BFGS) and Conjugate gradient (CG) with line search can significantly simplify and speed up the process of pretraining deep algorithms. In our experiments, the difference between LBFGS/CG and SGDs are more pronounced if we consider algorithmic extensions (e.g., sparsity regularization) and hardware extensions (e.g., GPUs or computer clusters). Our experiments with distributed optimization support the use of L-BFGS with locally connected networks and convolutional neural networks. Using L-BFGS, our convolutional network model achieves 0.69% on the standard MNIST dataset. This is a state-of-theart result on MNIST among algorithms that do not use distortions or pretraining.

URL: https://cs.stanford.edu/~acoates/papers/LeNgiCoaLahProNg11.pdf

Notes: old paper from Quoc Le & Andrew Ng on usage of conjugate grad & L-BFGS; they do give faster convergence

Natural Language Processing (almost) from Scratch

Authors: Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, Pavel Kuksa

Abstract: We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including: part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements.

URL: https://arxiv.org/abs/1103.0398

Notes: fundamental paper in Neural NLP, it has almost everything you can imagine: NER, PoS tagging, etc.; multi-tasking; models: linear, conv+maxpool; unusual now hard tanh; linear instead of softmax; strangely no RNNs; unsupervised training for word emb; SGD

2012

A Fast and Simple Algorithm for Training Neural Probabilistic Language Models

Authors: Andriy Mnih, Yee Whye Teh

Abstract: In spite of their superior performance, neural probabilistic language models (NPLMs) remain far less widely used than n-gram models due to their notoriously long training times, which are measured in weeks even for moderately-sized datasets. Training NPLMs is computationally expensive because they are explicitly normalized, which leads to having to consider all words in the vocabulary when computing the log-likelihood gradients. We propose a fast and simple algorithm for training NPLMs based on noise-contrastive estimation, a newly introduced procedure for estimating unnormalized continuous distributions. We investigate the behaviour of the algorithm on the Penn Treebank corpus and show that it reduces the training times by more than an order of magnitude without affecting the quality of the resulting models. The algorithm is also more efficient and much more stable than importance sampling because it requires far fewer noise samples to perform well. We demonstrate the scalability of the proposed approach by training several neural language models on a 47M-word corpus with a 80K-word vocabulary, obtaining state-of-the-art results on the Microsoft Research Sentence Completion Challenge dataset.

URL: https://arxiv.org/abs/1206.6426

Notes: basic paper for Noise Contrastive Estimation

2013

Concurrent Reinforcement Learning from Customer Interactions.

Authors: D Silver, L Newnham, D Barker, S Weller, J McFall

Abstract: Abstract In this paper, we explore applications in which a company interacts concurrently with many customers. The company has an objective function, such as maximising revenue, customer satisfaction, or customer loyalty, which depends primarily on the sequence of interactions between company and customer. A key aspect of this setting is that interactions with different customers occur in parallel. As a result, it is imperative to learn online from partial interaction sequences, so that information acquired from one customer is efficiently assimilated and applied in subsequent interactions with other customers. We present the first framework for concurrent reinforcement learning, using a variant of temporal-difference learning to learn efficiently from partial interaction sequences. We evaluate our algorithms in two largescale test-beds for online and email interaction respectively, generated from a database of 300,000 customer records.

URL: http://www.jmlr.org/proceedings/papers/v28/silver13.pdf

Notes: Old paper from Silver about RL in practice.

Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

Authors: Yoshua Bengio, Nicholas Léonard, Aaron Courville

Abstract: Stochastic neurons and hard non-linearities can be useful for a number of reasons in deep learning models, but in many cases they pose a challenging problem: how to estimate the gradient of a loss function with respect to the input of such stochastic or non-smooth neurons? I.e., can we "back-propagate" through these stochastic neurons? We examine this question, existing approaches, and compare four families of solutions, applicable in different settings. One of them is the minimum variance unbiased gradient estimator for stochatic binary neurons (a special case of the REINFORCE algorithm). A second approach, introduced here, decomposes the operation of a binary stochastic neuron into a stochastic binary part and a smooth differentiable part, which approximates the expected effect of the pure stochatic binary neuron to first order. A third approach involves the injection of additive or multiplicative noise in a computational graph that is otherwise differentiable. A fourth approach heuristically copies the gradient with respect to the stochastic output directly as an estimator of the gradient with respect to the sigmoid argument (we call this the straight-through estimator). To explore a context where these estimators are useful, we consider a small-scale version of {\em conditional computation}, where sparse stochastic units form a distributed representation of gaters that can turn off in combinatorially many ways large chunks of the computation performed in the rest of the neural network. In this case, it is important that the gating units produce an actual 0 most of the time. The resulting sparsity can be potentially be exploited to greatly reduce the computational cost of large deep networks for which conditional computation would be useful.

URL: https://arxiv.org/abs/1308.3432

Notes: Straight-Through estimator is just threshold activation, which is considered identity for backprop; it is biased, but still ok for one such layer

Learning word embeddings efficiently with noise-contrastive estimation

Authors: Andriy Mnih, Koray Kavukcuoglu

Abstract: Continuous-valued word embeddings learned by neural language models have recently been shown to capture semantic and syntactic information about words very well, setting performance records on several word similarity tasks. The best results are obtained by learning high-dimensional embeddings from very large quantities of data, which makes scalability of the training method a critical factor. We propose a simple and scalable new approach to learning word embeddings based on training log-bilinear models with noise-contrastive estimation. Our approach is simpler, faster, and produces better results than the current state-of-the art method of Mikolov et al. (2013a). We achieve results comparable to the best ones reported, which were obtained on a cluster, using four times less data and more than an order of magnitude less computing time. We also investigate several model types and find that the embeddings learned by the simpler models perform at least as well as those learned by the more complex ones.

URL: https://papers.nips.cc/paper/5165-learning-word-embeddings-efficiently-with-noise-contrastive-estimation

Notes: long missing paper on noise-contrasice estimation; NCE for word vectors introduced here; also positional encoding for window

Exploiting Similarities among Languages for Machine Translation

Authors: Tomas Mikolov, Quoc V. Le, Ilya Sutskever

Abstract: Dictionaries and phrase tables are the basis of modern statistical machine translation systems. This paper develops a method that can automate the process of generating and extending dictionaries and phrase tables. Our method can translate missing word and phrase entries by learning language structures based on large monolingual data and mapping between languages from small bilingual data. It uses distributed representation of words and learns a linear mapping between vector spaces of languages. Despite its simplicity, our method is surprisingly effective: we can achieve almost 90% precision@5 for translation of words between English and Spanish. This method makes little assumption about the languages, so it can be used to extend and refine dictionaries and translation tables for any language pairs.

URL: https://arxiv.org/abs/1309.4168

Notes: first paper to the best of my knowledge where exploted the idea of a) automatic mapping of monolingual embeddings and и) matrix transfrom for that mapping; also it's from Mikolov

2014

A Clockwork RNN

Authors: Jan Koutník, Klaus Greff, Faustino Gomez, Jürgen Schmidhuber

Abstract: Sequence prediction and classification are ubiquitous and challenging problems in machine learning that can require identifying complex dependencies between temporally distant inputs. Recurrent Neural Networks (RNNs) have the ability, in theory, to cope with these temporal dependencies by virtue of the short-term memory implemented by their recurrent (feedback) connections. However, in practice they are difficult to train successfully when the long-term memory is required. This paper introduces a simple, yet powerful modification to the standard RNN architecture, the Clockwork RNN (CW-RNN), in which the hidden layer is partitioned into separate modules, each processing inputs at its own temporal granularity, making computations only at its prescribed clock rate. Rather than making the standard RNN models more complex, CW-RNN reduces the number of RNN parameters, improves the performance significantly in the tasks tested, and speeds up the network evaluation. The network is demonstrated in preliminary experiments involving two tasks: audio signal generation and TIMIT spoken word classification, where it outperforms both RNN and LSTM networks.

URL: http://arxiv.org/abs/1402.3511

Notes: base paper for intermediate steps in RNNs

A Convolutional Neural Network for Modelling Sentences

Authors: Nal Kalchbrenner, Edward Grefenstette, Phil Blunsom

Abstract: The ability to accurately represent sentences is central to language understanding. We describe a convolutional architecture dubbed the Dynamic Convolutional Neural Network (DCNN) that we adopt for the semantic modelling of sentences. The network uses Dynamic k-Max Pooling, a global pooling operation over linear sequences. The network handles input sentences of varying length and induces a feature graph over the sentence that is capable of explicitly capturing short and long-range relations. The network does not rely on a parse tree and is easily applicable to any language. We test the DCNN in four experiments: small scale binary and multi-class sentiment prediction, six-way question classification and Twitter sentiment prediction by distant supervision. The network achieves excellent performance in the first three tasks and a greater than 25% error reduction in the last task with respect to the strongest baseline.

URL: https://arxiv.org/abs/1404.2188

Notes: very nice idea of dynamic k-max pooling, so we could produce trees (effectively) with convolutions

Neural Machine Translation by Jointly Learning to Align and Translate

Authors: Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio

Abstract: Neural machine translation is a recently proposed approach to machine translation. Unlike the traditional statistical machine translation, the neural machine translation aims at building a single neural network that can be jointly tuned to maximize the translation performance. The models proposed recently for neural machine translation often belong to a family of encoder-decoders and consists of an encoder that encodes a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.

URL: https://arxiv.org/abs/1409.0473

Notes: missing for some unknown reason great paper from Bahdanau with introduction of attention mechanism; greatly improved NMT SotA then; attention itself is a sum of encoder outputs weighted by similarity of each output to current decoder context (hidden state)

2015

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Authors: Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio

Abstract: Inspired by recent work in machine translation and object detection, we introduce an attention based model that automatically learns to describe the content of images. We describe how we can train this model in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound. We also show through visualization how the model is able to automatically learn to fix its gaze on salient objects while generating the corresponding words in the output sequence. We validate the use of attention with state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO.

URL: https://arxiv.org/abs/1502.03044

Notes: baseline paper in image captioning, but more important: good explanation of stochastic "hard" attention

Cyclical Learning Rates for Training Neural Networks

Authors: Leslie N. Smith

Abstract: It is known that the learning rate is the most important hyper-parameter to tune for training deep neural networks. This paper describes a new method for setting the learning rate, named cyclical learning rates, which practically eliminates the need to experimentally find the best values and schedule for the global learning rates. Instead of monotonically decreasing the learning rate, this method lets the learning rate cyclically vary between reasonable boundary values. Training with cyclical learning rates instead of fixed values achieves improved classification accuracy without a need to tune and often in fewer iterations. This paper also describes a simple way to estimate "reasonable bounds" -- linearly increasing the learning rate of the network for a few epochs. In addition, cyclical learning rates are demonstrated on the CIFAR-10 and CIFAR-100 datasets with ResNets, Stochastic Depth networks, and DenseNets, and the ImageNet dataset with the AlexNet and GoogLeNet architectures. These are practical tools for everyone who trains neural networks.

URL: https://arxiv.org/abs/1506.01186

Notes: self-describing title, seems to really have some potential - with CLR the learning curve is steeper, could be useful, when we need fast training

A Neural Attention Model for Abstractive Sentence Summarization

Authors: Alexander M. Rush, Sumit Chopra, Jason Weston

Abstract: Summarization based on text extraction is inherently limited, but generation-style abstractive methods have proven challenging to build. In this work, we propose a fully data-driven approach to abstractive sentence summarization. Our method utilizes a local attention-based model that generates each word of the summary conditioned on the input sentence. While the model is structurally simple, it can easily be trained end-to-end and scales to a large amount of training data. The model shows significant performance gains on the DUC-2004 shared task compared with several strong baselines.

URL: https://arxiv.org/abs/1509.00685

Notes: simple paper for today: make summarization using Bahdanau attention, but back in a day it was hot

Unitary Evolution Recurrent Neural Networks

Authors: Martin Arjovsky, Amar Shah, Yoshua Bengio

Abstract: Recurrent neural networks (RNNs) are notoriously difficult to train. When the eigenvalues of the hidden to hidden weight matrix deviate from absolute value 1, optimization becomes difficult due to the well studied issue of vanishing and exploding gradients, especially when trying to learn long-term dependencies. To circumvent this problem, we propose a new architecture that learns a unitary weight matrix, with eigenvalues of absolute value exactly 1. The challenge we address is that of parametrizing unitary matrices in a way that does not require expensive computations (such as eigendecomposition) after each weight update. We construct an expressive unitary weight matrix by composing several structured matrices that act as building blocks with parameters to be learned. Optimization with this parameterization becomes feasible only when considering hidden states in the complex domain. We demonstrate the potential of this architecture by achieving state of the art results in several hard tasks involving very long-term dependencies.

URL: https://arxiv.org/abs/1511.06464

Notes:

Adversarial Autoencoders

Authors: Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, Brendan Frey

Abstract: In this paper, we propose the "adversarial autoencoder" (AAE), which is a probabilistic autoencoder that uses the recently proposed generative adversarial networks (GAN) to perform variational inference by matching the aggregated posterior of the hidden code vector of the autoencoder with an arbitrary prior distribution. Matching the aggregated posterior to the prior ensures that generating from any part of prior space results in meaningful samples. As a result, the decoder of the adversarial autoencoder learns a deep generative model that maps the imposed prior to the data distribution. We show how the adversarial autoencoder can be used in applications such as semi-supervised classification, disentangling style and content of images, unsupervised clustering, dimensionality reduction and data visualization. We performed experiments on MNIST, Street View House Numbers and Toronto Face datasets and show that adversarial autoencoders achieve competitive results in generative modeling and semi-supervised classification tasks.

URL: https://arxiv.org/abs/1511.05644

Notes: Basic paper for adversarial autoencoders.

A Theoretically Grounded Application of Dropout in Recurrent Neural Networks

Authors: Yarin Gal, Zoubin Ghahramani

Abstract: Recurrent neural networks (RNNs) stand at the forefront of many recent developments in deep learning. Yet a major difficulty with these models is their tendency to overfit, with dropout shown to fail when applied to recurrent layers. Recent results at the intersection of Bayesian modelling and deep learning offer a Bayesian interpretation of common deep learning techniques such as dropout. This grounding of dropout in approximate Bayesian inference suggests an extension of the theoretical results, offering insights into the use of dropout with RNN models. We apply this new variational inference based dropout technique in LSTM and GRU models, assessing it on language modelling and sentiment analysis tasks. The new approach outperforms existing techniques, and to the best of our knowledge improves on the single model state-of-the-art in language modelling with the Penn Treebank (73.4 test perplexity). This extends our arsenal of variational tools in deep learning.

URL: https://arxiv.org/abs/1512.05287

Notes: Bayesianly grounded dropout for RNNs.

Network Representation Learning with Rich Text Information

Authors: Cheng Yang, Zhiyuan Liu, Deli Zhao, Maosong Sun, Edward Y. Chang

Abstract: Representation learning has shown its effectiveness in many tasks such as image classification and text mining. Network representation learning aims at learning distributed vector representation for each vertex in a network, which is also increasingly recognized as an important aspect for network analysis. Most network representation learning methods investigate network structures for learning. In reality, network vertices contain rich information (such as text), which cannot be well applied with algorithmic frameworks of typical representation learning methods. By proving that DeepWalk, a state-of-the-art network representation method, is actually equivalent to matrix factorization (MF), we propose text-associated DeepWalk (TADW). TADW incorporates text features of vertices into network representation learning under the framework of matrix factorization. We evaluate our method and various baseline methods by applying them to the task of multi-class classification of vertices. The experimental results show that, our method outperforms other baselines on all three datasets, especially when networks are noisy and training ratio is small. The source code of this paper can be obtained from https://github.com/albertyang33/TADW .

URL: https://www.ijcai.org/Proceedings/15/Papers/299.pdf

Notes: DeepWalk (SkipGram on random walk sequences over a graph) with text features

Language Understanding for Text-based Games Using Deep Reinforcement Learning

Authors: Karthik Narasimhan, Tejas Kulkarni, Regina Barzilay

Abstract: In this paper, we consider the task of learning control policies for text-based games. In these games, all interactions in the virtual world are through text and the underlying state is not observed. The resulting language barrier makes such environments challenging for automatic game players. We employ a deep reinforcement learning framework to jointly learn state representations and action policies using game rewards as feedback. This framework enables us to map text descriptions into vector representations that capture the semantics of the game states. We evaluate our approach on two game worlds, comparing against baselines using bag-of-words and bag-of-bigrams for state representations. Our algorithm outperforms the baselines on both worlds demonstrating the importance of learning expressive representations.

URL: https://arxiv.org/abs/1506.08941

Notes: text quest RL-solving (LSTM-DQN they call it) with code for generation environment

Neural Machine Translation of Rare Words with Subword Units

Authors: Rico Sennrich, Barry Haddow, Alexandra Birch

Abstract: Neural machine translation (NMT) models typically operate with a fixed vocabulary, but translation is an open-vocabulary problem. Previous work addresses the translation of out-of-vocabulary words by backing off to a dictionary. In this paper, we introduce a simpler and more effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units. This is based on the intuition that various word classes are translatable via smaller units than words, for instance names (via character copying or transliteration), compounds (via compositional translation), and cognates and loanwords (via phonological and morphological transformations). We discuss the suitability of different word segmentation techniques, including simple character n-gram models and a segmentation based on the byte pair encoding compression algorithm, and empirically show that subword models improve over a back-off dictionary baseline for the WMT 15 translation tasks English-German and English-Russian by 1.1 and 1.3 BLEU, respectively.

URL: https://arxiv.org/abs/1508.07909

Notes: BPE for machine translation was proposed here, now it is widely used outside NMT also

2016-01

Incorporating Structural Alignment Biases into an Attentional Neural Translation Model

Authors: Trevor Cohn, Cong Duy Vu Hoang, Ekaterina Vymolova, Kaisheng Yao, Chris Dyer, Gholamreza Haffari

Abstract: Neural encoder-decoder models of machine translation have achieved impressive results, rivalling traditional translation models. However their modelling formulation is overly simplistic, and omits several key inductive biases built into traditional models. In this paper we extend the attentional neural translation model to include structural biases from word based alignment models, including positional bias, Markov conditioning, fertility and agreement over translation directions. We show improvements over a baseline attentional model and standard phrase-based model over several language pairs, evaluating on difficult languages in a low resource setting.

URL: https://arxiv.org/abs/1601.01085

Notes: interesting paper on attention in NMT: adding positional bias and fertility penalty for alignment improving results of the model

2016-02

Exploring the Limits of Language Modeling

Authors: Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, Yonghui Wu

Abstract: In this work we explore recent advances in Recurrent Neural Networks for large scale Language Modeling, a task central to language understanding. We extend current models to deal with two key challenges present in this task: corpora and vocabulary sizes, and complex, long term structure of language. We perform an exhaustive study on techniques such as character Convolutional Neural Networks or Long-Short Term Memory, on the One Billion Word Benchmark. Our best single model significantly improves state-of-the-art perplexity from 51.3 down to 30.0 (whilst reducing the number of parameters by a factor of 20), while an ensemble of models sets a new record by improving perplexity from 41.0 down to 23.7. We also release these models for the NLP and ML community to study and improve upon.

URL: https://arxiv.org/abs/1602.02410

Notes: introduction of importance sampling for softmax, aside of exostive study of LM

Massively Multilingual Word Embeddings

Authors: Waleed Ammar, George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, Noah A. Smith

Abstract: We introduce new methods for estimating and evaluating embeddings of words in more than fifty languages in a single shared embedding space. Our estimation methods, multiCluster and multiCCA, use dictionaries and monolingual data; they do not require parallel data. Our new evaluation method, multiQVEC-CCA, is shown to correlate better than previous ones with two downstream tasks (text categorization and parsing). We also describe a web portal for evaluation that will facilitate further research in this area, along with open-source releases of all our methods.

URL: https://arxiv.org/abs/1602.01925

Notes: multilingual embeddings trained each on own dataset and mapped to one by dictionary of parallel words

2016-03

A Persona-Based Neural Conversation Model

Authors: Jiwei Li, Michel Galley, Chris Brockett, Georgios P. Spithourakis, Jianfeng Gao, Bill Dolan

Abstract: We present persona-based models for handling the issue of speaker consistency in neural response generation. A speaker model encodes personas in distributed embeddings that capture individual characteristics such as background information and speaking style. A dyadic speaker-addressee model captures properties of interactions between two interlocutors. Our models yield qualitative performance improvements in both perplexity and BLEU scores over baseline sequence-to-sequence models, with similar gains in speaker consistency as measured by human judges.

URL: http://nlp.stanford.edu/pubs/jiwei2016Persona.pdf

Notes: people do the additional embedding for speaking person, works really nice on dialog bots

Bayesian Neural Word Embedding

Authors: Oren Barkan

Abstract: Recently, several works in the domain of natural language processing presented successful methods for word embedding. Among them, the Skip-Gram with negative sampling, known also as word2vec, advanced the state-of-the-art of various linguistics tasks. In this paper, we propose a scalable Bayesian neural word embedding algorithm. The algorithm relies on a Variational Bayes solution for the Skip-Gram objective and a detailed step by step description is provided. We present experimental results that demonstrate the performance of the proposed algorithm for word analogy and similarity tasks on six different datasets and show it is competitive with the original Skip-Gram method.

URL: https://arxiv.org/abs/1603.06571

Notes: first usage of reparamentization trick on word embeddings (word2vec in particular) which I'm aware of; the idea is to produce mu and sigma for embeddings instead of actual embeddings in skip-gram optimization procedure

Sentence Pair Scoring: Towards Unified Framework for Text Comprehension

Authors: Petr Baudiš, Jan Pichl, Tomáš Vyskočil, Jan Šedivý

Abstract: We review the task of Sentence Pair Scoring, popular in the literature in various forms - viewed as Answer Sentence Selection, Semantic Text Scoring, Next Utterance Ranking, Recognizing Textual Entailment, Paraphrasing or e.g. a component of Memory Networks. We argue that all such tasks are similar from the model perspective and propose new baselines by comparing the performance of common IR metrics and popular convolutional, recurrent and attention-based neural models across many Sentence Pair Scoring tasks and datasets. We discuss the problem of evaluating randomized models, propose a statistically grounded methodology, and attempt to improve comparisons by releasing new datasets that are much harder than some of the currently used well explored benchmarks. We introduce a unified open source software framework with easily pluggable models and tasks, which enables us to experiment with multi-task reusability of trained sentence model. We set a new state-of-art in performance on the Ubuntu Dialogue dataset.

URL: http://arxiv.org/abs/1603.06127

Notes:

End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF

Authors: Xuezhe Ma, Eduard Hovy

Abstract: State-of-the-art sequence labeling systems traditionally require large amounts of task-specific knowledge in the form of hand-crafted features and data pre-processing. In this paper, we introduce a novel neutral network architecture that benefits from both word- and character-level representations automatically, by using combination of bidirectional LSTM, CNN and CRF. Our system is truly end-to-end, requiring no feature engineering or data pre-processing, thus making it applicable to a wide range of sequence labeling tasks. We evaluate our system on two data sets for two sequence labeling tasks --- Penn Treebank WSJ corpus for part-of-speech (POS) tagging and CoNLL 2003 corpus for named entity recognition (NER). We obtain state-of-the-art performance on both the two data --- 97.55% accuracy for POS tagging and 91.21% F1 for NER.

URL: https://arxiv.org/abs/1603.01354

Notes:

How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation

Authors: Chia-Wei Liu, Ryan Lowe, Iulian V. Serban, Michael Noseworthy, Laurent Charlin, Joelle Pineau

Abstract: We investigate evaluation metrics for dialogue response generation systems where supervised labels, such as task completion, are not available. Recent works in response generation have adopted metrics from machine translation to compare a model's generated response to a single target response. We show that these metrics correlate very weakly with human judgements in the non-technical Twitter domain, and not at all in the technical Ubuntu domain. We provide quantitative and qualitative results highlighting specific weaknesses in existing metrics, and provide recommendations for future development of better automatic evaluation metrics for dialogue systems.

URL: https://arxiv.org/abs/1603.08023

Notes: nice paper showing no statistical correlation between famous ROUGE, BLEU & METEOR with human judgement in dialog system scoring

Incorporating Copying Mechanism in Sequence-to-Sequence Learning

Authors: Jiatao Gu, Zhengdong Lu, Hang Li, Victor O.K. Li

Abstract: We address an important problem in sequence-to-sequence (Seq2Seq) learning referred to as copying, in which certain segments in the input sequence are selectively replicated in the output sequence. A similar phenomenon is observable in human language communication. For example, humans tend to repeat entity names or even long phrases in conversation. The challenge with regard to copying in Seq2Seq is that new machinery is needed to decide when to perform the operation. In this paper, we incorporate copying into neural network-based Seq2Seq learning and propose a new model called CopyNet with encoder-decoder structure. CopyNet can nicely integrate the regular way of word generation in the decoder with the new copying mechanism which can choose sub-sequences in the input sequence and put them at proper places in the output sequence. Our empirical study on both synthetic data sets and real world data sets demonstrates the efficacy of CopyNet. For example, CopyNet can outperform regular RNN-based model with remarkable margins on text summarization tasks.

URL: https://arxiv.org/abs/1603.06393

Notes: first mention of copy from input for OOV word technique in seq2seq

Zero-Shot Learning of Intent Embeddings for Expansion by Convolutional Deep Structured Semantic Models

Authors: Yun-Nung (Vivian) Chen, Dilek Hakkani-Tur, Xiaodong He

Abstract: The recent surge of intelligent personal assistants motivates spoken language understanding of dialogue systems. However, the domain constraint along with the inflexible intent schema remains a big issue. This paper focuses on the task of intent expansion, which helps remove the domain limit and make an intent schema flexible. A con-volutional deep structured semantic model (CDSSM) is applied to jointly learn the representations for human intents and associated utterances. Then it can flexibly generate new intent embeddings without the need of training samples and model-retraining, which bridges the semantic relation between seen and unseen intents and further performs more robust results. Experiments show that CDSSM is capable of performing zero-shot learning effectively, e.g. generating embeddings of previously unseen intents, and therefore expand to new intents without retraining , and outperforms other semantic embeddings. The discussion and analysis of experiments provide a future direction for reducing human effort about annotating data and removing the domain constraint in spoken dialogue systems. Index Terms— zero-shot learning, spoken language understanding (SLU), spoken dialogue system (SDS), convolutional deep structured semantic model (CDSSM), embeddings, expansion.

URL: https://www.researchgate.net/publication/292143255_Zero-Shot_Learning_of_Intent_Embeddings_for_Expansion_by_Convolutional_Deep_Structured_Semantic_Models

Notes: conv DSSM with ability to generate embeddings for new intents in dialog

2016-05

Learning End-to-End Goal-Oriented Dialog

Authors: Antoine Bordes, Y-Lan Boureau, Jason Weston

Abstract: Traditional dialog systems used in goal-oriented applications require a lot of domain-specific handcrafting, which hinders scaling up to new domains. End-to-end dialog systems, in which all components are trained from the dialogs themselves, escape this limitation. But the encouraging success recently obtained in chit-chat dialog may not carry over to goal-oriented settings. This paper proposes a testbed to break down the strengths and shortcomings of end-to-end dialog systems in goal-oriented applications. Set in the context of restaurant reservation, our tasks require manipulating sentences and symbols, so as to properly conduct conversations, issue API calls and use the outputs of such calls. We show that an end-to-end dialog system based on Memory Networks can reach promising, yet imperfect, performance and learn to perform non-trivial operations. We confirm those results by comparing our system to a hand-crafted slot-filling baseline on data from the second Dialog State Tracking Challenge (Henderson et al., 2014a). We show similar result patterns on data extracted from an online concierge service.

URL: https://arxiv.org/abs/1605.07683

Notes: memn2n for dialog systems

2016-06

End-to-end LSTM-based dialog control optimized with supervised and reinforcement learning

Authors: Jason D. Williams, Geoffrey Zweig

Abstract: This paper presents a model for end-to-end learning of task-oriented dialog systems. The main component of the model is a recurrent neural network (an LSTM), which maps from raw dialog history directly to a distribution over system actions. The LSTM automatically infers a representation of dialog history, which relieves the system developer of much of the manual feature engineering of dialog state. In addition, the developer can provide software that expresses business rules and provides access to programmatic APIs, enabling the LSTM to take actions in the real world on behalf of the user. The LSTM can be optimized using supervised learning (SL), where a domain expert provides example dialogs which the LSTM should imitate; or using reinforcement learning (RL), where the system improves by interacting directly with end users. Experiments show that SL and RL are complementary: SL alone can derive a reasonable initial policy from a small number of training dialogs; and starting RL optimization with a policy trained with SL substantially accelerates the learning rate of RL.

URL: https://arxiv.org/abs/1606.01269

Notes: SL/RL joint approach to make a dialog system, state is stored in LSTM, and actions (like in slot-systems) are chosen by RL-agent; active learning

Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations

Authors: David Krueger, Tegan Maharaj, János Kramár, Mohammad Pezeshki, Nicolas Ballas, Nan Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Aaron Courville, Chris Pal

Abstract: We propose zoneout, a novel method for regularizing RNNs. At each timestep, zoneout stochastically forces some hidden units to maintain their previous values. Like dropout, zoneout uses random noise to train a pseudo-ensemble, improving generalization. But by preserving instead of dropping hidden units, gradient information and state information are more readily propagated through time, as in feedforward stochastic depth networks. We perform an empirical investigation of various RNN regularizers, and find that zoneout gives significant performance improvements across tasks. We achieve competitive results with relatively simple models in character- and word-level language modelling on the Penn Treebank and Text8 datasets, and combining with recurrent batch normalization yields state-of-the-art results on permuted sequential MNIST.

URL: https://arxiv.org/abs/1606.01305

Notes: recently updated paper about zoneout, a dropout for RNNs

A Decomposable Attention Model for Natural Language Inference

Authors: Ankur P. Parikh, Oscar Täckström, Dipanjan Das, Jakob Uszkoreit

Abstract: We propose a simple neural architecture for natural language inference. Our approach uses attention to decompose the problem into subproblems that can be solved separately, thus making it trivially parallelizable. On the Stanford Natural Language Inference (SNLI) dataset, we obtain state-of-the-art results with almost an order of magnitude fewer parameters than previous work and without relying on any word-order information. Adding intra-sentence attention that takes a minimum amount of order into account yields further improvements.

URL: https://arxiv.org/abs/1606.01933

Notes: attention-based alignment matrix for texts comparison, SOTA past year

2016-07

Representation learning for very short texts using weighted word embedding aggregation

Authors: Cedric De Boom, Steven Van Canneyt, Thomas Demeester, Bart Dhoedt

Abstract: Short text messages such as tweets are very noisy and sparse in their use of vocabulary. Traditional textual representations, such as tf-idf, have difficulty grasping the semantic meaning of such texts, which is important in applications such as event detection, opinion mining, news recommendation, etc. We constructed a method based on semantic word embeddings and frequency information to arrive at low-dimensional representations for short texts designed to capture semantic similarity. For this purpose we designed a weight-based model and a learning procedure based on a novel median-based loss function. This paper discusses the details of our model and the optimization methods, together with the experimental results on both Wikipedia and Twitter data. We find that our method outperforms the baseline approaches in the experiments, and that it generalizes well on different word embeddings without retraining. Our method is therefore capable of retaining most of the semantic information in the text, and is applicable out-of-the-box.

URL: http://arxiv.org/abs/1607.00570

Notes:

Recurrent Neural Machine Translation

Authors: Biao Zhang, Deyi Xiong, Jinsong Su

Abstract: The vanilla attention-based neural machine translation has achieved promising performance because of its capability in leveraging varying-length source annotations. However, this model still suffers from failures in long sentence translation, for its incapability in capturing long-term dependencies. In this paper, we propose a novel recurrent neural machine translation (RNMT), which not only preserves the ability to model varying-length source annotations but also better captures long-term dependencies. Instead of the conventional attention mechanism, RNMT employs a recurrent neural network to extract the context vector, where the target-side previous hidden state serves as its initial state, and the source annotations serve as its inputs. We refer to this new component as contexter. As the encoder, contexter and decoder in our model are all derivable recurrent neural networks, our model can still be trained end-to-end on large-scale corpus via stochastic algorithms. Experiments on Chinese-English translation tasks demonstrate the superiority of our model to attention-based neural machine translation, especially on long sentences. Besides, further analysis of the contexter revels that our model can implicitly reflect the alignment to source sentence.

URL: http://arxiv.org/abs/1607.08725

Notes:

Convolutional Neural Networks Analyzed via Convolutional Sparse Coding

Authors: Papyan Vardan, Yaniv Romano, Michael Elad

Abstract: Convolutional neural networks (CNN) have led to remarkable results in various fields. In this scheme, a signal is convolved with learned filters and a non-linear function is applied on the response map. The obtained result is then fed to another layer that operates similarly, thereby creating a multi-layered structure. Despite its empirical success, a theoretical understanding of this scheme, termed forward pass, is lacking. Another popular paradigm is the sparse representation model, which assumes that a signal can be described as the multiplication of a dictionary by a sparse vector. A special case of this is the convolutional sparse coding (CSC) model, in which the dictionary assumes a convolutional structure. Unlike CNN, sparsity inspired models are accompanied by a thorough theoretical analysis. Indeed, such a study of the CSC model has been performed in a recent two-part work, establishing it as a reliable alternative to the common patch-based processing. Herein, we leverage the study of the CSC model, and bring a fresh view to CNN with a deeper theoretical understanding. Our analysis relies on the observation that akin to the signal, the sparse vector can also be modeled as a sparse composition of yet another set of atoms from a convolutional dictionary. This can be extended to more than two layers, resulting in our proposed multi-layered convolutional sparse model. In this work we address the following questions: 1) What is the relation between the CNN and the proposed model? 2) In particular, can we interpret the forward pass as a pursuit? 3) If so, can we leverage this connection to provide a theoretical foundation for the forward pass? Specifically, is this algorithm guaranteed to succeed under certain conditions? Is it stable to slight perturbations in its input? 4) Lastly, can we leverage the answers to the above, and propose alternatives to CNN's forward pass?

URL: http://arxiv.org/abs/1607.08194

Notes:

Tweet2Vec: Learning Tweet Embeddings Using Character-level CNN-LSTM Encoder-Decoder

Authors: Soroush Vosoughi, Prashanth Vijayaraghavan, Deb Roy

Abstract: We present Tweet2Vec, a novel method for generating general-purpose vector representation of tweets. The model learns tweet embeddings using character-level CNN-LSTM encoder-decoder. We trained our model on 3 million, randomly selected English-language tweets. The model was evaluated using two methods: tweet semantic similarity and tweet sentiment categorization, outperforming the previous state-of-the-art in both tasks. The evaluations demonstrate the power of the tweet embeddings generated by our model for various tweet categorization tasks. The vector representations generated by our model are generic, and hence can be applied to a variety of tasks. Though the model presented in this paper is trained on English-language tweets, the method presented can be used to learn tweet embeddings for different languages.

URL: http://arxiv.org/abs/1607.07514

Notes:

Machine Learned Resume-Job Matching Solution

Authors: Yiou Lin, Hang Lei, Prince Clement Addo, Xiaoyu Li

Abstract: Job search through online matching engines nowadays are very prominent and beneficial to both job seekers and employers. But the solutions of traditional engines without understanding the semantic meanings of different resumes have not kept pace with the incredible changes in machine learning techniques and computing capability. These solutions are usually driven by manual rules and predefined weights of keywords which lead to an inefficient and frustrating search experience. To this end, we present a machine learned solution with rich features and deep learning methods. Our solution includes three configurable modules that can be plugged with little restrictions. Namely, unsupervised feature extraction, base classifiers training and ensemble method learning. In our solution, rather than using manual rules, machine learned methods to automatically detect the semantic similarity of positions are proposed. Then four competitive "shallow" estimators and "deep" estimators are selected. Finally, ensemble methods to bag these estimators and aggregate their individual predictions to form a final prediction are verified. Experimental results of over 47 thousand resumes show that our solution can significantly improve the predication precision current position, salary, educational background and company scale.

URL: http://arxiv.org/abs/1607.07657

Notes:

Automatic Attribute Discovery with Neural Activations

Authors: Sirion Vittayakorn, Takayuki Umeda, Kazuhiko Murasaki, Kyoko Sudo, Takayuki Okatani, Kota Yamaguchi

Abstract: How can a machine learn to recognize visual attributes emerging out of online community without a definitive supervised dataset? This paper proposes an automatic approach to discover and analyze visual attributes from a noisy collection of image-text data on the Web. Our approach is based on the relationship between attributes and neural activations in the deep network. We characterize the visual property of the attribute word as a divergence within weakly-annotated set of images. We show that the neural activations are useful for discovering and learning a classifier that well agrees with human perception from the noisy real-world Web data. The empirical study suggests the layered structure of the deep neural networks also gives us insights into the perceptual depth of the given word. Finally, we demonstrate that we can utilize highly-activating neurons for finding semantically relevant regions.

URL: http://arxiv.org/abs/1607.07262

Notes: Кажется, близко к тому, что пытался рассказать студент Сергея в воскресенье.

Deep nets for local manifold learning

Authors: Charles K. Chui, H. N. Mhaskar

Abstract: The problem of extending a function f defined on a training data C on an unknown manifold X to the entire manifold and a tubular neighborhood of this manifold is considered in this paper. For X embedded in a high dimensional ambient Euclidean space RD, a deep learning algorithm is developed for finding a local coordinate system for the manifold {\bf without eigen--decomposition}, which reduces the problem to the classical problem of function approximation on a low dimensional cube. Deep nets (or multilayered neural networks) are proposed to accomplish this approximation scheme by using the training data. Our methods do not involve such optimization techniques as back--propagation, while assuring optimal (a priori) error bounds on the output in terms of the number of derivatives of the target function. In addition, these methods are universal, in that they do not require a prior knowledge of the smoothness of the target function, but adjust the accuracy of approximation locally and automatically, depending only upon the local smoothness of the target function. Our ideas are easily extended to solve both the pre--image problem and the out--of--sample extension problem, with a priori bounds on the growth of the function thus extended.

URL: http://arxiv.org/abs/1607.07110

Notes: Близко к тому, о чем упоминал Егор при устройстве на стажировку.

CFGs-2-NLU: Sequence-to-Sequence Learning for Mapping Utterances to Semantics and Pragmatics

Authors: Adam James Summerville, James Ryan, Michael Mateas, Noah Wardrip-Fruin

Abstract: In this paper, we present a novel approach to natural language understanding that utilizes context-free grammars (CFGs) in conjunction with sequence-to-sequence (seq2seq) deep learning. Specifically, we take a CFG authored to generate dialogue for our target application for NLU, a videogame, and train a long short-term memory (LSTM) recurrent neural network (RNN) to map the surface utterances that it produces to traces of the grammatical expansions that yielded them. Critically, this CFG was authored using a tool we have developed that supports arbitrary annotation of the nonterminal symbols in the grammar. Because we already annotated the symbols in this grammar for the semantic and pragmatic considerations that our game's dialogue manager operates over, we can use the grammatical trace associated with any surface utterance to infer such information. During gameplay, we translate player utterances into grammatical traces (using our RNN), collect the mark-up attributed to the symbols included in that trace, and pass this information to the dialogue manager, which updates the conversation state accordingly. From an offline evaluation task, we demonstrate that our trained RNN translates surface utterances to grammatical traces with great accuracy. To our knowledge, this is the first usage of seq2seq learning for conversational agents (our game's characters) who explicitly reason over semantic and pragmatic considerations.

URL: http://arxiv.org/abs/1607.06852

Notes:

Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings

Authors: Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai

Abstract: The blind application of machine learning runs the risk of amplifying biases present in data. Such a danger is facing us with word embedding, a popular framework to represent text data as vectors which has been used in many machine learning and natural language processing tasks. We show that even word embeddings trained on Google News articles exhibit female/male gender stereotypes to a disturbing extent. This raises concerns because their widespread use, as we describe, often tends to amplify these biases. Geometrically, gender bias is first shown to be captured by a direction in the word embedding. Second, gender neutral words are shown to be linearly separable from gender definition words in the word embedding. Using these properties, we provide a methodology for modifying an embedding to remove gender stereotypes, such as the association between between the words receptionist and female, while maintaining desired associations such as between the words queen and female. We define metrics to quantify both direct and indirect gender biases in embeddings, and develop algorithms to "debias" the embedding. Using crowd-worker evaluation as well as standard benchmarks, we empirically demonstrate that our algorithms significantly reduce gender bias in embeddings while preserving the its useful properties such as the ability to cluster related concepts and to solve analogy tasks. The resulting embeddings can be used in applications without amplifying gender bias.

URL: http://arxiv.org/abs/1607.06520

Notes:

Layer Normalization

Authors: Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton

Abstract: Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case. This significantly reduces the training time in feed-forward neural networks. However, the effect of batch normalization is dependent on the mini-batch size and it is not obvious how to apply it to recurrent neural networks. In this paper, we transpose batch normalization into layer normalization by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case. Like batch normalization, we also give each neuron its own adaptive bias and gain which are applied after the normalization but before the non-linearity. Unlike batch normalization, layer normalization performs exactly the same computation at training and test times. It is also straightforward to apply to recurrent neural networks by computing the normalization statistics separately at each time step. Layer normalization is very effective at stabilizing the hidden state dynamics in recurrent networks. Empirically, we show that layer normalization can substantially reduce the training time compared with previously published techniques.

URL: http://arxiv.org/abs/1607.06450

Notes:

Sequence to sequence learning for unconstrained scene text recognition

Authors: Ahmed Mamdouh A. Hassanien

Abstract: In this work we present a state-of-the-art approach for unconstrained natural scene text recognition. We propose a cascade approach that incorporates a convolutional neural network (CNN) architecture followed by a long short term memory model (LSTM). The CNN learns visual features for the characters and uses them with a softmax layer to detect sequence of characters. While the CNN gives very good recognition results, it does not model relation between characters, hence gives rise to false positive and false negative cases (confusing characters due to visual similarities like "g" and "9", or confusing background patches with characters; either removing existing characters or adding non-existing ones) To alleviate these problems we leverage recent developments in LSTM architectures to encode contextual information. We show that the LSTM can dramatically reduce such errors and achieve state-of-the-art accuracy in the task of unconstrained natural scene text recognition. Moreover we manually remove all occurrences of the words that exist in the test set from our training set to test whether our approach will generalize to unseen data. We use the ICDAR 13 test set for evaluation and compare the results with the state of the art approaches [11, 18]. We finally present an application of the work in the domain of for traffic monitoring.

URL: http://arxiv.org/abs/1607.06125

Notes:

Dataset and Neural Recurrent Sequence Labeling Model for Open-Domain Factoid Question Answering

Authors: Peng Li, Wei Li, Zhengyan He, Xuguang Wang, Ying Cao, Jie Zhou, Wei Xu

Abstract: While question answering (QA) with neural network, i.e. neural QA, has achieved promising results in recent years, lacking of large scale real-word QA dataset is still a challenge for developing and evaluating neural QA system. To alleviate this problem, we propose a large scale human annotated real-world QA dataset WebQA with more than 42k questions and 556k evidences. As existing neural QA methods resolve QA either as sequence generation or classification/ranking problem, they face challenges of expensive softmax computation, unseen answers handling or separate candidate answer generation component. In this work, we cast neural QA as a sequence labeling problem and propose an end-to-end sequence labeling model, which overcomes all the above challenges. Experimental results on WebQA show that our model outperforms the baselines significantly with an F1 score of 74.69% with word-based input, and the performance drops only 3.72 F1 points with more challenging character-based input.

URL: http://arxiv.org/abs/1607.06275

Notes:

Compositional Sequence Labeling Models for Error Detection in Learner Writing

Authors: Marek Rei, Helen Yannakoudakis

Abstract: In this paper, we present the first experiments using neural network models for the task of error detection in learner writing. We perform a systematic comparison of alternative compositional architectures and propose a framework for error detection based on bidirectional LSTMs. Experiments on the CoNLL-14 shared task dataset show the model is able to outperform other participants on detecting errors in learner writing. Finally, the model is integrated with a publicly deployed self-assessment system, leading to performance comparable to human annotators.

URL: http://arxiv.org/abs/1607.06153

Notes:

Training Skinny Deep Neural Networks with Iterative Hard Thresholding Methods

Authors: Xiaojie Jin, Xiaotong Yuan, Jiashi Feng, Shuicheng Yan

Abstract: Deep neural networks have achieved remarkable success in a wide range of practical problems. However, due to the inherent large parameter space, deep models are notoriously prone to overfitting and difficult to be deployed in portable devices with limited memory. In this paper, we propose an iterative hard thresholding (IHT) approach to train Skinny Deep Neural Networks (SDNNs). An SDNN has much fewer parameters yet can achieve competitive or even better performance than its full CNN counterpart. More concretely, the IHT approach trains an SDNN through following two alternative phases: (I) perform hard thresholding to drop connections with small activations and fine-tune the other significant filters; (II)~re-activate the frozen connections and train the entire network to improve its overall discriminative capability. We verify the superiority of SDNNs in terms of efficiency and classification performance on four benchmark object recognition datasets, including CIFAR-10, CIFAR-100, MNIST and ImageNet. Experimental results clearly demonstrate that IHT can be applied for training SDNN based on various CNN architectures such as NIN and AlexNet.

URL: http://arxiv.org/abs/1607.05423

Notes: Интересный подход к тренировке "тонких" сетей

Stochastic Backpropagation through Mixture Density Distributions

Authors: Alex Graves

Abstract: The ability to backpropagate stochastic gradients through continuous latent distributions has been crucial to the emergence of variational autoencoders and stochastic gradient variational Bayes. The key ingredient is an unbiased and low-variance way of estimating gradients with respect to distribution parameters from gradients evaluated at distribution samples. The "reparameterization trick" provides a class of transforms yielding such estimators for many continuous distributions, including the Gaussian and other members of the location-scale family. However the trick does not readily extend to mixture density models, due to the difficulty of reparameterizing the discrete distribution over mixture weights. This report describes an alternative transform, applicable to any continuous multivariate distribution with a differentiable density function from which samples can be drawn, and uses it to derive an unbiased estimator for mixture density weight derivatives. Combined with the reparameterization trick applied to the individual mixture components, this estimator makes it straightforward to train variational autoencoders with mixture-distributed latent variables, or to perform stochastic variational inference with a mixture density variational posterior.

URL: http://arxiv.org/abs/1607.05690

Notes: Стохастический граниент от Алекса Грейвза

Neural Contextual Conversation Learning with Labeled Question-Answering Pairs

Authors: Kun Xiong, Anqi Cui, Zefeng Zhang, Ming Li

Abstract: Neural conversational models tend to produce generic or safe responses in different contexts, e.g., reply \textit{"Of course"} to narrative statements or \textit{"I don't know"} to questions. In this paper, we propose an end-to-end approach to avoid such problem in neural generative models. Additional memory mechanisms have been introduced to standard sequence-to-sequence (seq2seq) models, so that context can be considered while generating sentences. Three seq2seq models, which memorize a fix-sized contextual vector from hidden input, hidden input/output and a gated contextual attention structure respectively, have been trained and tested on a dataset of labeled question-answering pairs in Chinese. The model with contextual attention outperforms others including the state-of-the-art seq2seq models on perplexity test. The novel contextual model generates diverse and robust responses, and is able to carry out conversations on a wide range of topics appropriately.

URL: http://arxiv.org/abs/1607.05809

Notes:

Constructing a Natural Language Inference Dataset using Generative Neural Networks

Authors: Janez Starc, Dunja Mladenić

Abstract: Natural Language Inference is an important task for Natural Language Understanding. It is concerned with classifying the logical relation between two sentences. In this paper, we propose several text generative neural networks for constructing Natural Language Inference datasets suitable for training classifiers. To evaluate the models, we propose a new metric - the accuracy of the classifier trained on the generated dataset. The accuracy obtained with our best generative model is only 2.7% lower than the accuracy of the classifier trained on the original, manually constructed dataset. The model learns a mapping embedding for each training example. By comparing various metrics we show that datasets that obtain higher ROUGE or METEOR scores do not necessarily yield higher classification accuracies. We also provide analysis of what are the characteristics of a good dataset including the distinguishability of the generated datasets from the original one.

URL: http://arxiv.org/abs/1607.06025

Notes:

An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation

Authors: Jey Han Lau, Timothy Baldwin

Abstract: Recently, Le and Mikolov (2014) proposed doc2vec as an extension to word2vec (Mikolov et al., 2013a) to learn document-level embeddings. Despite promising results in the original paper, others have struggled to reproduce those results. This paper presents a rigorous empirical evaluation of doc2vec over two tasks. We compare doc2vec to two baselines and two state-of-the-art document embedding methodologies. We found that doc2vec performs robustly when using models trained on large external corpora, and can be further improved by using pre-trained word embeddings. We also provide recommendations on hyper-parameter settings for general purpose applications, and release source code to induce document embeddings using our trained doc2vec models.

URL: http://arxiv.org/abs/1607.05368

Notes:

Neural Machine Translation with Recurrent Attention Modeling

Authors: Zichao Yang, Zhiting Hu, Yuntian Deng, Chris Dyer, Alex Smola

Abstract: Knowing which words have been attended to in previous time steps while generating a translation is a rich source of information for predicting what words will be attended to in the future. We improve upon the attention model of Bahdanau et al. (2014) by explicitly modeling the relationship between previous and subsequent attention levels for each word using one recurrent network per input word. This architecture easily captures informative features, such as fertility and regularities in relative distortion. In experiments, we show our parameterization of attention improves translation quality.

URL: http://arxiv.org/abs/1607.05108

Notes:

An Empirical Evaluation of various Deep Learning Architectures for Bi-Sequence Classification Tasks

Authors: Anirban Laha, Vikas Raykar

Abstract: Several tasks in argumentation mining and debating, question-answering, and natural language inference involve categorizing a sentence in the context of another sentence (referred as bi-sequence classification). For several single sequence classification tasks, the current state-of-the-art approaches are based on recurrent and convolutional neural networks. On the other hand, for bi-sequence classification problems, there is not much understanding as to the best deep learning architecture. In this paper, we attempt to get an understanding of this category of problems by extensive empirical evaluation of 19 different deep learning architectures (specifically on different ways of handling context) for various problems originating in natural language processing like debating, textual entailment and question-answering. Following the empirical evaluation, we offer our insights and conclusions regarding the architectures we have considered. We also establish the first deep learning baselines for three argumentation mining tasks.

URL: http://arxiv.org/abs/1607.04853

Notes:

DSD: Regularizing Deep Neural Networks with Dense-Sparse-Dense Training Flow

Authors: Song Han, Jeff Pool, Sharan Narang, Huizi Mao, Shijian Tang, Erich Elsen, Bryan Catanzaro, John Tran, William J. Dally

Abstract: Abstracet: Modern deep neural networks have a large number of parameters, making them very powerful machine learning systems. A critical issue for training such large networks on large-scale data-sets is to prevent overfitting while at the same time providing enough model capacity. We propose DSD, a dense-sparse-dense training flow, for regularizing deep neural networks. In the first D step, we train a dense network to learn which connections are important. In the S step, we regularize the network by pruning the unimportant connections and retrain the network given the sparsity constraint. In the final D step, we increase the model capacity by freeing the sparsity constraint, re-initializing the pruned parameters, and retraining the whole dense network. Experiments show that DSD training can improve the performance of a wide range of CNN, RNN and LSTMs on the tasks of image classification, caption generation and speech recognition. On the Imagenet dataset, DSD improved the absolute accuracy of AlexNet, GoogleNet, VGG-16, ResNet-50, ResNet-152 and SqueezeNet by a geo-mean of 2.1 points(Top-1) and 1.4 points(Top-5). On the WSJ'92 and WSJ'93 dataset, DSD improved DeepSpeech-2 WER by 0.53 and 1.08 points. On the Flickr-8K dataset, DSD improved the NeuralTalk BLEU score by 2.0 points. DSD training flow produces the same model architecture and doesn't incur any inference overhead.

URL: http://arxiv.org/abs/1607.04381

Notes:

Neural Semantic Encoders

Authors: Tsendsuren Munkhdalai, Hong Yu

Abstract: We present a memory augmented neural network for natural language understanding: Neural Semantic Encoders (NSE). NSE has a variable sized encoding memory that evolves over time and maintains the understanding of input sequences through read, compose and write operations. NSE can access multiple and shared memories depending on the complexity of a task. We demonstrated the effectiveness and the flexibility of NSE on five different natural language tasks, natural language inference, question answering, sentence classification, document sentiment analysis and machine translation where NSE achieved state-of-the-art performance when evaluated on publically available benchmarks. For example, our shared-memory model showed an encouraging result on neural machine translation, improving an attention-based baseline by approximately 1.0 BLEU.

URL: http://arxiv.org/abs/1607.04315

Notes:

Neural Discourse Modeling of Conversations

Authors: John M. Pierre, Mark Butler, Jacob Portnoff, Luis Aguilar

Abstract: Deep neural networks have shown recent promise in many language-related tasks such as the modeling of conversations. We extend RNN-based sequence to sequence models to capture the long range discourse across many turns of conversation. We perform a sensitivity analysis on how much additional context affects performance, and provide quantitative and qualitative evidence that these models are able to capture discourse relationships across multiple utterances. Our results quantifies how adding an additional RNN layer for modeling discourse improves the quality of output utterances and providing more of the previous conversation as input also improves performance. By searching the generated outputs for specific discourse markers we show how neural discourse models can exhibit increased coherence and cohesion in conversations.

URL: http://arxiv.org/abs/1607.04576

Notes:

Neural Tree Indexers for Text Understanding

Authors: Tsendsuren Munkhdalai, Hong Yu

Abstract: Neural networks with recurrent or recursive architecture have shown promising results on various natural language processing (NLP) tasks. The recurrent and recursive architectures have their own strength and limitations. The recurrent networks process input text sequentially and model the conditional transition between word tokens. In contrast, the recursive networks explicitly model the compositionality and the recursive structure of natural language. Current recursive architecture is based on syntactic tree, thus limiting its practical applicability in different NLP applications. In this paper, we introduce a class of tree structured model, Neural Tree Indexers (NTI) that provides a middle ground between the sequential RNNs and the syntactic tree-based recursive models. NTI constructs a full n-ary tree by processing the input text with its node function in a bottom-up fashion. Attention mechanism can then be applied to both structure and different forms of node function. We demonstrated the effectiveness and the flexibility of a binary-tree model of NTI, showing the model achieved the state-of-the-art performance on three different NLP tasks: natural language inference, answer sentence selection, and sentence classification.

URL: http://arxiv.org/abs/1607.04492

Notes:

Attention-over-Attention Neural Networks for Reading Comprehension

Authors: Yiming Cui, Zhipeng Chen, Si Wei, Shijin Wang, Ting Liu, Guoping Hu

Abstract: Cloze-style queries are representative problems in reading comprehension. Over the past few months, we have seen much progress that utilizing neural network approach to solve Cloze-style questions. In this paper, we present a novel model called attention-over-attention reader for the Cloze-style reading comprehension task. Our model aims to place another attention mechanism over the document-level attention, and induces "attended attention" for final predictions. Unlike the previous works, our neural network model requires less pre-defined hyper-parameters and uses an elegant architecture for modeling. Experimental results show that the proposed attention-over-attention model significantly outperforms various state-of-the-art systems by a large margin in public datasets, such as CNN and Children's Book Test datasets.

URL: http://arxiv.org/abs/1607.04423

Notes:

Enriching Word Vectors with Subword Information

Authors: Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomas Mikolov

Abstract: Continuous word representations, trained on large unlabeled corpora are useful for many natural language processing tasks. Many popular models to learn such representations ignore the morphology of words, by assigning a distinct vector to each word. This is a limitation, especially for morphologically rich languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skip-gram model, where each word is represented as a bag of character n-grams. A vector representation is associated to each character n-gram, words being represented as the sum of these representations. Our method is fast, allowing to train models on large corpus quickly. We evaluate the obtained word representations on five different languages, on word similarity and analogy tasks.

URL: http://arxiv.org/abs/1607.04606

Notes: Mikolov's paper on fasttext: word2vec + sum of vectors from ngram2vec

2016-08

A Neural Knowledge Language Model

Authors: Sungjin Ahn, Heeyoul Choi, Tanel Pärnamaa, Yoshua Bengio

Abstract: Communicating knowledge is a primary purpose of language. However, current language models have significant limitations in their ability to encode or decode knowledge. This is mainly because they acquire knowledge based on statistical co-occurrences, even if most of the knowledge words are rarely observed named entities. In this paper, we propose a Neural Knowledge Language Model (NKLM) which combines symbolic knowledge provided by knowledge graphs with RNN language models. At each time step, the model predicts a fact on which the observed word is supposed to be based. Then, a word is either generated from the vocabulary or copied from the knowledge graph. We train and test the model on a new dataset, WikiFacts. In experiments, we show that the NKLM significantly improves the perplexity while generating a much smaller number of unknown words. In addition, we demonstrate that the sampled descriptions include named entities which were used to be the unknown words in RNN language models.

URL: http://arxiv.org/abs/1608.00318

Notes:

Supervised Attentions for Neural Machine Translation

Authors: Haitao Mi, Zhiguo Wang, Abe Ittycheriah

Abstract: In this paper, we improve the attention or alignment accuracy of neural machine translation by utilizing the alignments of training sentence pairs. We simply compute the distance between the machine attentions and the "true" alignments, and minimize this cost in the training procedure. Our experiments on large-scale Chinese-to-English task show that our model improves both translation and alignment qualities significantly over the large-vocabulary neural machine translation system, and even beats a state-of-the-art traditional syntax-based system.

URL: http://arxiv.org/abs/1608.00112

Notes:

Learning Semantically Coherent and Reusable Kernels in Convolution Neural Nets for Sentence Classification

Authors: Madhusudan Lakshmana, Sundararajan Sellamanickam, Shirish Shevade, Keerthi Selvaraj

Abstract: The state-of-the-art CNN models give good performance on sentence classification tasks. The purpose of this work is to empirically study desirable properties such as semantic coherence, attention mechanism and reusability of CNNs in these tasks. Semantically coherent kernels are preferable as they are a lot more interpretable for explaining the decision of the learned CNN model. We observe that the learned kernels do not have semantic coherence. Motivated by this observation, we propose to learn kernels with semantic coherence using clustering scheme combined with Word2Vec representation and domain knowledge such as SentiWordNet. We suggest a technique to visualize attention mechanism of CNNs for decision explanation purpose. Reusable property enables kernels learned on one problem to be used in another problem. This helps in efficient learning as only a few additional domain specific filters may have to be learned. We demonstrate the efficacy of our core ideas of learning semantically coherent kernels and leveraging reusable kernels for efficient learning on several benchmark datasets. Experimental results show the usefulness of our approach by achieving performance close to the state-of-the-art methods but with semantic and reusable properties.

URL: http://arxiv.org/abs/1608.00466

Notes:

Hyperparameter Transfer Learning through Surrogate Alignment for Efficient Deep Neural Network Training

Authors: Ilija Ilievski, Jiashi Feng

Abstract: Recently, several optimization methods have been successfully applied to the hyperparameter optimization of deep neural networks (DNNs). The methods work by modeling the joint distribution of hyperparameter values and corresponding error. Those methods become less practical when applied to modern DNNs whose training may take a few days and thus one cannot collect sufficient observations to accurately model the distribution. To address this challenging issue, we propose a method that learns to transfer optimal hyperparameter values for a small source dataset to hyperparameter values with comparable performance on a dataset of interest. As opposed to existing transfer learning methods, our proposed method does not use hand-designed features. Instead, it uses surrogates to model the hyperparameter-error distributions of the two datasets and trains a neural network to learn the transfer function. Extensive experiments on three CV benchmark datasets clearly demonstrate the efficiency of our method.

URL: http://arxiv.org/abs/1608.00218

Notes:

Modeling Context in Referring Expressions

Authors: Licheng Yu, Patric Poirson, Shan Yang, Alexander C. Berg, Tamara L. Berg

Abstract: Humans refer to objects in their environments all the time, especially in dialogue with other people. We explore generating and comprehending natural language referring expressions for objects in images. In particular, we focus on incorporating better measures of visual context into referring expression models and find that visual comparison to other objects within an image helps improve performance significantly. We also develop methods to tie the language generation process together, so that we generate expressions for all objects of a particular category jointly. Evaluation on three recent datasets - RefCOCO, RefCOCO+, and RefCOCOg, shows the advantages of our methods for both referring expression generation and comprehension.

URL: http://arxiv.org/abs/1608.00272

Notes:

Visual Relationship Detection with Language Priors

Authors: Cewu Lu, Ranjay Krishna, Michael Bernstein, Li Fei-Fei

Abstract: Visual relationships capture a wide variety of interactions between pairs of objects in images (e.g. "man riding bicycle" and "man pushing bicycle"). Consequently, the set of possible relationships is extremely large and it is difficult to obtain sufficient training examples for all possible relationships. Because of this limitation, previous work on visual relationship detection has concentrated on predicting only a handful of relationships. Though most relationships are infrequent, their objects (e.g. "man" and "bicycle") and predicates (e.g. "riding" and "pushing") independently occur more frequently. We propose a model that uses this insight to train visual models for objects and predicates individually and later combines them together to predict multiple relationships per image. We improve on prior work by leveraging language priors from semantic word embeddings to finetune the likelihood of a predicted relationship. Our model can scale to predict thousands of types of relationships from a few examples. Additionally, we localize the objects in the predicted relationships as bounding boxes in the image. We further demonstrate that understanding relationships can improve content based image retrieval.

URL: http://arxiv.org/abs/1608.00187

Notes:

Top-down Neural Attention by Excitation Backprop

Authors: Jianming Zhang, Zhe Lin, Jonathan Brandt, Xiaohui Shen, Stan Sclaroff

Abstract: We aim to model the top-down attention of a Convolutional Neural Network (CNN) classifier for generating task-specific attention maps. Inspired by a top-down human visual attention model, we propose a new backpropagation scheme, called Excitation Backprop, to pass along top-down signals downwards in the network hierarchy via a probabilistic Winner-Take-All process. Furthermore, we introduce the concept of contrastive attention to make the top-down attention maps more discriminative. In experiments, we demonstrate the accuracy and generalizability of our method in weakly supervised localization tasks on the MS COCO, PASCAL VOC07 and ImageNet datasets. The usefulness of our method is further validated in the text-to-region association task. On the Flickr30k Entities dataset, we achieve promising performance in phrase localization by leveraging the top-down attention of a CNN model that has been trained on weakly labeled web images.

URL: http://arxiv.org/abs/1608.00507

Notes:

Knowledge Distillation for Small-footprint Highway Networks

Authors: Liang Lu, Michelle Guo, Steve Renals

Abstract: Deep learning has significantly advanced state-of-the-art of speech recognition in the past few years. However, compared to conventional Gaussian mixture acoustic models, neural network models are usually much larger, and are therefore not very deployable in embedded devices. Previously, we investigated a compact highway deep neural network (HDNN) for acoustic modelling, which is a type of depth-gated feedforward neural network. We have shown that HDNN-based acoustic models can achieve comparable recognition accuracy with much smaller number of model parameters compared to plain deep neural network (DNN) acoustic models. In this paper, we push the boundary further by leveraging on the knowledge distillation technique that is also known as {\it teacher-student} training, i.e., we train the compact HDNN model with the supervision of a high accuracy cumbersome model. Furthermore, we also investigate sequence training and adaptation in the context of teacher-student training. Our experiments were performed on the AMI meeting speech recognition corpus. With this technique, we significantly improved the recognition accuracy of the HDNN acoustic model with less than 0.8 million parameters, and narrowed the gap between this model and the plain DNN with 30 million parameters.

URL: http://arxiv.org/abs/1608.00892

Notes:

Structured prediction models for RNN based sequence labeling in clinical text

Authors: Abhyuday Jagannatha, Hong Yu

Abstract: Sequence labeling is a widely used method for named entity recognition and information extraction from unstructured natural language data. In clinical domain one major application of sequence labeling involves extraction of medical entities such as medication, indication, and side-effects from Electronic Health Record narratives. Sequence labeling in this domain, presents its own set of challenges and objectives. In this work we experimented with various CRF based structured learning models with Recurrent Neural Networks. We extend the previously studied LSTM-CRF models with explicit modeling of pairwise potentials. We also propose an approximate version of skip-chain CRF inference with RNN potentials. We use these methodologies for structured prediction in order to improve the exact phrase detection of various medical entities.

URL: http://arxiv.org/abs/1608.00612

Notes:

Semantic Representations of Word Senses and Concepts

Authors: José Camacho-Collados, Ignacio Iacobacci, Roberto Navigli, Mohammad Taher Pilehvar

Abstract: Representing the semantics of linguistic items in a machine-interpretable form has been a major goal of Natural Language Processing since its earliest days. Among the range of different linguistic items, words have attracted the most research attention. However, word representations have an important limitation: they conflate different meanings of a word into a single vector. Representations of word senses have the potential to overcome this inherent limitation. Indeed, the representation of individual word senses and concepts has recently gained in popularity with several experimental results showing that a considerable performance improvement can be achieved across different NLP applications upon moving from word level to the deeper sense and concept levels. Another interesting point regarding the representation of concepts and word senses is that these models can be seamlessly applied to other linguistic items, such as words, phrases and sentences.

URL: http://arxiv.org/abs/1608.00841

Notes:

Learning Online Alignments with Continuous Rewards Policy Gradient

Authors: Yuping Luo, Chung-Cheng Chiu, Navdeep Jaitly, Ilya Sutskever

Abstract: Sequence-to-sequence models with soft attention had significant success in machine translation, speech recognition, and question answering. Though capable and easy to use, they require that the entirety of the input sequence is available at the beginning of inference, an assumption that is not valid for instantaneous translation and speech recognition. To address this problem, we present a new method for solving sequence-to-sequence problems using hard online alignments instead of soft offline alignments. The online alignments model is able to start producing outputs without the need to first process the entire input sequence. A highly accurate online sequence-to-sequence model is useful because it can be used to build an accurate voice-based instantaneous translator. Our model uses hard binary stochastic decisions to select the timesteps at which outputs will be produced. The model is trained to produce these stochastic decisions using a standard policy gradient method. In our experiments, we show that this model achieves encouraging performance on TIMIT and Wall Street Journal (WSJ) speech recognition datasets.

URL: http://arxiv.org/abs/1608.01281

Notes:

Morphological Priors for Probabilistic Neural Word Embeddings

Authors: Parminder Bhatia, Robert Guthrie, Jacob Eisenstein

Abstract: Word embeddings allow natural language processing systems to share statistical information across related words. These embeddings are typically based on distributional statistics, making it difficult for them to generalize to rare or unseen words. We propose to improve word embeddings by incorporating morphological information, capturing shared sub-word features. Unlike previous work that constructs word embeddings directly from morphemes, we combine morphological and distributional information in a unified probabilistic framework, in which the word embedding is a latent variable. The morphological information provides a prior distribution on the latent word embeddings, which in turn condition a likelihood function over an observed corpus. This approach yields improvements on intrinsic word similarity evaluations, and also in the downstream task of part-of-speech tagging.

URL: http://arxiv.org/abs/1608.01056

Notes:

Residual Networks of Residual Networks: Multilevel Residual Networks

Authors: Ke Zhang, Miao Sun, Tony X. Han, Xingfang Yuan, Liru Guo, Tao Liu

Abstract: Residual networks family with hundreds or even thousands of layers dominate major image recognition tasks, but building a network by simply stacking residual blocks inevitably limits its optimization ability. This paper proposes a novel residual-network architecture, Residual networks of Residual networks (RoR), to dig the optimization ability of residual networks. RoR substitutes optimizing residual mapping of residual mapping for optimizing original residual mapping, in particular, adding level-wise shortcut connections upon original residual networks, to promote the learning capability of residual networks. More importantly, RoR can be applied to various kinds of residual networks (Pre-ResNets and WRN) and significantly boost their performance. Our experiments demonstrate the effectiveness and versatility of RoR, where it achieves the best performance in all residual-network-like structures. Our RoR-3-WRN58-4 models achieve new state-of-the-art results on CIFAR-10, CIFAR-100 and SVHN, with test errors 3.77%, 19.73% and 1.59% respectively. These results outperform 1001-layer Pre-ResNets by 18.4% on CIFAR-10 and 13.1% on CIFAR-100.

URL: http://arxiv.org/abs/1608.02908

Notes:

Temporal Attention Model for Neural Machine Translation

Authors: Baskaran Sankaran, Haitao Mi, Yaser Al-Onaizan, Abe Ittycheriah

Abstract: Attention-based Neural Machine Translation (NMT) models suffer from attention deficiency issues as has been observed in recent research. We propose a novel mechanism to address some of these limitations and improve the NMT attention. Specifically, our approach memorizes the alignments temporally (within each sentence) and modulates the attention with the accumulated temporal memory, as the decoder generates the candidate translation. We compare our approach against the baseline NMT model and two other related approaches that address this issue either explicitly or implicitly. Large-scale experiments on two language pairs show that our approach achieves better and robust gains over the baseline and related NMT approaches. Our model further outperforms strong SMT baselines in some settings even without using ensembles.

URL: http://arxiv.org/abs/1608.02927

Notes:

Syntactically Informed Text Compression with Recurrent Neural Networks

Authors: David Cox

Abstract: We present a self-contained system for constructing natural language models for use in text compression. Our system improves upon previous neural network based models by utilizing recent advances in syntactic parsing — Google's SyntaxNet — to augment character-level recurrent neural networks. RNNs have proven exceptional in modeling sequence data such as text, as their architecture allows for modeling of long-term contextual information. Modeling and coding are the backbone of modern compression schemes. While coding is considered a solved problem, generating effective, domain-specific models remains a critical step in the process of improving compression ratios.

URL: http://arxiv.org/abs/1608.02893

Notes:

A deep language model for software code

Authors: Hoa Khanh Dam, Truyen Tran, Trang Pham

Abstract: Existing language models such as n-grams for software code often fail to capture a long context where dependent code elements scatter far apart. In this paper, we propose a novel approach to build a language model for software code to address this particular issue. Our language model, partly inspired by human memory, is built upon the powerful deep learning-based Long Short Term Memory architecture that is capable of learning long-term dependencies which occur frequently in software code. Results from our intrinsic evaluation on a corpus of Java projects have demonstrated the effectiveness of our language model. This work contributes to realizing our vision for DeepSoft, an end-to-end, generic deep learning-based framework for modeling software and its development process.

URL: http://arxiv.org/abs/1608.02715

Notes:

Multi-task Multi-domain Representation Learning for Sequence Tagging

Authors: Nanyun Peng, Mark Dredze

Abstract: Representation learning with deep models have demonstrated success in a range of NLP. In this paper we consider its use in a multi-task multi-domain setting for sequence tagging by proposing a unified framework for learning across tasks and domains. Our model learns robust representations that yield better performance in this setting. We use shared CRFs and domain projections to allow the model to learn domain specific representations that can feed a single task specific CRF. We evaluate our model on two tasks — Chinese word segmentation and named entity recognition — and two domains — news and social media — and achieve state-of-the-art results for both social media tasks.

URL: http://arxiv.org/abs/1608.02689

Notes:

Residual CNDS

Authors: Hussein A. Al-Barazanchi, Hussam Qassim, Abhishek Verma

Abstract: Convolutional Neural networks nowadays are of tremendous importance for any image classification system. One of the most investigated methods to increase the accuracy of CNN is by increasing the depth of CNN. Increasing the depth by stacking more layers also increases the difficulty of training besides making it computationally expensive. Some research found that adding auxiliary forks after intermediate layers increases the accuracy. Specifying which intermediate layer shoud have the fork just addressed recently. Where a simple rule were used to detect the position of intermediate layers that needs the auxiliary supervision fork. This technique known as convolutional neural networks with deep supervision (CNDS). This technique enhanced the accuracy of classification over the straight forward CNN used on the MIT places dataset and ImageNet. In the other side, Residual Learning is another technique emerged recently to ease the training of very deep CNN. Residual Learning framwork changed the learning of layers from unreferenced functions to learning residual function with regard to the layer's input. Residual Learning achieved state of arts results on ImageNet 2015 and COCO competitions. In this paper, we study the effect of adding residual connections to CNDS network. Our experiments results show increasing of accuracy over using CNDS only.

URL: http://arxiv.org/abs/1608.02201

Notes:

Online Adaptation of Deep Architectures with Reinforcement Learning

Authors: Thushan Ganegedara, Lionel Ott, Fabio Ramos

Abstract: Online learning has become crucial to many problems in machine learning. As more data is collected sequentially, quickly adapting to changes in the data distribution can offer several competitive advantages such as avoiding loss of prior knowledge and more efficient learning. However, adaptation to changes in the data distribution (also known as covariate shift) needs to be performed without compromising past knowledge already built in into the model to cope with voluminous and dynamic data. In this paper, we propose an online stacked Denoising Autoencoder whose structure is adapted through reinforcement learning. Our algorithm forces the network to exploit and explore favourable architectures employing an estimated utility function that maximises the accuracy of an unseen validation sequence. Different actions, such as Pool, Increment and Merge are available to modify the structure of the network. As we observe through a series of experiments, our approach is more responsive, robust, and principled than its counterparts for non-stationary as well as stationary data distributions. Experimental results indicate that our algorithm performs better at preserving gained prior knowledge and responding to changes in the data distribution.

URL: http://arxiv.org/abs/1608.02292

Notes: По описанию близко к теме neural evolution

Bootstrapping Face Detection with Hard Negative Examples

Authors: Shaohua Wan, Zhijun Chen, Tao Zhang, Bo Zhang, Kong-kat Wong

Abstract: Recently significant performance improvement in face detection was made possible by deeply trained convolutional networks. In this report, a novel approach for training state-of-the-art face detector is described. The key is to exploit the idea of hard negative mining and iteratively update the Faster R-CNN based face detector with the hard negatives harvested from a large set of background examples. We demonstrate that our face detector outperforms state-of-the-art detectors on the FDDB dataset, which is the de facto standard for evaluating face detection algorithms.

URL: http://arxiv.org/abs/1608.02236

Notes: Может быть интересно для RL в диалогах, сильные отрицательные примеры нужно сформулировать для диалогов

Multi-Model Hypothesize-and-Verify Approach for Incremental Loop Closure Verification

Authors: Kanji Tanaka

Abstract: Loop closure detection, which is the task of identifying locations revisited by a robot in a sequence of odometry and perceptual observations, is typically formulated as a visual place recognition (VPR) task. However, even state-of-the-art VPR techniques generate a considerable number of false positives as a result of confusing visual features and perceptual aliasing. In this paper, we propose a robust incremental framework for loop closure detection, termed incremental loop closure verification. Our approach reformulates the problem of loop closure detection as an instance of a multi-model hypothesize-and-verify framework, in which multiple loop closure hypotheses are generated and verified in terms of the consistency between loop closure hypotheses and VPR constraints at multiple viewpoints along the robot's trajectory. Furthermore, we consider the general incremental setting of loop closure detection, in which the system must update both the set of VPR constraints and that of loop closure hypotheses when new constraints or hypotheses arrive during robot navigation. Experimental results using a stereo SLAM system and DCNN features and visual odometry validate effectiveness of the proposed approach.

URL: http://arxiv.org/abs/1608.02052

Notes: Формулирование гипотезы и ее проверка - вроде бы близко к идеям Миши

Towards Representation Learning with Tractable Probabilistic Models

Authors: Antonio Vergari, Nicola Di Mauro, Floriana Esposito

Abstract: Probabilistic models learned as density estimators can be exploited in representation learning beside being toolboxes used to answer inference queries only. However, how to extract useful representations highly depends on the particular model involved. We argue that tractable inference, i.e. inference that can be computed in polynomial time, can enable general schemes to extract features from black box models. We plan to investigate how Tractable Probabilistic Models (TPMs) can be exploited to generate embeddings by random query evaluations. We devise two experimental designs to assess and compare different TPMs as feature extractors in an unsupervised representation learning framework. We show some experimental results on standard image datasets by applying such a method to Sum-Product Networks and Mixture of Trees as tractable models generating embeddings.

URL: http://arxiv.org/abs/1608.02341

Notes:

Bi-directional Attention with Agreement for Dependency Parsing

Authors: Hao Cheng, Hao Fang, Xiaodong He, Jianfeng Gao, Li Deng

Abstract: We develop a novel bi-directional attention model for dependency parsing, which learns to agree on headword predictions from the forward and backward parsing directions. The parsing procedure for each direction is formulated as sequentially querying the memory component that stores continuous headword embeddings. The proposed parser makes use of soft headword embeddings, allowing the model to implicitly capture high-order parsing history without dramatically increasing the computational complexity. We conduct experiments on English, Chinese, and 12 other languages from the CoNLL 2006 shared task, showing that the proposed model achieves state-of-the-art unlabeled attachment scores on 7 languages.

URL: http://arxiv.org/abs/1608.02076

Notes:

Robsut Wrod Reocginiton via semi-Character Recurrent Neural Network

Authors: Keisuke Sakaguchi, Kevin Duh, Matt Post, Benjamin Van Durme

Abstract: The Cmabrigde Uinervtisy (Cambridge University) effect from the psycholinguistics literature has demonstrated a robust word processing mechanism in humans, where jumbled words (e.g. Cmabrigde / Cambridge) are recognized with little cost. Inspired by the findings from the Cmabrigde Uinervtisy effect, we propose a word recognition model based on a semi-character level recursive neural network (scRNN). In our experiments, we demonstrate that scRNN has significantly more robust performance in word spelling correction (i.e. word recognition) compared to existing spelling checkers. Furthermore, we demonstrate that the model is cognitively plausible by replicating a psycholinguistics experiment about human reading difficulty using our model.

URL: http://arxiv.org/abs/1608.02214

Notes:

Encoder-decoder with Focus-mechanism for Sequence Labelling Based Spoken Language Understanding

Authors: Su Zhu, Kai Yu

Abstract: This paper investigates the framework of encoder-decoder with attention for sequence labelling based Spoken Language Understanding. We introduce BLSTM-LSTM as the encoder-decoder model to fully utilize the power of deep learning. In the sequence labelling task, the input and output sequences are aligned word by word, while the attention mechanism can't provide the exact alignment. To address the limitations of attention mechanism in the sequence labelling task, we propose a novel focus mechanism. Experiments on the standard ATIS dataset showed that BLSTM-LSTM with focus mechanism defined the new state-of-the-art by outperforming standard BLSTM and attention based encoder-decoder. Further experiments also showed that the proposed model is more robust to speech recognition errors.

URL: http://arxiv.org/abs/1608.02097

Notes:

Detecting Sarcasm in Multimodal Social Platforms

Authors: Rossano Schifanella, Paloma de Juan, Joel Tetreault, Liangliang Cao

Abstract: Sarcasm is a peculiar form of sentiment expression, where the surface sentiment differs from the implied sentiment. The detection of sarcasm in social media platforms has been applied in the past mainly to textual utterances where lexical indicators (such as interjections and intensifiers), linguistic markers, and contextual information (such as user profiles, or past conversations) were used to detect the sarcastic tone. However, modern social media platforms allow to create multimodal messages where audiovisual content is integrated with the text, making the analysis of a mode in isolation partial. In our work, we first study the relationship between the textual and visual aspects in multimodal posts from three major social media platforms, i.e., Instagram, Tumblr and Twitter, and we run a crowdsourcing task to quantify the extent to which images are perceived as necessary by human annotators. Moreover, we propose two different computational frameworks to detect sarcasm that integrate the textual and visual modalities. The first approach exploits visual semantics trained on an external dataset, and concatenates the semantics features with state-of-the-art textual features. The second method adapts a visual neural network initialized with parameters trained on ImageNet to multimodal sarcastic posts. Results show the positive effect of combining modalities for the detection of sarcasm across platforms and methods.

URL: http://arxiv.org/abs/1608.02289

Notes:

Bridging the Gap: a Semantic Similarity Measure between Queries and Documents

Authors: Sun Kim, W. John Wilbur, Zhiyong Lu

Abstract: The main approach of traditional information retrieval (IR) is to examine how many words from a query appear in a document. A drawback of this approach, however, is that it may fail to detect relevant documents where no or only few words from a query are found. The semantic analysis methods such as LSA (latent semantic analysis) and LDA (latent Dirichlet allocation) have been proposed to address the issue, but their performance is not superior compared to common IR approaches. Here we present a query-document similarity measure motivated by the Word Mover's Distance. Unlike other similarity measures, the proposed method relies on neural word embeddings to calculate the distance between words. Our method is efficient and straightforward to implement. The experimental results on TREC and PubMed show that our approach provides significantly better performance than BM25. We also discuss the pros and cons of our approach and show that there is a synergy effect when the word embedding measure is combined with the BM25 function.

URL: http://arxiv.org/abs/1608.01972

Notes:

Text authorship identified using the dynamics of word co-occurrence networks

Authors: Camilo Akimushkin, Diego R. Amancio, Osvaldo N. Oliveira Jr

Abstract: The identification of authorship in disputed documents still requires human expertise, which is now unfeasible for many tasks owing to the large volumes of text and authors in practical applications. In this study, we introduce a methodology based on the dynamics of word co-occurrence networks representing written texts to classify a corpus of 80 texts by 8 authors. The texts were divided into sections with equal number of linguistic tokens, from which time series were created for 12 topological metrics. The series were proven to be stationary (p-value>0.05), which permits to use distribution moments as learning attributes. With an optimized supervised learning procedure using a Radial Basis Function Network, 68 out of 80 texts were correctly classified, i.e. a remarkable 85% author matching success rate. Therefore, fluctuations in purely dynamic network metrics were found to characterize authorship, thus opening the way for the description of texts in terms of small evolving networks. Moreover, the approach introduced allows for comparison of texts with diverse characteristics in a simple, fast fashion.

URL: http://arxiv.org/abs/1608.01965

Notes:

Generative Transfer Learning between Recurrent Neural Networks

Authors: Sungho Shin, Kyuyeon Hwang, Wonyong Sung

Abstract: Training a neural network demands a large amount of labeled data. Keeping the data after the training may not be allowed because of legal or privacy reasons. In this study, we train a new RNN, called a student network, using a previously developed RNN, the teacher network, without using the original data. The teacher network is used for generating a data for training the student network. In order to generate a long sequence of data that does not repeat, a random number assisted output label selection method is employed. The softmax output of the teacher RNN is used as for the soft target when training a student network. The performance evaluation is conducted using a character-level language model. The experimental results show that the proposed method yields good performance approaching that of the original data based training. This work not only gives insight to knowledge transfer between RNNs but also can be useful when the original training data is not available.

URL: http://arxiv.org/abs/1608.04077

Notes:

Power Series Classification: A Hybrid of LSTM and a Novel Advancing Dynamic Time Warping

Authors: Yuanlong Li, Han Hu, Yonggang Wen, Jun Zhang

Abstract: As many applications organize data into temporal sequences, the problem of time series data classification has been widely studied. Recent studies show that the 1-nearest neighbor with dynamic time warping (1NN-DTW) and the long short term memory (LSTM) neural network can achieve a better performance than other machine learning algorithms. In this paper, we build a novel time series classification algorithm hybridizing 1NN-DTW and LSTM, and apply it to a practical data center power monitoring problem. Firstly, we define a new distance measurement for the 1NN-DTW classifier, termed as Advancing Dynamic Time Warping (ADTW), which is non-commutative and non-dynamic programming. Secondly, we hybridize the 1NN-ADTW and LSTM together. In particular, a series of auxiliary test samples generated by the linear combination of the original test sample and its nearest neighbor with ADTW are utilized to detect which classifier to trust in the hybrid algorithm. Finally, using the power consumption data from a real data center, we show that the proposed ADTW can improve the classification accuracy from about 84% to 89%. Furthermore, with the hybrid algorithm, the accuracy can be further improved and we achieve an accuracy up to about 92%. Our research can inspire more studies on non-commutative distance measurement and the hybrid of the deep learning models with other traditional models.

URL: http://arxiv.org/abs/1608.04171

Notes:

Fine-grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks

Authors: Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, Yoav Goldberg

Abstract: There is a lot of research interest in encoding variable length sentences into fixed length vectors, in a way that preserves the sentence meanings. Two common methods include representations based on averaging word vectors, and representations based on the hidden states of recurrent neural networks such as LSTMs. The sentence vectors are used as features for subsequent machine learning tasks or for pre-training in the context of deep learning. However, not much is known about the properties that are encoded in these sentence representations and about the language information they capture. We propose a framework that facilitates better understanding of the encoded representations. We define prediction tasks around isolated aspects of sentence structure (namely sentence length, word content, and word order), and score representations by the ability to train a classifier to solve each prediction task when using the representation as input. We demonstrate the potential contribution of the approach by analyzing different sentence representation mechanisms. The analysis sheds light on the relative strengths of different sentence embedding methods with respect to these low level prediction tasks, and on the effect of the encoded vector's dimensionality on the resulting representations.

URL: http://arxiv.org/abs/1608.04207

Notes:

Numerically Grounded Language Models for Semantic Error Correction

Authors: Georgios P. Spithourakis, Isabelle Augenstein, Sebastian Riedel

Abstract: Semantic error detection and correction is an important task for applications such as fact checking, speech-to-text or grammatical error correction. Current approaches generally focus on relatively shallow semantics and do not account for numeric quantities. Our approach uses language models grounded in numbers within the text. Such groundings are easily achieved for recurrent neural language model architectures, which can be further conditioned on incomplete background knowledge bases. Our evaluation on clinical reports shows that numerical grounding improves perplexity by 33% and F1 for semantic error correction by 5 points when compared to ungrounded approaches. Conditioning on a knowledge base yields further improvements.

URL: http://arxiv.org/abs/1608.04147

Notes:

SGDR: Stochastic Gradient Descent with Restarts

Authors: Ilya Loshchilov, Frank Hutter

Abstract: Restart techniques are common in gradient-free optimization to deal with multimodal functions. Partial restarts are also gaining popularity in gradient-based optimization to improve the rate of convergence in accelerated gradient schemes to deal with ill-conditioned functions. In this paper, we propose a simple restart technique for stochastic gradient descent to improve its anytime performance when training deep neural networks. We empirically study its performance on CIFAR-10 and CIFAR-100 datasets where we demonstrate new state-of-the-art results below 4% and 19%, respectively. Our source code is available at this https URL

URL: http://arxiv.org/abs/1608.03983

Notes:

Faster Training of Very Deep Networks Via p-Norm Gates

Authors: Trang Pham, Truyen Tran, Dinh Phung, Svetha Venkatesh

Abstract: A major contributing factor to the recent advances in deep neural networks is structural units that let sensory information and gradients to propagate easily. Gating is one such structure that acts as a flow control. Gates are employed in many recent state-of-the-art recurrent models such as LSTM and GRU, and feedforward models such as Residual Nets and Highway Networks. This enables learning in very deep networks with hundred layers and helps achieve record-breaking results in vision (e.g., ImageNet with Residual Nets) and NLP (e.g., machine translation with GRU). However, there is limited work in analysing the role of gating in the learning process. In this paper, we propose a flexible p-norm gating scheme, which allows user-controllable flow and as a consequence, improve the learning speed. This scheme subsumes other existing gating schemes, including those in GRU, Highway Networks and Residual Nets as special cases. Experiments on large sequence and vector datasets demonstrate that the proposed gating scheme helps improve the learning speed significantly without extra overhead.

URL: http://arxiv.org/abs/1608.03639

Notes:

Scaling Factorial Hidden Markov Models: Stochastic Variational Inference without Messages

Authors: Yin Cheng Ng, Pawel Chilinski, Ricardo Silva

Abstract: Factorial Hidden Markov Models (FHMMs) are powerful models for sequential data but they do not scale well with long sequences. We propose a scalable inference and learning algorithm for FHMMs that draws on ideas from the stochastic variational inference, neural network and copula literatures. Unlike existing approaches, the proposed algorithm requires no message passing procedure among latent variables and can be distributed to a network of computers to speed up learning. Our experiments corroborate that the proposed algorithm does not introduce further approximation bias compared to the proven structured mean-field algorithm, and achieves better performance with long sequences and large FHMMs.

URL: http://arxiv.org/abs/1608.03817

Notes:

Mollifying Networks

Authors: Caglar Gulcehre, Marcin Moczulski, Francesco Visin, Yoshua Bengio

Abstract: The optimization of deep neural networks can be more challenging than traditional convex optimization problems due to the highly non-convex nature of the loss function, e.g. it can involve pathological landscapes such as saddle-surfaces that can be difficult to escape for algorithms based on simple gradient descent. In this paper, we attack the problem of optimization of highly non-convex neural networks by starting with a smoothed — or \textit{mollified} — objective function that gradually has a more non-convex energy landscape during the training. Our proposition is inspired by the recent studies in continuation methods: similar to curriculum methods, we begin learning an easier (possibly convex) objective function and let it evolve during the training, until it eventually goes back to being the original, difficult to optimize, objective function. The complexity of the mollified networks is controlled by a single hyperparameter which is annealed during the training. We show improvements on various difficult optimization tasks and establish a relationship with recent works on continuation methods for neural networks and mollifiers.

URL: http://arxiv.org/abs/1608.04980

Notes:

An Efficient Character-Level Neural Machine Translation

Authors: Shenjian Zhao, Zhihua Zhang

Abstract: Neural machine translation aims at building a single large neural network that can be trained to maximize translation performance. The encoder-decoder architecture with an attention mechanism achieves a translation performance comparable to the existing state-of-the-art phrase-based systems on the task of English-to-French translation. However, the use of large vocabulary becomes the bottleneck in both training and improving the performance. In this paper, we propose an efficient architecture to train a deep character-level neural machine translation by introducing a decimator and an interpolator. The decimator is used to sample the source sequence before encoding while the interpolator is used to resample after decoding. Such a deep model has two major advantages. It avoids the large vocabulary issue radically; at the same time, it is much faster and more memory-efficient in training than conventional character-based models. More interestingly, our model is able to translate the misspelled word like human beings.

URL: http://arxiv.org/abs/1608.04738

Notes:

Efficient Exploration for Dialog Policy Learning with Deep BBQ Networks & Replay Buffer Spiking

Authors: Zachary C. Lipton, Jianfeng Gao, Lihong Li, Xiujun Li, Faisal Ahmed, Li Deng

Abstract: When rewards are sparse and efficient exploration essential, deep Q-learning with ϵ-greedy exploration tends to fail. This poses problems for otherwise promising domains such as task-oriented dialog systems, where the primary reward signal, indicating successful completion, typically occurs only at the end of each episode but depends on the entire sequence of utterances. A poor agent encounters such successful dialogs rarely, and a random agent may never stumble upon a successful outcome in reasonable time. We present two techniques that significantly improve the efficiency of exploration for deep Q-learning agents in dialog systems. First, we demonstrate that exploration by Thompson sampling, using Monte Carlo samples from a Bayes-by-Backprop neural network, yields marked improvement over standard DQNs with Boltzmann or ϵ-greedy exploration. Second, we show that spiking the replay buffer with a small number of successes, as are easy to harvest for dialog tasks, can make Q-learning feasible when it might otherwise fail catastrophically.

URL: http://arxiv.org/abs/1608.05081

Notes:

Recurrent Neural Networks With Limited Numerical Precision

Authors: Joachim Ott, Zhouhan Lin, Ying Zhang, Shih-Chii Liu, Yoshua Bengio

Abstract: Recurrent Neural Networks (RNNs) produce state-of-art performance on many machine learning tasks but their demand on resources in terms of memory and computational power are often high. Therefore, there is a great interest in optimizing the computations performed with these models especially when considering development of specialized low-power hardware for deep networks. One way of reducing the computational needs is to limit the numerical precision of the network weights and biases. This has led to different proposed rounding methods which have been applied so far to only Convolutional Neural Networks and Fully-Connected Networks. This paper addresses the question of how to best reduce weight precision during training in the case of RNNs. We present results from the use of different stochastic and deterministic reduced precision training methods applied to three major RNN types which are then tested on several datasets. The results show that the weight binarization methods do not work with the RNNs. However, the stochastic and deterministic ternarization, and pow2-ternarization methods gave rise to low-precision RNNs that produce similar and even higher accuracy on certain datasets therefore providing a path towards training more efficient implementations of RNNs in specialized hardware.

URL: http://arxiv.org/abs/1608.06902

Notes:

Towards Bayesian Deep Learning: A Framework and Some Existing Methods

Authors: Hao Wang, Dit-Yan Yeung

Abstract: While perception tasks such as visual object recognition and text understanding play an important role in human intelligence, the subsequent tasks that involve inference, reasoning and planning require an even higher level of intelligence. The past few years have seen major advances in many perception tasks using deep learning models. For higher-level inference, however, probabilistic graphical models with their Bayesian nature are still more powerful and flexible. To achieve integrated intelligence that involves both perception and inference, it is naturally desirable to tightly integrate deep learning and Bayesian models within a principled probabilistic framework, which we call Bayesian deep learning. In this unified framework, the perception of text or images using deep learning can boost the performance of higher-level inference and in return, the feedback from the inference process is able to enhance the perception of text or images. This paper proposes a general framework for Bayesian deep learning and reviews its recent applications on recommender systems, topic models, and control. In this paper, we also discuss the relationship and differences between Bayesian deep learning and other related topics like Bayesian treatment of neural networks.

URL: http://arxiv.org/abs/1608.06884

Notes:

Robust Named Entity Recognition in Idiosyncratic Domains

Authors: Sebastian Arnold, Felix A. Gers, Torsten Kilias, Alexander Löser

Abstract: Named entity recognition often fails in idiosyncratic domains. That causes a problem for depending tasks, such as entity linking and relation extraction. We propose a generic and robust approach for high-recall named entity recognition. Our approach is easy to train and offers strong generalization over diverse domain-specific language, such as news documents (e.g. Reuters) or biomedical text (e.g. Medline). Our approach is based on deep contextual sequence learning and utilizes stacked bidirectional LSTM networks. Our model is trained with only few hundred labeled sentences and does not rely on further external knowledge. We report from our results F1 scores in the range of 84-94% on standard datasets.

URL: http://arxiv.org/abs/1608.06757

Notes:

Decoupled Neural Interfaces using Synthetic Gradients

Authors: Max Jaderberg, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves, Koray Kavukcuoglu

Abstract: Training directed neural networks typically requires forward-propagating data through a computation graph, followed by backpropagating error signal, to produce weight updates. All layers, or more generally, modules, of the network are therefore locked, in the sense that they must wait for the remainder of the network to execute forwards and propagate error backwards before they can be updated. In this work we break this constraint by decoupling modules by introducing a model of the future computation of the network graph. These models predict what the result of the modelled subgraph will produce using only local information. In particular we focus on modelling error gradients: by using the modelled synthetic gradient in place of true backpropagated error gradients we decouple subgraphs, and can update them independently and asynchronously i.e. we realise decoupled neural interfaces. We show results for feed-forward models, where every layer is trained asynchronously, recurrent neural networks (RNNs) where predicting one's future gradient extends the time over which the RNN can effectively model, and also a hierarchical RNN system with ticking at different timescales. Finally, we demonstrate that in addition to predicting gradients, the same framework can be used to predict inputs, resulting in models which are decoupled in both the forward and backwards pass -- amounting to independent networks which co-learn such that they can be composed into a single functioning corporation.

URL: https://arxiv.org/abs/1608.05343

Notes: Потенциально реализация идей этой статьи позволяет ускорить обучение сетей весьма значительно.

A Context-aware Natural Language Generator for Dialogue Systems

Authors: Ondřej Dušek, Filip Jurčíček

Abstract: We present a novel natural language generation system for spoken dialogue systems capable of entraining (adapting) to users' way of speaking, providing contextually appropriate responses. The generator is based on recurrent neural networks and the sequence-to-sequence approach. It is fully trainable from data which include preceding context along with responses to be generated. We show that the context-aware generator yields significant improvements over the baseline in both automatic metrics and a human pairwise preference test.

URL: http://arxiv.org/abs/1608.07076

Notes:

Benchmarking State-of-the-Art Deep Learning Software Tools

Authors: Shaohuai Shi, Qiang Wang, Pengfei Xu, Xiaowen Chu

Abstract: Deep learning has been shown as a successful machine learning method for a variety of tasks, and its popularity results in numerous open-source deep learning software tools coming to public. Training a deep network is usually a very time-consuming process. To address the huge computational challenge in deep learning, many tools exploit hardware features such as multi-core CPUs and many-core GPUs to shorten the training time. However, different tools exhibit different features and running performance when training different types of deep networks on different hardware platforms, which makes it difficult for end users to select an appropriate pair of software and hardware. In this paper, we aim to make a comparative study of the state-of-the-art GPU-accelerated deep learning software tools, including Caffe, CNTK, TensorFlow, and Torch. We benchmark the running performance of these tools with three popular types of neural networks on two CPU platforms and three GPU platforms. Our contribution is two-fold. First, for deep learning end users, our benchmarking results can serve as a guide to selecting appropriate software tool and hardware platform. Second, for deep learning software developers, our in-depth analysis points out possible future directions to further optimize the training performance.

URL: http://arxiv.org/abs/1608.07249

Notes:

Densely Connected Convolutional Networks

Authors: Gao Huang, Zhuang Liu, Kilian Q. Weinberger, Laurens van der Maaten

Abstract: Recent work has shown that convolutional networks can be substantially deeper, more accurate, and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections - one between each layer and its subsequent layer - our network has L(L+1)/2 direct connections. For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. We evaluate our proposed architecture on four highly competitive object recognition benchmark tasks (CIFAR-10, CIFAR-100, SVHN, and ImageNet). DenseNets obtain significant improvements over the state-of-the-art on most of them, whilst requiring less memory and computation to achieve high performance. Code and models are available at URL .

URL: http://arxiv.org/abs/1608.06993

Notes: old paper on creating more densely connected conv nets, which stated to be related to RNN roll outs

Hash2Vec, Feature Hashing for Word Embeddings

Authors: Luis Argerich, Joaquín Torré Zaffaroni, Matías J Cano

Abstract: In this paper we propose the application of feature hashing to create word embeddings for natural language processing. Feature hashing has been used successfully to create document vectors in related tasks like document classification. In this work we show that feature hashing can be applied to obtain word embeddings in linear time with the size of the data. The results show that this algorithm, that does not need training, is able to capture the semantic meaning of words. We compare the results against GloVe showing that they are similar. As far as we know this is the first application of feature hashing to the word embeddings problem and the results indicate this is a scalable technique with practical results for NLP applications.

URL: http://arxiv.org/abs/1608.08940

Notes:

Stacked Approximated Regression Machine: A Simple Deep Learning Approach

Authors: Zhangyang Wang, Shiyu Chang, Qing Ling, Shuai Huang, Xia Hu, Honghui Shi, Thomas S. Huang

Abstract: This paper proposes the Stacked Approximated Regression Machine (SARM), a novel, simple yet powerful deep learning (DL) baseline. We start by discussing the relationship between regularized regression models and feed-forward networks, with emphasis on the non-negative sparse coding and convolutional sparse coding models. We demonstrate how these models are naturally converted into a unified feed-forward network structure, which coincides with popular DL components. SARM is constructed by stacking multiple unfolded and truncated regression models. Compared to the PCANet, whose feature extraction layers are completely linear, SARM naturally introduces non-linearities, by embedding sparsity regularization. The parameters of SARM are easily obtained, by solving a series of light-weight problems, e.g., PCA or KSVD. Extensive experiments are conducted, which show that SARM outperforms the existing simple deep baseline, PCANet, and is on par with many state-of-the-art deep models, but with much lower computational loads.

URL: http://arxiv.org/abs/1608.04062

Notes:

2016-09

Reward Augmented Maximum Likelihood for Neural Structured Prediction

Authors: Mohammad Norouzi, Samy Bengio, Zhifeng Chen, Navdeep Jaitly, Mike Schuster, Yonghui Wu, Dale Schuurmans

Abstract: A key problem in structured output prediction is direct optimization of the task reward function that matters for test evaluation. This paper presents a simple and computationally efficient approach to incorporate task reward into a maximum likelihood framework. We establish a connection between the log-likelihood and regularized expected reward objectives, showing that at a zero temperature, they are approximately equivalent in the vicinity of the optimal solution. We show that optimal regularized expected reward is achieved when the conditional distribution of the outputs given the inputs is proportional to their exponentiated (temperature adjusted) rewards. Based on this observation, we optimize conditional log-probability of edited outputs that are sampled proportionally to their scaled exponentiated reward. We apply this framework to optimize edit distance in the output label space. Experiments on speech recognition and machine translation for neural sequence to sequence models show notable improvements over a maximum likelihood baseline by using edit distance augmented maximum likelihood.

URL: http://arxiv.org/abs/1609.00150

Notes:

End-to-End Reinforcement Learning of Dialogue Agents for Information Access

Authors: Bhuwan Dhingra, Lihong Li, Xiujun Li, Jianfeng Gao, Yun-Nung Chen, Faisal Ahmed, Li Deng

Abstract: This paper proposes \emph{KB-InfoBot}---a dialogue agent that provides users with an entity from a knowledge base (KB) by interactively asking for its attributes. All components of the KB-InfoBot are trained in an end-to-end fashion using reinforcement learning. Goal-oriented dialogue systems typically need to interact with an external database to access real-world knowledge (e.g. movies playing in a city). Previous systems achieved this by issuing a symbolic query to the database and adding retrieved results to the dialogue state. However, such symbolic operations break the differentiability of the system and prevent end-to-end training of neural dialogue agents. In this paper, we address this limitation by replacing symbolic queries with an induced "soft" posterior distribution over the KB that indicates which entities the user is interested in. We also provide a modified version of the episodic REINFORCE algorithm, which allows the KB-InfoBot to explore and learn both the policy for selecting dialogue acts and the posterior over the KB for retrieving the correct entities. Experimental results show that the end-to-end trained KB-InfoBot outperforms competitive rule-based baselines, as well as agents which are not end-to-end trainable.

URL: http://arxiv.org/abs/1609.00777

Notes:

PMI Matrix Approximations with Applications to Neural Language Modeling

Authors: Oren Melamud, Ido Dagan, Jacob Goldberger

Abstract: The negative sampling (NEG) objective function, used in word2vec, is a simplification of the Noise Contrastive Estimation (NCE) method. NEG was found to be highly effective in learning continuous word representations. However, unlike NCE, it was considered inapplicable for the purpose of learning the parameters of a language model. In this study, we refute this assertion by providing a principled derivation for NEG-based language modeling, founded on a novel analysis of a low-dimensional approximation of the matrix of pointwise mutual information between the contexts and the predicted words. The obtained language modeling is closely related to NCE language models but is based on a simplified objective function. We thus provide a unified formulation for two main language processing tasks, namely word embedding and language modeling, based on the NEG objective function. Experimental results on two popular language modeling benchmarks show comparable perplexity results, with a small advantage to NEG over NCE.

URL: http://arxiv.org/abs/1609.01235

Notes:

Convolutional Neural Networks for Text Categorization: Shallow Word-level vs. Deep Character-level

Authors: Rie Johnson, Tong Zhang

Abstract: This paper reports the performances of shallow word-level convolutional neural networks (CNN), our earlier work (2015), on the eight datasets with relatively large training data that were used for testing the very deep character-level CNN in Conneau et al. (2016). Our findings are as follows. The shallow word-level CNNs achieve better error rates than the error rates reported in Conneau et al., though the results should be interpreted with some consideration due to the unique pre-processing of Conneau et al. The shallow word-level CNN uses more parameters and therefore requires more storage than the deep character-level CNN; however, the shallow word-level CNN computes much faster.

URL: http://arxiv.org/abs/1609.00718

Notes:

Skipping Word: A Character-Sequential Representation based Framework for Question Answering

Authors: Lingxun Meng, Yan Li, Mengyi Liu, Peng Shu

Abstract: Recent works using artificial neural networks based on word distributed representation greatly boost the performance of various natural language learning tasks, especially question answering. Though, they also carry along with some attendant problems, such as corpus selection for embedding learning, dictionary transformation for different learning tasks, etc. In this paper, we propose to straightforwardly model sentences by means of character sequences, and then utilize convolutional neural networks to integrate character embedding learning together with point-wise answer selection training. Compared with deep models pre-trained on word embedding (WE) strategy, our character-sequential representation (CSR) based method shows a much simpler procedure and more stable performance across different benchmarks. Extensive experiments on two benchmark answer selection datasets exhibit the competitive performance compared with the state-of-the-art methods.

URL: http://arxiv.org/abs/1609.00565

Notes:

Hierarchical Multiscale Recurrent Neural Networks

Authors: Junyoung Chung, Sungjin Ahn, Yoshua Bengio

Abstract: Learning both hierarchical and temporal representation has been among the long-standing challenges of recurrent neural networks. Multiscale recurrent neural networks have been considered as a promising approach to resolve this issue, yet there has been a lack of empirical evidence showing that this type of models can actually capture the temporal dependencies by discovering the latent hierarchical structure of the sequence. In this paper, we propose a novel multiscale approach, called the hierarchical multiscale recurrent neural networks, which can capture the latent hierarchical structure in the sequence by encoding the temporal dependencies with different timescales using a novel update mechanism. We show some evidence that our proposed multiscale architecture can discover underlying hierarchical structure in the sequences without using explicit boundary information. We evaluate our proposed model on character-level language modelling and handwriting sequence modelling.

URL: http://arxiv.org/abs/1609.01704

Notes:

Attention-Based Recurrent Neural Network Models for Joint Intent Detection and Slot Filling

Authors: Bing Liu, Ian Lane

Abstract: Attention-based encoder-decoder neural network models have recently shown promising results in machine translation and speech recognition. In this work, we propose an attention-based neural network model for joint intent detection and slot filling, both of which are critical steps for many speech understanding and dialog systems. Unlike in machine translation and speech recognition, alignment is explicit in slot filling. We explore different strategies in incorporating this alignment information to the encoder-decoder framework. Learning from the attention mechanism in encoder-decoder model, we further propose introducing attention to the alignment-based RNN models. Such attentions provide additional information to the intent classification and slot label prediction. Our independent task models achieve state-of-the-art intent detection error rate and slot filling F1 score on the benchmark ATIS task. Our joint training model further obtains 0.56% absolute (23.8% relative) error reduction on intent detection and 0.23% absolute gain on slot filling over the independent task models.

URL: http://arxiv.org/abs/1609.01454

Notes:

Joint Online Spoken Language Understanding and Language Modeling with Recurrent Neural Networks

Authors: Bing Liu, Ian Lane

Abstract: Speaker intent detection and semantic slot filling are two critical tasks in spoken language understanding (SLU) for dialogue systems. In this paper, we describe a recurrent neural network (RNN) model that jointly performs intent detection, slot filling, and language modeling. The neural network model keeps updating the intent estimation as word in the transcribed utterance arrives and uses it as contextual features in the joint model. Evaluation of the language model and online SLU model is made on the ATIS benchmarking data set. On language modeling task, our joint model achieves 11.8% relative reduction on perplexity comparing to the independent training language model. On SLU tasks, our joint model outperforms the independent task training model by 22.3% on intent detection error rate, with slight degradation on slot filling F1 score. The joint model also shows advantageous performance in the realistic ASR settings with noisy speech input.

URL: http://arxiv.org/abs/1609.01462

Notes:

Direct Feedback Alignment Provides Learning in Deep Neural Networks

Authors: Arild Nøkland

Abstract: Artificial neural networks are most commonly trained with the back-propagation algorithm, where the gradient for learning is provided by back-propagating the error, layer by layer, from the output layer to the hidden layers. A recently discovered method called feedback-alignment shows that the weights used for propagating the error backward don't have to be symmetric with the weights used for propagation the activation forward. In fact, random feedback weights work evenly well, because the network learns how to make the feedback useful. In this work, the feedback alignment principle is used for training hidden layers more independently from the rest of the network, and from a zero initial condition. The error is propagated through fixed random feedback connections directly from the output layer to each hidden layer. This simple method is able to achieve zero training error even in convolutional networks and very deep networks, completely without error back-propagation. The method is a step towards biologically plausible machine learning because the error signal is almost local, and no symmetric or reciprocal weights are required. Experiments show that the test performance on MNIST and CIFAR is almost as good as those obtained with back-propagation for fully connected networks. If combined with dropout, the method achieves 1.45% error on the permutation invariant MNIST task.

URL: http://arxiv.org/abs/1609.01596

Notes:

Fitted Learning: Models with Awareness of their Limits

Authors: Navid Kardan, Kenneth O. Stanley

Abstract: Though deep learning has pushed the boundaries of classification forward, in recent years hints of the limits of standard classification have begun to emerge. Problems such as fooling, adding new classes over time, and the need to retrain learning models only for small changes to the original problem all point to a potential shortcoming in the classic classification regime, where a comprehensive a priori knowledge of the possible classes or concepts is critical. Without such knowledge, classifiers misjudge the limits of their knowledge and overgeneralization therefore becomes a serious obstacle to consistent performance. In response to these challenges, this paper extends the classic regime by reframing classification instead with the assumption that concepts present in the training set are only a sample of the hypothetical final set of concepts. To bring learning models into this new paradigm, a novel elaboration of standard architectures called the competitive overcomplete output layer (COOL) neural network is introduced. Experiments demonstrate the effectiveness of COOL by applying it to fooling, separable concept learning, one-class neural networks, and standard classification benchmarks. The results suggest that, unlike conventional classifiers, the amount of generalization in COOL networks can be tuned to match the problem.

URL: http://arxiv.org/abs/1609.02226

Notes:

Discrete Variational Autoencoders

Authors: Jason Tyler Rolfe

Abstract: Probabilistic models with discrete latent variables naturally capture datasets composed of discrete classes. However, they are difficult to train efficiently, since backpropagation through discrete variables is generally not possible. We introduce a novel class of probabilistic models, comprising an undirected discrete component and a directed hierarchical continuous component, that can be trained efficiently using the variational autoencoder framework. The discrete component captures the distribution over the disconnected smooth manifolds induced by the continuous component. As a result, this class of models efficiently learns both the class of objects in an image, and their specific realization in pixels, from unsupervised data; and outperforms state-of-the-art methods on the permutation-invariant MNIST, OMNIGLOT, and Caltech-101 Silhouettes datasets.

URL: http://arxiv.org/abs/1609.02200

Notes:

Ask the GRU: Multi-task Learning for Deep Text Recommendations

Authors: Trapit Bansal, David Belanger, Andrew McCallum

Abstract: In a variety of application domains the content to be recommended to users is associated with text. This includes research papers, movies with associated plot summaries, news articles, blog posts, etc. Recommendation approaches based on latent factor models can be extended naturally to leverage text by employing an explicit mapping from text to factors. This enables recommendations for new, unseen content, and may generalize better, since the factors for all items are produced by a compactly-parametrized model. Previous work has used topic models or averages of word embeddings for this mapping. In this paper we present a method leveraging deep recurrent neural networks to encode the text sequence into a latent vector, specifically gated recurrent units (GRUs) trained end-to-end on the collaborative filtering task. For the task of scientific paper recommendation, this yields models with significantly higher accuracy. In cold-start scenarios, we beat the previous state-of-the-art, all of which ignore word order. Performance is further improved by multi-task learning, where the text encoder network is trained for a combination of content recommendation and item metadata prediction. This regularizes the collaborative filtering model, ameliorating the problem of sparsity of the observed rating matrix.

URL: http://arxiv.org/abs/1609.02116

Notes:

A modular architecture for transparent computation in Recurrent Neural Networks

Authors: Giovanni Sirio Carmantini, Peter beim Graben, Mathieu Desroches, Serafim Rodrigues

Abstract: Computation is classically studied in terms of automata, formal languages and algorithms; yet, the relation between neural dynamics and symbolic representations and operations is still unclear in traditional eliminative connectionism. Therefore, we suggest a unique perspective on this central issue, to which we would like to refer as to transparent connectionism, by proposing accounts of how symbolic computation can be implemented in neural substrates. In this study we first introduce a new model of dynamics on a symbolic space, the versatile shift, showing that it supports the real-time simulation of a range of automata. We then show that the Goedelization of versatile shifts defines nonlinear dynamical automata, dynamical systems evolving on a vectorial space. Finally, we present a mapping between nonlinear dynamical automata and recurrent artificial neural networks. The mapping defines an architecture characterized by its granular modularity, where data, symbolic operations and their control are not only distinguishable in activation space, but also spatially localizable in the network itself, while maintaining a distributed encoding of symbolic representations. The resulting networks simulate automata in real-time and are programmed directly, in absence of network training. To discuss the unique characteristics of the architecture and their consequences, we present two examples: i) the design of a Central Pattern Generator from a finite-state locomotive controller, and ii) the creation of a network simulating a system of interactive automata that supports the parsing of garden-path sentences as investigated in psycholinguistics experiments.

URL: http://arxiv.org/abs/1609.01926

Notes:

Learning Boltzmann Machine with EM-like Method

Authors: Jinmeng Song, Chun Yuan

Abstract: We propose an expectation-maximization-like(EMlike) method to train Boltzmann machine with unconstrained connectivity. It adopts Monte Carlo approximation in the E-step, and replaces the intractable likelihood objective with efficiently computed objectives or directly approximates the gradient of likelihood objective in the M-step. The EM-like method is a modification of alternating minimization. We prove that EM-like method will be the exactly same with contrastive divergence in restricted Boltzmann machine if the M-step of this method adopts special approximation. We also propose a new measure to assess the performance of Boltzmann machine as generative models of data, and its computational complexity is O(Rmn). Finally, we demonstrate the performance of EM-like method using numerical experiments.

URL: http://arxiv.org/abs/1609.01840

Notes:

Polysemous codes

Authors: Matthijs Douze, Hervé Jégou, Florent Perronnin

Abstract: This paper considers the problem of approximate nearest neighbor search in the compressed domain. We introduce polysemous codes, which offer both the distance estimation quality of product quantization and the efficient comparison of binary codes with Hamming distance. Their design is inspired by algorithms introduced in the 90's to construct channel-optimized vector quantizers. At search time, this dual interpretation accelerates the search. Most of the indexed vectors are filtered out with Hamming distance, letting only a fraction of the vectors to be ranked with an asymmetric distance estimator. The method is complementary with a coarse partitioning of the feature space such as the inverted multi-index. This is shown by our experiments performed on several public benchmarks such as the BIGANN dataset comprising one billion vectors, for which we report state-of-the-art results for query times below 0.3,millisecond per core. Last but not least, our approach allows the approximate computation of the k-NN graph associated with the Yahoo Flickr Creative Commons 100M, described by CNN image descriptors, in less than 8 hours on a single machine.

URL: http://arxiv.org/abs/1609.01882

Notes: could be helpful as replacement for KDTree in search for appropriate line

WaveNet: A Generative Model For Raw Audio

Authors: Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonya, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu

Abstract: This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Chinese. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.

URL: https://drive.google.com/file/d/0B3cxcnOkPx9AeWpLVXhkTDJINDQ/view

Notes: Свежая статья от DeepMind, Леша предложил попробовать применить к текстам.

Multimodal Attention for Neural Machine Translation

Authors: Ozan Caglayan, Loïc Barrault, Fethi Bougares

Abstract: The attention mechanism is an important part of the neural machine translation (NMT) where it was reported to produce richer source representation compared to fixed-length encoding sequence-to-sequence models. Recently, the effectiveness of attention has also been explored in the context of image captioning. In this work, we assess the feasibility of a multimodal attention mechanism that simultaneously focus over an image and its natural language description for generating a description in another language. We train several variants of our proposed attention mechanism on the Multi30k multilingual image captioning dataset. We show that a dedicated attention for each modality achieves up to 1.6 points in BLEU and METEOR compared to a textual NMT baseline.

URL: http://arxiv.org/abs/1609.03976

Notes:

An Experimental Study of LSTM Encoder-Decoder Model for Text Simplification

Authors: Tong Wang, Ping Chen, Kevin Amaral, Jipeng Qiang

Abstract: Text simplification (TS) aims to reduce the lexical and structural complexity of a text, while still retaining the semantic meaning. Current automatic TS techniques are limited to either lexical-level applications or manually defining a large amount of rules. Since deep neural networks are powerful models that have achieved excellent performance over many difficult tasks, in this paper, we propose to use the Long Short-Term Memory (LSTM) Encoder-Decoder model for sentence level TS, which makes minimal assumptions about word sequence. We conduct preliminary experiments to find that the model is able to learn operation rules such as reversing, sorting and replacing from sequence pairs, which shows that the model may potentially discover and apply rules such as modifying sentence structure, substituting words, and removing words for TS.

URL: http://arxiv.org/abs/1609.03663

Notes:

Character-Level Language Modeling with Hierarchical Recurrent Neural Networks

Authors: Kyuyeon Hwang, Wonyong Sung

Abstract: Recurrent neural network (RNN) based character-level language models (CLMs) are extremely useful for modeling unseen words by nature. However, their performance is generally much worse than the word-level language models (WLMs), since CLMs need to consider longer history of tokens to properly predict the next one. We address this problem by proposing hierarchical RNN architectures, which consist of multiple modules with different clock rates. Despite the multi-clock structures, the input and output layers operate with the character-level clock, which allows the existing RNN CLM training approaches to be directly applicable without any modifications. Our CLM models show better perplexity than Kneser-Ney (KN) 5-gram WLMs on the One Billion Word Benchmark with only 2% of parameters. Also, we present real-time character-level end-to-end speech recognition examples on the Wall Street Journal (WSJ) corpus, where replacing traditional mono-clock RNN CLMs with the proposed models results in better recognition accuracies even though the number of parameters are reduced to 30%.

URL: http://arxiv.org/abs/1609.03777

Notes:

Factored Neural Machine Translation

Authors: Mercedes García-Martínez, Loïc Barrault, Fethi Bougares

Abstract: We present a new approach for neural machine translation (NMT) using the morphological and grammatical decomposition of the words (factors) in the output side of the neural network. This architecture addresses two main problems occurring in MT, namely dealing with a large target language vocabulary and the out of vocabulary (OOV) words. By the means of factors, we are able to handle larger vocabulary and reduce the training time (for systems with equivalent target language vocabulary size). In addition, we can produce new words that are not in the vocabulary. We use a morphological analyser to get a factored representation of each word (lemmas, Part of Speech tag, tense, person, gender and number). We have extended the NMT approach with attention mechanism in order to have two different outputs, one for the lemmas and the other for the rest of the factors. The final translation is built using some \textit{a priori} linguistic information. We compare our extension with a word-based NMT system. The experiments, performed on the IWSLT'15 dataset translating from English to French, show that while the performance do not always increase, the system can manage a much larger vocabulary and consistently reduce the OOV rate. We observe up to 2% BLEU point improvement in a simulated out of domain translation setup.

URL: http://arxiv.org/abs/1609.04621

Notes:

Learning Text Pair Similarity with Context-sensitive Autoencoders

Authors: Hadi Amiri, Philip Resnik, Jordan Boyd-Graber, Hal Daume III

Abstract: We present a pairwise context-sensitive Autoencoder for computing text pair similarity. Our model encodes input text into context-sensitive representations and uses them to compute similarity between text pairs. Our model outperforms the state-of-the-art models in two semantic retrieval tasks and a contextual word similarity task. For retrieval, our unsupervised approach that merely ranks inputs with respect to the cosine similarity between their hidden representations shows comparable performance with the state-of-the-art supervised models and in some cases outperforms them.

URL: http://www.cs.colorado.edu/~jbg/docs/2016_acl_context_ae.pdf

Notes:

Learning Opposites Using Neural Networks

Authors: Shivam Kalra, Aditya Sriram, Shahryar Rahnamayan, H.R. Tizhoosh

Abstract: Many research works have successfully extended algorithms such as evolutionary algorithms, reinforcement agents and neural networks using "opposition-based learning" (OBL). Two types of the "opposites" have been defined in the literature, namely \textit{type-I} and \textit{type-II}. The former are linear in nature and applicable to the variable space, hence easy to calculate. On the other hand, type-II opposites capture the "oppositeness" in the output space. In fact, type-I opposites are considered a special case of type-II opposites where inputs and outputs have a linear relationship. However, in many real-world problems, inputs and outputs do in fact exhibit a nonlinear relationship. Therefore, type-II opposites are expected to be better in capturing the sense of "opposition" in terms of the input-output relation. In the absence of any knowledge about the problem at hand, there seems to be no intuitive way to calculate the type-II opposites. In this paper, we introduce an approach to learn type-II opposites from the given inputs and their outputs using the artificial neural networks (ANNs). We first perform \emph{opposition mining} on the sample data, and then use the mined data to learn the relationship between input x and its opposite x˘. We have validated our algorithm using various benchmark functions to compare it against an evolving fuzzy inference approach that has been recently introduced. The results show the better performance of a neural approach to learn the opposites. This will create new possibilities for integrating oppositional schemes within existing algorithms promising a potential increase in convergence speed and/or accuracy.

URL: http://arxiv.org/abs/1609.05123

Notes:

Enhancing and Combining Sequential and Tree LSTM for Natural Language Inference

Authors: Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, Hui Jiang

Abstract: Reasoning and inference are central to human and artificial intelligence. Modeling inference in human language is notoriously challenging but is fundamental to natural language understanding and many applications. With the availability of large annotated data, neural network models have recently advanced the field significantly. In this paper, we present a new state-of-the-art result, achieving the accuracy of 88.3% on the standard benchmark, the Stanford Natural Language Inference dataset. This result is achieved first through our enhanced sequential encoding model, which outperforms the previous best model that employs more complicated network architectures, suggesting that the potential of sequential LSTM-based models have not been fully explored yet in previous work. We further show that by explicitly considering recursive architectures, we achieve additional improvement. Particularly, incorporating syntactic parse information contributes to our best result; it improves the performance even when the parse information is added to an already very strong system.

URL: http://arxiv.org/abs/1609.06038

Notes:

Learning Robust Representations of Text

Authors: Yitong Li, Trevor Cohn, Timothy Baldwin

Abstract: Deep neural networks have achieved remarkable results across many language processing tasks, however these methods are highly sensitive to noise and adversarial attacks. We present a regularization based method for limiting network sensitivity to its inputs, inspired by ideas from computer vision, thus learning models that are more robust. Empirical evaluation over a range of sentiment datasets with a convolutional neural network shows that, compared to a baseline model and the dropout method, our method achieves superior performance over noisy inputs and out-of-domain data.

URL: http://arxiv.org/abs/1609.06082

Notes:

A Cheap Linear Attention Mechanism with Fast Lookups and Fixed-Size Representations

Authors: Alexandre de Brébisson, Pascal Vincent

Abstract: The softmax content-based attention mechanism has proven to be very beneficial in many applications of recurrent neural networks. Nevertheless it suffers from two major computational limitations. First, its computations for an attention lookup scale linearly in the size of the attended sequence. Second, it does not encode the sequence into a fixed-size representation but instead requires to memorize all the hidden states. These two limitations restrict the use of the softmax attention mechanism to relatively small-scale applications with short sequences and few lookups per sequence. In this work we introduce a family of linear attention mechanisms designed to overcome the two limitations listed above. We show that removing the softmax non-linearity from the traditional attention formulation yields constant-time attention lookups and fixed-size representations of the attended sequences. These properties make these linear attention mechanisms particularly suitable for large-scale applications with extreme query loads, real-time requirements and memory constraints. Early experiments on a question answering task show that these linear mechanisms yield significantly better accuracy results than no attention, but obviously worse than their softmax alternative.

URL: http://arxiv.org/abs/1609.05866

Notes: Надо посмотреть, здесь кажется обещают простой в реализации attention.

ReasoNet: Learning to Stop Reading in Machine Comprehension

Authors: Yelong Shen, Po-Sen Huang, Jianfeng Gao, Weizhu Chen

Abstract: Teaching a computer to read a document and answer general questions pertaining to the document is a challenging yet unsolved problem. In this paper, we describe a novel neural network architecture called Reasoning Network ({ReasoNet}) for machine comprehension tasks. ReasoNet makes use of multiple turns to effectively exploit and then reason over the relation among queries, documents, and answers. Different from previous approaches using a fixed number of turns during inference, ReasoNet introduces a termination state to relax this constraint on the reasoning depth. With the use of reinforcement learning, ReasoNet can dynamically determine whether to continue the comprehension process after digesting intermediate results, or to terminate reading when it concludes that existing information is adequate to produce an answer. ReasoNet has achieved state-of-the-art performance in machine comprehension datasets, including unstructured CNN and Daily Mail datasets, and a structured Graph Reachability dataset.

URL: http://arxiv.org/abs/1609.05284

Notes:

Select-Additive Learning: Improving Cross-individual Generalization in Multimodal Sentiment Analysis

Authors: Haohan Wang, Aaksha Meghawat, Louis-Philippe Morency, Eric P. Xing

Abstract: Multimodal sentiment analysis is drawing an increasing amount of attention these days. It enables mining of opinions in video reviews and surveys which are now available aplenty on online platforms like YouTube. However, the limited number of high-quality multimodal sentiment data samples may introduce the problem of the sentiment being dependent on the individual specific features in the dataset. This results in a lack of generalizability of the trained models for classification on larger online platforms. In this paper, we first examine the data and verify the existence of this dependence problem. Then we propose a Select-Additive Learning (SAL) procedure that improves the generalizability of trained discriminative neural networks. SAL is a two-phase learning method. In Selection phase, it selects the confounding learned representation. In Addition phase, it forces the classifier to discard confounded representations by adding Gaussian noise. In our experiments, we show how SAL improves the generalizability of state-of-the-art models. We increase prediction accuracy significantly in all three modalities (text, audio, video), as well as in their fusion. We show how SAL, even when trained on one dataset, achieves good accuracy across test datasets.

URL: http://arxiv.org/abs/1609.05244

Notes:

Sparse Boltzmann Machines with Structure Learning as Applied to Text Analysis

Authors: Zhourong Chen, Nevin L. Zhang, Dit-Yan Yeung, Peixian Chen

Abstract: We are interested in exploring the possibility and benefits of structure learning for deep models. As the first step, this paper investigates the matter for Restricted Boltzmann Machines (RBMs). We conduct the study with Replicated Softmax, a variant of RBMs for unsupervised text analysis. We present a method for learning what we call Sparse Boltzmann Machines, where each hidden unit is connected to a subset of the visible units instead of all of them. Empirical results show that the method yields models with significantly improved model fit and interpretability as compared with RBMs where each hidden unit is connected to all visible units.

URL: http://arxiv.org/abs/1609.05294

Notes:

SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient

Authors: Lantao Yu, Weinan Zhang, Jun Wang, Yong Yu

Abstract: As a new way of training generative models, Generative Adversarial Nets (GAN) that uses a discriminative model to guide the training of the generative model has enjoyed considerable success in generating real-valued data. However, it has limitations when the goal is for generating sequences of discrete tokens. A major reason lies in that the discrete outputs from the generative model make it difficult to pass the gradient update from the discriminative model to the generative model. Also, the discriminative model can only assess a complete sequence, while for a partially generated sequence, it is non-trivial to balance its current score and the future one once the entire sequence has been generated. In this paper, we propose a sequence generation framework, called SeqGAN, to solve the problems. Modeling the data generator as a stochastic policy in reinforcement learning (RL), SeqGAN bypasses the generator differentiation problem by directly performing gradient policy update. The RL reward signal comes from the GAN discriminator judged on a complete sequence, and is passed back to the intermediate state-action steps using Monte Carlo search. Extensive experiments on synthetic data and real-world tasks demonstrate significant improvements over strong baselines.

URL: http://arxiv.org/abs/1609.05473

Notes:

Interactive Spoken Content Retrieval by Deep Reinforcement Learning

Authors: Yen-Chen Wu, Tzu-Hsiang Lin, Yang-De Chen, Hung-Yi Lee, Lin-Shan Lee

Abstract: User-machine interaction is important for spoken content retrieval. For text content retrieval, the user can easily scan through and select on a list of retrieved item. This is impossible for spoken content retrieval, because the retrieved items are difficult to show on screen. Besides, due to the high degree of uncertainty for speech recognition, the retrieval results can be very noisy. One way to counter such difficulties is through user-machine interaction. The machine can take different actions to interact with the user to obtain better retrieval results before showing to the user. The suitable actions depend on the retrieval status, for example requesting for extra information from the user, returning a list of topics for user to select, etc. In our previous work, some hand-crafted states estimated from the present retrieval results are used to determine the proper actions. In this paper, we propose to use Deep-Q-Learning techniques instead to determine the machine actions for interactive spoken content retrieval. Deep-Q-Learning bypasses the need for estimation of the hand-crafted states, and directly determine the best action base on the present retrieval status even without any human knowledge. It is shown to achieve significantly better performance compared with the previous hand-crafted states.

URL: http://arxiv.org/abs/1609.05234

Notes:

Label-Free Supervision of Neural Networks with Physics and Domain Knowledge

Authors: Russell Stewart, Stefano Ermon

Abstract: In many machine learning applications, labeled data is scarce and obtaining more labels is expensive. We introduce a new approach to supervising neural networks by specifying constraints that should hold over the output space, rather than direct examples of input-output pairs. These constraints are derived from prior domain knowledge, e.g., from known laws of physics. We demonstrate the effectiveness of this approach on real world and simulated computer vision tasks. We are able to train a convolutional neural network to detect and track objects without any labeled examples. Our approach can significantly reduce the need for labeled training data, but introduces new challenges for encoding prior knowledge into appropriate loss functions.

URL: http://arxiv.org/abs/1609.05566

Notes:

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

Authors: Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, Jeffrey Dean

Abstract: Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference. Also, most NMT systems have difficulty with rare words. These issues have hindered NMT's use in practical deployments and services, where both accuracy and speed are essential. In this work, we present GNMT, Google's Neural Machine Translation system, which attempts to address many of these issues. Our model consists of a deep LSTM network with 8 encoder and 8 decoder layers using attention and residual connections. To improve parallelism and therefore decrease training time, our attention mechanism connects the bottom layer of the decoder to the top layer of the encoder. To accelerate the final translation speed, we employ low-precision arithmetic during inference computations. To improve handling of rare words, we divide words into a limited set of common sub-word units ("wordpieces") for both input and output. This method provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models, naturally handles translation of rare words, and ultimately improves the overall accuracy of the system. Our beam search technique employs a length-normalization procedure and uses a coverage penalty, which encourages generation of an output sentence that is most likely to cover all the words in the source sentence. On the WMT'14 English-to-French and English-to-German benchmarks, GNMT achieves competitive results to state-of-the-art. Using a human side-by-side evaluation on a set of isolated simple sentences, it reduces translation errors by an average of 60% compared to Google's phrase-based production system.

URL: http://arxiv.org/abs/1609.08144

Notes:

Lattice-Based Recurrent Neural Network Encoders for Neural Machine Translation

Authors: Jinsong Su, Zhixing Tan, Deyi Xiong, Yang Liu

Abstract: Neural machine translation (NMT) heavily relies on word level modelling to learn semantic representations of input sentences. However, for languages without natural word delimiters (e.g., Chinese) where input sentences have to be tokenized first, conventional NMT is confronted with two issues: 1) it is difficult to find an optimal tokenization granularity for source sentence modelling, and 2) errors in 1-best tokenizations may propagate to the encoder of NMT. To handle these issues, we propose word-lattice based Recurrent Neural Network (RNN) encoders for NMT, which generalize the standard RNN to word lattice topology. The proposed encoders take as input a word lattice that compactly encodes multiple tokenizations, and learn to generate new hidden states from arbitrarily many inputs and hidden states in preceding time steps. As such, the word-lattice based encoders not only alleviate the negative impact of tokenization errors but also are more expressive and flexible to embed input sentences. Experiment results on Chinese-English translation demonstrate the superiorities of the proposed encoders over the conventional encoder.

URL: http://arxiv.org/abs/1609.07730

Notes:

Pointer Sentinel Mixture Models

Authors: Stephen Merity, Caiming Xiong, James Bradbury, Richard Socher

Abstract: Recent neural network sequence models with softmax classifiers have achieved their best language modeling performance only with very large hidden states and large vocabularies. Even then they struggle to predict rare or unseen words even if the context makes the prediction unambiguous. We introduce the pointer sentinel mixture architecture for neural sequence models which has the ability to either reproduce a word from the recent context or produce a word from a standard softmax classifier. Our pointer sentinel-LSTM model achieves state of the art language modeling performance on the Penn Treebank (70.9 perplexity) while using far fewer parameters than a standard softmax LSTM. In order to evaluate how well language models can exploit longer contexts and deal with more realistic vocabularies and larger corpora we also introduce the freely available WikiText corpus.

URL: http://arxiv.org/abs/1609.07843

Notes: the idea is to add pointer loss, i.e. when a word is repeated we could know that, pointing to the previous occurence of it, and base of softmax brings the universality to the mixture model

Creating Causal Embeddings for Question Answering with Minimal Supervision

Authors: Rebecca Sharp, Mihai Surdeanu, Peter Jansen, Peter Clark, Michael Hammond

Abstract: A common model for question answering (QA) is that a good answer is one that is closely related to the question, where relatedness is often determined using general-purpose lexical models such as word embeddings. We argue that a better approach is to look for answers that are related to the question in a relevant way, according to the information need of the question, which may be determined through task-specific embeddings. With causality as a use case, we implement this insight in three steps. First, we generate causal embeddings cost-effectively by bootstrapping cause-effect pairs extracted from free text using a small set of seed patterns. Second, we train dedicated embeddings over this data, by using task-specific contexts, i.e., the context of a cause is its effect. Finally, we extend a state-of-the-art reranking approach for QA to incorporate these causal embeddings. We evaluate the causal embedding models both directly with a casual implication task, and indirectly, in a downstream causal QA task using data from Yahoo! Answers. We show that explicitly modeling causality improves performance in both tasks. In the QA task our best model achieves 37.3% P@1, significantly outperforming a strong baseline by 7.7% (relative).

URL: http://arxiv.org/abs/1609.08097

Notes:

Toward Socially-Infused Information Extraction: Embedding Authors, Mentions, and Entities

Authors: Yi Yang, Ming-Wei Chang, Jacob Eisenstein

Abstract: Entity linking is the task of identifying mentions of entities in text, and linking them to entries in a knowledge base. This task is especially difficult in microblogs, as there is little additional text to provide disambiguating context; rather, authors rely on an implicit common ground of shared knowledge with their readers. In this paper, we attempt to capture some of this implicit context by exploiting the social network structure in microblogs. We build on the theory of homophily, which implies that socially linked individuals share interests, and are therefore likely to mention the same sorts of entities. We implement this idea by encoding authors, mentions, and entities in a continuous vector space, which is constructed so that socially-connected authors have similar vector representations. These vectors are incorporated into a neural structured prediction model, which captures structural constraints that are inherent in the entity linking task. Together, these design decisions yield F1 improvements of 1%-5% on benchmark datasets, as compared to the previous state-of-the-art.

URL: http://arxiv.org/abs/1609.08084

Notes:

Language as a Latent Variable: Discrete Generative Models for Sentence Compression

Authors: Yishu Miao, Phil Blunsom

Abstract: In this work we explore deep generative models of text in which the latent representation of a document is itself drawn from a discrete language model distribution. We formulate a variational auto-encoder for inference in this model and apply it to the task of compressing sentences. In this application the generative model first draws a latent summary sentence from a background language model, and then subsequently draws the observed sentence conditioned on this latent summary. In our empirical evaluation we show that generative formulations of both abstractive and extractive compression yield state-of-the-art results when trained on a large amount of supervised data. Further, we explore semi-supervised compression scenarios where we show that it is possible to achieve performance competitive with previously proposed supervised models while training on a fraction of the supervised data.

URL: http://arxiv.org/abs/1609.07317

Notes:

Deep Reinforcement Learning for Mention-Ranking Coreference Models

Authors: Kevin Clark, Christopher D. Manning

Abstract: Coreference resolution systems are typically trained with heuristic loss functions that require careful tuning. In this paper we instead apply reinforcement learning to directly optimize a neural mention-ranking model for coreference evaluation metrics. We experiment with two approaches: the REINFORCE policy gradient algorithm and a reward-rescaled max-margin objective. We find the latter to be more effective, resulting in significant improvements over the current state-of-the-art on the English and Chinese portions of the CoNLL 2012 Shared Task.

URL: http://arxiv.org/abs/1609.08667

Notes:

Unsupervised Neural Hidden Markov Models

Authors: Ke Tran, Yonatan Bisk, Ashish Vaswani, Daniel Marcu, Kevin Knight

Abstract: In this work, we present the first results for neuralizing an Unsupervised Hidden Markov Model. We evaluate our approach on tag induction. Our approach outperforms existing generative models and is competitive with the state-of-the-art though with a simpler model easily extended to include additional context.

URL: http://arxiv.org/abs/1609.09007

Notes:

Hierarchical Memory Networks for Answer Selection on Unknown Words

Authors: Jiaming Xu, Jing Shi, Yiqun Yao, Suncong Zheng, Bo Xu, Bo Xu

Abstract: Recently, end-to-end memory networks have shown promising results on Question Answering task, which encode the past facts into an explicit memory and perform reasoning ability by making multiple computational steps on the memory. However, memory networks conduct the reasoning on sentence-level memory to output coarse semantic vectors and do not further take any attention mechanism to focus on words, which may lead to the model lose some detail information, especially when the answers are rare or unknown words. In this paper, we propose a novel Hierarchical Memory Networks, dubbed HMN. First, we encode the past facts into sentence-level memory and word-level memory respectively. Then, (k)-max pooling is exploited following reasoning module on the sentence-level memory to sample the (k) most relevant sentences to a question and feed these sentences into attention mechanism on the word-level memory to focus the words in the selected sentences. Finally, the prediction is jointly learned over the outputs of the sentence-level reasoning module and the word-level attention mechanism. The experimental results demonstrate that our approach successfully conducts answer selection on unknown words and achieves a better performance than memory networks.

URL: http://arxiv.org/abs/1609.08843

Notes:

HyperNetworks

Authors: David Ha, Andrew Dai, Quoc V. Le

Abstract: This work explores hypernetworks: an approach of using a small network, also known as a hypernetwork, to generate the weights for a larger network. Hypernetworks provide an abstraction that is similar to what is found in nature: the relationship between a genotype - the hypernetwork - and a phenotype - the main network. Though they are also reminiscent of HyperNEAT in evolution, our hypernetworks are trained end-to-end with backpropagation and thus are usually faster. The focus of this work is to make hypernetworks useful for deep convolutional networks and long recurrent networks, where hypernetworks can be viewed as relaxed form of weight-sharing across layers. Our main result is that hypernetworks can generate non-shared weights for LSTM and achieve state-of-art results on a variety of language modeling tasks with Character-Level Penn Treebank and Hutter Prize Wikipedia datasets, challenging the weight-sharing paradigm for recurrent networks. Our results also show that hypernetworks applied to convolutional networks still achieve respectable results for image recognition tasks compared to state-of-the-art baseline models while requiring fewer learnable parameters.

URL: https://arxiv.org/abs/1609.09106

Notes:

Semantic Parsing with Semi-Supervised Sequential Autoencoders

Authors: Tomáš Kočiský, Gábor Melis, Edward Grefenstette, Chris Dyer, Wang Ling, Phil Blunsom, Karl Moritz Hermann

Abstract: We present a novel semi-supervised approach for sequence transduction and apply it to semantic parsing. The unsupervised component is based on a generative model in which latent sentences generate the unpaired logical forms. We apply this method to a number of semantic parsing tasks focusing on domains with limited access to labelled training data and extend those datasets with synthetically generated logical forms.

URL: https://arxiv.org/abs/1609.09315

Notes:

Inducing Multilingual Text Analysis Tools Using Bidirectional Recurrent Neural Networks

Authors: Othman Zennaki, Nasredine Semmar, Laurent Besacier

Abstract: This work focuses on the rapid development of linguistic annotation tools for resource-poor languages. We experiment several cross-lingual annotation projection methods using Recurrent Neural Networks (RNN) models. The distinctive feature of our approach is that our multilingual word representation requires only a parallel corpus between the source and target language. More precisely, our method has the following characteristics: (a) it does not use word alignment information, (b) it does not assume any knowledge about foreign languages, which makes it applicable to a wide range of resource-poor languages, (c) it provides truly multilingual taggers. We investigate both uni- and bi-directional RNN models and propose a method to include external information (for instance low level information from POS) in the RNN to train higher level taggers (for instance, super sense taggers). We demonstrate the validity and genericity of our model by using parallel corpora (obtained by manual or automatic translation). Our experiments are conducted to induce cross-lingual POS and super sense taggers.

URL: https://arxiv.org/abs/1609.09382

Notes:

Efficient softmax approximation for GPUs

Authors: Edouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, Hervé Jégou

Abstract: We propose an approximate strategy to efficiently train neural network based language models over very large vocabularies. Our approach, called adaptive softmax, circumvents the linear dependency on the vocabulary size by exploiting the unbalanced word distribution to form clusters that explicitly minimize the expectation of computational complexity. Our approach further reduces the computational cost by exploiting the specificities of modern architectures and matrix-matrix vector operations, making it particularly suited for graphical processing units. Our experiments carried out on standard benchmarks, such as EuroParl and One Billion Word, show that our approach brings a large gain in efficiency over standard approximations while achieving an accuracy close to that of the full softmax.

URL: https://arxiv.org/abs/1609.04309

Notes: splitting all the words by clusters of frequencies gives the performance boost

Multiplicative LSTM for sequence modelling

Authors: Ben Krause, Liang Lu, Iain Murray, Steve Renals

Abstract: This paper introduces multiplicative LSTM, a novel hybrid recurrent neural network architecture for sequence modelling that combines the long short-term memory (LSTM) and multiplicative recurrent neural network architectures. Multiplicative LSTM is motivated by its flexibility to have very different recurrent transition functions for each possible input, which we argue helps make it more expressive in autoregressive density estimation. We show empirically that multiplicative LSTM outperforms standard LSTM and its deep variants for a range of character level modelling tasks. We also found that this improvement increases as the complexity of the task scales up. This model achieves a test error of 1.19 bits/character on the last 4 million characters of the Hutter prize dataset when combined with dynamic evaluation.

URL: https://arxiv.org/abs/1609.07959

Notes: I've missed this paper last year, but now it is mentioned in Sutskever's new paper about sentiment; the main idea on mRNN (and subsequently mLSTM) - the decomposition of weight matrix to two matrices: one FC on hidden2hidden and one diag input-dependent

2016-10

Empirical Evaluation of RNN Architectures on Sentence Classification Task

Authors: Lei Shen, Junlin Zhang

Abstract: Recurrent Neural Networks have achieved state-of-the-art results for many problems in NLP and two most popular RNN architectures are Tail Model and Pooling Model. In this paper, a hybrid architecture is proposed and we present the first empirical study using LSTMs to compare performance of the three RNN structures on sentence classification task. Experimental results show that the Tail Model and Hybrid Model consistently get a better performance over Pooling Model, and Hybrid Model is comparable with Tail Model.

URL: https://arxiv.org/abs/1609.09171

Notes:

Learning to Translate in Real-time with Neural Machine Translation

Authors: Jiatao Gu, Graham Neubig, Kyunghyun Cho, Victor O.K. Li

Abstract: Translating in real-time, a.k.a. simultaneous translation, outputs translation words before the input sentence ends, which is a challenging problem for conventional machine translation methods. We propose a neural machine translation (NMT) framework for simultaneous translation in which an agent learns to make decisions on when to translate from the interaction with a pre-trained NMT environment. To trade off quality and delay, we extensively explore various targets for delay and design a method for beam-search applicable in the simultaneous MT setting. Experiments against state-of-the-art baselines on two language pairs demonstrate the efficacy of the proposed framework both quantitatively and qualitatively.

URL: https://arxiv.org/abs/1610.00388

Notes:

Vocabulary Selection Strategies for Neural Machine Translation

Authors: Gurvan L'Hostis, David Grangier, Michael Auli

Abstract: Classical translation models constrain the space of possible outputs by selecting a subset of translation rules based on the input sentence. Recent work on improving the efficiency of neural translation models adopted a similar strategy by restricting the output vocabulary to a subset of likely candidates given the source. In this paper we experiment with context and embedding-based selection methods and extend previous work by examining speed and accuracy trade-offs in more detail. We show that decoding time on CPUs can be reduced by up to 90% and training time by 25% on the WMT15 English-German and WMT16 English-Romanian tasks at the same or only negligible change in accuracy. This brings the time to decode with a state of the art neural translation system to just over 140 msec per sentence on a single CPU core for English-German.

URL: https://arxiv.org/abs/1610.00072

Notes:

Sentence Segmentation in Narrative Transcripts from Neuropsycological Tests using Recurrent Convolutional Neural Networks

Authors: Marcos Vinícius Treviso, Christopher Shulby, Sandra Maria Aluísio

Abstract: Automated discourse analysis tools based on Natural Language Processing (NLP) aiming at the diagnosis of language-impairing dementias generally extract several textual metrics of narrative transcripts. However, the absence of sentence boundary segmentation in the transcripts prevents the direct application of NLP methods which rely on these marks in order to function properly, such as taggers and parsers. We present the first steps taken towards automatic neuropsychological evaluation based on narrative discourse analysis, presenting a new automatic sentence segmentation method for impaired speech. Our model uses recurrent convolutional neural networks with prosodic, Part of Speech (PoS) features, and word embeddings. It was evaluated intrinsically on impaired, spontaneous speech as well as normal, prepared speech. The results suggest that our model is robust for impaired speech and can be used in automated discourse analysis tools to differentiate narratives produced by Mild Cognitive Impairment and healthy elderly patients.

URL: https://arxiv.org/abs/1610.00211

Notes:

Comparative study of LSA vs Word2vec embeddings in small corpora: a case study in dreams database

Authors: Edgar Altszyler, Mariano Sigman, Diego Fernández Slezak

Abstract: Word embeddings have been extensively studied in large text datasets. However, only a few studies analyze semantic representations of small corpora, particularly relevant in single-person text production studies. In the present paper, we compare Skip-gram and LSA capabilities in this scenario, and we test both techniques to extract relevant semantic patterns in single-series dreams reports. LSA showed better performance than Skip-gram in small size training corpus in two semantic tests. As a study case, we show that LSA can capture relevant words associations in dream reports series, even in cases of small number of dreams or low-frequency words. We propose that LSA can be used to explore words associations in dreams reports, which could bring new insight into this classic research area of psychology

URL: https://arxiv.org/abs/1610.01520

Notes:

Word2Vec vs DBnary: Augmenting METEOR using Vector Representations or Lexical Resources?

Authors: Christophe Servan, Alexandre Berard, Zied Elloumi, Hervé Blanchon, Laurent Besacier

Abstract: This paper presents an approach combining lexico-semantic resources and distributed representations of words applied to the evaluation in machine translation (MT). This study is made through the enrichment of a well-known MT evaluation metric: METEOR. This metric enables an approximate match (synonymy or morphological similarity) between an automatic and a reference translation. Our experiments are made in the framework of the Metrics task of WMT 2014. We show that distributed representations are a good alternative to lexico-semantic resources for MT evaluation and they can even bring interesting additional information. The augmented versions of METEOR, using vector representations, are made available on our Github page.

URL: https://arxiv.org/abs/1610.01291

Notes:

Understanding intermediate layers using linear classifier probes

Authors: Guillaume Alain, Yoshua Bengio

Abstract: Neural network models have a reputation for being black boxes. We propose a new method to understand better the roles and dynamics of the intermediate layers. This has direct consequences on the design of such models and it enables the expert to be able to justify certain heuristics (such as the auxiliary heads in the Inception model). Our method uses linear classifiers, referred to as "probes", where a probe can only use the hidden units of a given intermediate layer as discriminating features. Moreover, these probes cannot affect the training phase of a model, and they are generally added after training. They allow the user to visualize the state of the model at multiple steps of training. We demonstrate how this can be used to develop a better intuition about a known model and to diagnose potential problems.

URL: https://arxiv.org/abs/1610.01644

Notes:

Neural-based Noise Filtering from Word Embeddings

Authors: Kim Anh Nguyen, Sabine Schulte im Walde, Ngoc Thang Vu

Abstract: Word embeddings have been demonstrated to benefit NLP tasks impressively. Yet, there is room for improvement in the vector representations, because current word embeddings typically contain unnecessary information, i.e., noise. We propose two novel models to improve word embeddings by unsupervised learning, in order to yield word denoising embeddings. The word denoising embeddings are obtained by strengthening salient information and weakening noise in the original word embeddings, based on a deep feed-forward neural network filter. Results from benchmark tasks show that the filtered word denoising embeddings outperform the original word embeddings.

URL: https://arxiv.org/abs/1610.01874

Notes:

Morphology Generation for Statistical Machine Translation using Deep Learning Techniques

Authors: Marta R. Costa-jussà, Carlos Escolano

Abstract: Morphology unbalanced languages remains a big challenge in the context of machine translation. In this paper, we propose to de-couple machine translation from morphology generation in order to better deal with the problem. We investigate the morphology simplification with a reasonable trade-off between expected gain and generation complexity. For the Chinese-Spanish task, optimum morphological simplification is in gender and number. For this purpose, we design a new classification architecture which, compared to other standard machine learning techniques, obtains the best results. This proposed neural-based architecture consists of several layers: an embedding, a convolutional followed by a recurrent neural network and, finally, ends with sigmoid and softmax layers. We obtain classification results over 98% accuracy in gender classification, over 93% in number classification, and an overall translation improvement of 0.7 METEOR.

URL: https://arxiv.org/abs/1610.02209

Notes:

There's No Comparison: Reference-less Evaluation Metrics in Grammatical Error Correction

Authors: Courtney Napoles, Keisuke Sakaguchi, Joel Tetreault

Abstract: Current methods for automatically evaluating grammatical error correction (GEC) systems rely on gold-standard references. However, these methods suffer from penalizing grammatical edits that are correct but not in the gold standard. We show that reference-less grammaticality metrics correlate very strongly with human judgments and are competitive with the leading reference-based evaluation metrics. By interpolating both methods, we achieve state-of-the-art correlation with human judgments. Finally, we show that GEC metrics are much more reliable when they are calculated at the sentence level instead of the corpus level. We have set up a CodaLab site for benchmarking GEC output using a common dataset and different evaluation metrics.

URL: https://arxiv.org/abs/1610.02124

Notes:

Language Models with GloVe Word Embeddings

Authors: Victor Makarenkov, Bracha Shapira, Lior Rokach

Abstract: In this work we implement a training of a Language Model (LM), using Recurrent Neural Network (RNN) and GloVe word embeddings, introduced by Pennigton et al. in [1]. The implementation is following the general idea of training RNNs for LM tasks presented in [2], but is rather using Gated Recurrent Unit (GRU) [3] for a memory cell, and not the more commonly used LSTM [4].

URL: https://arxiv.org/abs/1610.03759

Notes:

Learning in Implicit Generative Models

Authors: Shakir Mohamed, Balaji Lakshminarayanan

Abstract: Generative adversarial networks (GANs) provide an algorithmic framework for constructing generative models with several appealing properties: they do not require a likelihood function to be specified, only a generating procedure; they provide samples that are sharp and compelling; and they allow us to harness our knowledge of building highly accurate neural network classifiers. Here, we develop our understanding of GANs with the aim of forming a rich view of this growing area of machine learning---to build connections to the diverse set of statistical thinking on this topic, of which much can be gained by a mutual exchange of ideas. We frame GANs within the wider landscape of algorithms for learning in implicit generative models--models that only specify a stochastic procedure with which to generate data--and relate these ideas to modelling problems in related fields, such as econometrics and approximate Bayesian computation. We develop likelihood-free inference methods and highlight hypothesis testing as a principle for learning in implicit generative models, using which we are able to derive the objective function used by GANs, and many other related objectives. The testing viewpoint directs our focus to the general problem of density ratio estimation. There are four approaches for density ratio estimation, one of which is a solution using classifiers to distinguish real from generated data. Other approaches such as divergence minimisation and moment matching have also been explored in the GAN literature, and we synthesise these views to form an understanding in terms of the relationships between them and the wider literature, highlighting avenues for future exploration and cross-pollination.

URL: https://arxiv.org/abs/1610.03483

Notes:

Neural Paraphrase Generation with Stacked Residual LSTM Networks

Authors: Aaditya Prakash, Sadid A. Hasan, Kathy Lee, Vivek Datla, Ashequl Qadir, Joey Liu, Oladimeji Farri

Abstract: In this paper, we propose a novel neural approach for paraphrase generation. Conventional para- phrase generation methods either leverage hand-written rules and thesauri-based alignments, or use statistical machine learning principles. To the best of our knowledge, this work is the first to explore deep learning models for paraphrase generation. Our primary contribution is a stacked residual LSTM network, where we add residual connections between LSTM layers. This allows for efficient training of deep LSTMs. We evaluate our model and other state-of-the-art deep learning models on three different datasets: PPDB, WikiAnswers and MSCOCO. Evaluation results demonstrate that our model outperforms sequence to sequence, attention-based and bi- directional LSTM models on BLEU, METEOR, TER and an embedding-based sentence similarity metric.

URL: https://arxiv.org/abs/1610.03098

Notes:

Navigational Instruction Generation as Inverse Reinforcement Learning with Neural Machine Translation

Authors: Andrea F. Daniele, Mohit Bansal, Matthew R. Walter

Abstract: Modern robotics applications that involve human-robot interaction require robots to be able to communicate with humans seamlessly and effectively. Natural language provides a flexible and efficient medium through which robots can exchange information with their human partners. Significant advancements have been made in developing robots capable of interpreting free-form instructions, but less attention has been devoted to endowing robots with the ability to generate natural language. We propose a navigational guide model that enables robots to generate natural language instructions that allow humans to navigate a priori unknown environments. We first decide which information to share with the user according to their preferences, using a policy trained from human demonstrations via inverse reinforcement learning. We then "translate" this information into a natural language instruction using a neural sequence-to-sequence model that learns to generate free-form instructions from natural language corpora. We evaluate our method on a benchmark route instruction dataset and achieve a BLEU score of 72.18% when compared to human-generated reference instructions. We additionally conduct navigation experiments with human participants that demonstrate that our method generates instructions that people follow as accurately and easily as those produced by humans.

URL: https://arxiv.org/abs/1610.03164

Notes:

From phonemes to images: levels of representation in a recurrent neural model of visually-grounded language learning

Authors: Lieke Gelderloos, Grzegorz Chrupała

Abstract: We present a model of visually-grounded language learning based on stacked gated recurrent neural networks which learns to predict visual features given an image description in the form of a sequence of phonemes. The learning task resembles that faced by human language learners who need to discover both structure and meaning from noisy and ambiguous data across modalities. We show that our model indeed learns to predict features of the visual context given phonetically transcribed image descriptions, and show that it represents linguistic information in a hierarchy of levels: lower layers in the stack are comparatively more sensitive to form, whereas higher layers are more sensitive to meaning.

URL: https://arxiv.org/abs/1610.03342

Notes:

Keystroke dynamics as signal for shallow syntactic parsing

Authors: Barbara Plank

Abstract: Keystroke dynamics have been extensively used in psycholinguistic and writing research to gain insights into cognitive processing. But do keystroke logs contain actual signal that can be used to learn better natural language processing models? We postulate that keystroke dynamics contain information about syntactic structure that can inform shallow syntactic parsing. To test this hypothesis, we explore labels derived from keystroke logs as auxiliary task in a multi-task bidirectional Long Short-Term Memory (bi-LSTM). Our results show promising results on two shallow syntactic parsing tasks, chunking and CCG supertagging. Our model is simple, has the advantage that data can come from distinct sources, and produces models that are significantly better than models trained on the text annotations alone.

URL: https://arxiv.org/abs/1610.03321

Notes:

An Empirical Exploration of Skip Connections for Sequential Tagging

Authors: Huijia Wu, Jiajun Zhang, Chengqing Zong

Abstract: In this paper, we empirically explore the effects of various kinds of skip connections in stacked bidirectional LSTMs for sequential tagging. We investigate three kinds of skip connections connecting to LSTM cells: (a) skip connections to the gates, (b) skip connections to the internal states and (c) skip connections to the cell outputs. We present comprehensive experiments showing that skip connections to cell outputs outperform the remaining two. Furthermore, we observe that using gated identity functions as skip mappings works pretty well. Based on this novel skip connections, we successfully train deep stacked bidirectional LSTM models and obtain state-of-the-art results on CCG supertagging and comparable results on POS tagging.

URL: https://arxiv.org/abs/1610.03167

Notes:

GMM-Free Flat Start Sequence-Discriminative DNN Training

Authors: Gábor Gosztolya, Tamás Grósz, László Tóth

Abstract: Recently, attempts have been made to remove Gaussian mixture models (GMM) from the training process of deep neural network-based hidden Markov models (HMM/DNN). For the GMM-free training of a HMM/DNN hybrid we have to solve two problems, namely the initial alignment of the frame-level state labels and the creation of context-dependent states. Although flat-start training via iteratively realigning and retraining the DNN using a frame-level error function is viable, it is quite cumbersome. Here, we propose to use a sequence-discriminative training criterion for flat start. While sequence-discriminative training is routinely applied only in the final phase of model training, we show that with proper caution it is also suitable for getting an alignment of context-independent DNN models. For the construction of tied states we apply a recently proposed KL-divergence-based state clustering method, hence our whole training process is GMM-free. In the experimental evaluation we found that the sequence-discriminative flat start training method is not only significantly faster than the straightforward approach of iterative retraining and realignment, but the word error rates attained are slightly better as well.

URL: https://arxiv.org/abs/1610.03256

Notes:

Leveraging Recurrent Neural Networks for Multimodal Recognition of Social Norm Violation in Dialog

Authors: Tiancheng Zhao, Ran Zhao, Zhao Meng, Justine Cassell

Abstract: Social norms are shared rules that govern and facilitate social interaction. Violating such social norms via teasing and insults may serve to upend power imbalances or, on the contrary reinforce solidarity and rapport in conversation, rapport which is highly situated and context-dependent. In this work, we investigate the task of automatically identifying the phenomena of social norm violation in discourse. Towards this goal, we leverage the power of recurrent neural networks and multimodal information present in the interaction, and propose a predictive model to recognize social norm violation. Using long-term temporal and contextual information, our model achieves an F1 score of 0.705. Implications of our work regarding developing a social-aware agent are discussed.

URL: https://arxiv.org/abs/1610.03112

Notes:

Long Short-Term Memory based Convolutional Recurrent Neural Networks for Large Vocabulary Speech Recognition

Authors: Xiangang Li, Xihong Wu

Abstract: Long short-term memory (LSTM) recurrent neural networks (RNNs) have been shown to give state-of-the-art performance on many speech recognition tasks, as they are able to provide the learned dynamically changing contextual window of all sequence history. On the other hand, the convolutional neural networks (CNNs) have brought significant improvements to deep feed-forward neural networks (FFNNs), as they are able to better reduce spectral variation in the input signal. In this paper, a network architecture called as convolutional recurrent neural network (CRNN) is proposed by combining the CNN and LSTM RNN. In the proposed CRNNs, each speech frame, without adjacent context frames, is organized as a number of local feature patches along the frequency axis, and then a LSTM network is performed on each feature patch along the time axis. We train and compare FFNNs, LSTM RNNs and the proposed LSTM CRNNs at various number of configurations. Experimental results show that the LSTM CRNNs can exceed state-of-the-art speech recognition performance.

URL: https://arxiv.org/abs/1610.03165

Notes:

A Language-independent and Compositional Model for Personality Trait Recognition from Short Texts

Authors: Fei Liu, Julien Perez, Scott Nowson

Abstract: Many methods have been used to recognize author personality traits from text, typically combining linguistic feature engineering with shallow learning models, e.g. linear regression or Support Vector Machines. This work uses deep-learning-based models and atomic features of text, the characters, to build hierarchical, vectorial word and sentence representations for trait inference. This method, applied to a corpus of tweets, shows state-of-the-art performance across five traits and three languages (English, Spanish and Italian) compared with prior work in author profiling. The results, supported by preliminary visualisation work, are encouraging for the ability to detect complex human traits.

URL: https://arxiv.org/abs/1610.04345

Notes: похоже на PersonRNN

Pre-Translation for Neural Machine Translation

Authors: Jan Niehues, Eunah Cho, Thanh-Le Ha, Alex Waibel

Abstract: Recently, the development of neural machine translation (NMT) has significantly improved the translation quality of automatic machine translation. While most sentences are more accurate and fluent than translations by statistical machine translation (SMT)-based systems, in some cases, the NMT system produces translations that have a completely different meaning. This is especially the case when rare words occur. When using statistical machine translation, it has already been shown that significant gains can be achieved by simplifying the input in a preprocessing step. A commonly used example is the pre-reordering approach. In this work, we used phrase-based machine translation to pre-translate the input into the target language. Then a neural machine translation system generates the final hypothesis using the pre-translation. Thereby, we use either only the output of the phrase-based machine translation (PBMT) system or a combination of the PBMT output and the source sentence. We evaluate the technique on the English to German translation task. Using this approach we are able to outperform the PBMT system as well as the baseline neural MT system by up to 2 BLEU points. We analyzed the influence of the quality of the initial system on the final result.

URL: https://arxiv.org/abs/1610.05243

Notes:

Neural Machine Translation Advised by Statistical Machine Translation

Authors: Xing Wang, Zhengdong Lu, Zhaopeng Tu, Hang Li, Deyi Xiong, Min Zhang

Abstract: Neural Machine Translation (NMT) is a new approach to machine translation that has made great progress in recent years. However, recent studies show that NMT generally produces fluent but inadequate translations (Tu et al. 2016; He et al. 2016). This is in contrast to conventional Statistical Machine Translation (SMT), which usually yields adequate but non-fluent translations. It is natural, therefore, to leverage the advantages of both models for better translations, and in this work we propose to incorporate SMT model into NMT framework. More specifically, at each decoding step, SMT offers additional recommendations of generated words based on the decoding information from NMT (e.g., the generated partial translation and attention history). Then we employ an auxiliary classifier to score the SMT recommendations and a gating function to combine the SMT recommendations with NMT generations, both of which are jointly trained within the NMT architecture in an end-to-end manner. Experimental results on Chinese-English translation show that the proposed approach achieves significant and consistent improvements over state-of the-art NMT and SMT systems on multiple NIST test sets.

URL: https://arxiv.org/abs/1610.05150

Notes:

Interactive Attention for Neural Machine Translation

Authors: Fandong Meng, Zhengdong Lu, Hang Li, Qun Liu

Abstract: Conventional attention-based Neural Machine Translation (NMT) conducts dynamic alignment in generating the target sentence. By repeatedly reading the representation of source sentence, which keeps fixed after generated by the encoder (Bahdanau et al., 2015), the attention mechanism has greatly enhanced state-of-the-art NMT. In this paper, we propose a new attention mechanism, called INTERACTIVE ATTENTION, which models the interaction between the decoder and the representation of source sentence during translation by both reading and writing operations. INTERACTIVE ATTENTION can keep track of the interaction history and therefore improve the translation performance. Experiments on NIST Chinese-English translation task show that INTERACTIVE ATTENTION can achieve significant improvements over both the previous attention-based NMT baseline and some state-of-the-art variants of attention-based NMT (i.e., coverage models (Tu et al., 2016)). And neural machine translator with our INTERACTIVE ATTENTION can outperform the open source attention-based NMT system Groundhog by 4.22 BLEU points and the open source phrase-based system Moses by 3.94 BLEU points averagely on multiple test sets.

URL: https://arxiv.org/abs/1610.05011

Notes:

Translation Quality Estimation using Recurrent Neural Network

Authors: Raj Nath Patel, Sasikumar M

Abstract: This paper describes our submission to the shared task on word/phrase level Quality Estimation (QE) in the First Conference on Statistical Machine Translation (WMT16). The objective of the shared task was to predict if the given word/phrase is a correct/incorrect (OK/BAD) translation in the given sentence. In this paper, we propose a novel approach for word level Quality Estimation using Recurrent Neural Network Language Model (RNN-LM) architecture. RNN-LMs have been found very effective in different Natural Language Processing (NLP) applications. RNN-LM is mainly used for vector space language modeling for different NLP problems. For this task, we modify the architecture of RNN-LM. The modified system predicts a label (OK/BAD) in the slot rather than predicting the word. The input to the system is a word sequence, similar to the standard RNN-LM. The approach is language independent and requires only the translated text for QE. To estimate the phrase level quality, we use the output of the word level QE system.

URL: https://arxiv.org/abs/1610.04841

Notes:

Cached Long Short-Term Memory Neural Networks for Document-Level Sentiment Classification

Authors: Jiacheng Xu, Danlu Chen, Xipeng Qiu, Xuangjing Huang

Abstract: Recently, neural networks have achieved great success on sentiment classification due to their ability to alleviate feature engineering. However, one of the remaining challenges is to model long texts in document-level sentiment classification under a recurrent architecture because of the deficiency of the memory unit. To address this problem, we present a Cached Long Short-Term Memory neural networks (CLSTM) to capture the overall semantic information in long texts. CLSTM introduces a cache mechanism, which divides memory into several groups with different forgetting rates and thus enables the network to keep sentiment information better within a recurrent unit. The proposed CLSTM outperforms the state-of-the-art models on three publicly available document-level sentiment analysis datasets.

URL: https://arxiv.org/abs/1610.04989

Notes:

Simultaneous Learning of Trees and Representations for Extreme Classification, with Application to Language Modeling

Authors: Yacine Jernite, Anna Choromanska, David Sontag, Yann LeCun

Abstract: This paper addresses the problem of multi-class classification with an extremely large number of classes, where the class predictor is learned jointly with the data representation, as is the case in language modeling problems. The predictor admits a hierarchical structure, which allows for efficient handling of settings that deal with a very large number of labels. The predictive power of the model however can heavily depend on the structure of the tree. We address this problem with an algorithm for tree construction and training that is based on a new objective function which favors balanced and easily-separable node partitions. We describe theoretical properties of this objective function and show that it gives rise to a boosting algorithm for which we provide a bound on classification error, i.e. we show that if the objective is weakly optimized in the internal nodes of the tree, then our algorithm will amplify this weak advantage to build a tree achieving any desired level of accuracy. We apply the algorithm to the task of language modeling by re-framing conditional density estimation as a variant of the hierarchical classification problem. We empirically demonstrate on text data that the proposed approach leads to high-quality trees in terms of perplexity and computational running time compared to its non-hierarchical counterpart.

URL: https://arxiv.org/abs/1610.04658

Notes:

Towards K-means-friendly Spaces: Simultaneous Deep Learning and Clustering

Authors: Bo Yang, Xiao Fu, Nicholas D. Sidiropoulos, Mingyi Hong

Abstract: Most learning approaches treat dimensionality reduction (DR) and clustering separately (i.e., sequentially), but recent research has shown that optimizing the two tasks jointly can substantially improve the performance of both. The premise behind the latter genre is that the data samples are obtained via linear transformation of latent representations that are easy to cluster; but in practice, the transformation from the latent space to the data can be more complicated. In this work, we assume that this transformation is an unknown and possibly nonlinear function. To recover the 'clustering-friendly' latent representations and to better cluster the data, we propose a joint DR and K-means clustering approach in which DR is accomplished via learning a deep neural network (DNN). The motivation is to keep the advantages of jointly optimizing the two tasks, while exploiting the deep neural network's ability to approximate any nonlinear function. This way, the proposed approach can work well for a broad class of generative models. Towards this end, we carefully design the DNN structure and the associated joint optimization criterion, and propose an effective and scalable algorithm to handle the formulated optimization problem. Experiments using five different real datasets are employed to showcase the effectiveness of the proposed approach.

URL: https://arxiv.org/abs/1610.04794

Notes:

Reasoning with Memory Augmented Neural Networks for Language Comprehension

Authors: Tsendsuren Munkhdalai, Hong Yu

Abstract: Hypothesis testing is an important cognitive process that supports human reasoning. In this paper, we introduce a computational hypothesis testing approach based on memory augmented neural networks. Our approach involves a hypothesis testing loop that reconsiders and progressively refines a previously formed hypothesis in order to generate new hypotheses to test. We apply the proposed approach to language comprehension task by using Neural Semantic Encoders (NSE). Our NSE models achieve the state-of-the-art results showing an absolute improvement of 1.2% to 2.6% accuracy over previous results obtained by single and ensemble systems on standard machine comprehension benchmarks such as the Children's Book Test (CBT) and Who-Did-What (WDW) news article datasets.

URL: https://arxiv.org/abs/1610.06454

Notes:

Jointly Learning to Align and Convert Graphemes to Phonemes with Neural Attention Models

Authors: Shubham Toshniwal, Karen Livescu

Abstract: We propose an attention-enabled encoder-decoder model for the problem of grapheme-to-phoneme conversion. Most previous work has tackled the problem via joint sequence models that require explicit alignments for training. In contrast, the attention-enabled encoder-decoder model allows for jointly learning to align and convert characters to phonemes. We explore different types of attention models, including global and local attention, and our best models achieve state-of-the-art results on three standard data sets (CMUDict, Pronlex, and NetTalk).

URL: https://arxiv.org/abs/1610.06540

Notes:

Lexicon Integrated CNN Models with Attention for Sentiment Analysis

Authors: Bonggun Shin, Timothy Lee, Jinho D. Choi

Abstract: With the advent of word embeddings, lexicons are no longer fully utilized for sentiment analysis although they still provide important features in the traditional setting. This paper introduces a novel approach to sentiment analysis that integrates lexicon embeddings and an attention mechanism into Convolutional Neural Networks. Our approach performs separate convolutions for word and lexicon embeddings and provides a global view of the document using attention. Our models are experimented on both the SemEval'16 Task 4 dataset and the Stanford Sentiment Treebank, and show comparative or better results against the existing state-of-the-art systems. Our analysis shows that lexicon embeddings allow to build high-performing models with much smaller word embeddings, and the attention mechanism effectively dims out noisy words for sentiment analysis.

URL: https://arxiv.org/abs/1610.06272

Notes:

Neural Machine Translation with Characters and Hierarchical Encoding

Authors: Alexander Rosenberg Johansen, Jonas Meinertz Hansen, Elias Khazen Obeid, Casper Kaae Sønderby, Ole Winther

Abstract: Most existing Neural Machine Translation models use groups of characters or whole words as their unit of input and output. We propose a model with a hierarchical char2word encoder, that takes individual characters both as input and output. We first argue that this hierarchical representation of the character encoder reduces computational complexity, and show that it improves translation performance. Secondly, by qualitatively studying attention plots from the decoder we find that the model learns to compress common words into a single embedding whereas rare words, such as names and places, are represented character by character.

URL: https://arxiv.org/abs/1610.06550

Notes:

Lexicons and Minimum Risk Training for Neural Machine Translation: NAIST-CMU at WAT2016

Authors: Graham Neubig

Abstract: This year, the Nara Institute of Science and Technology (NAIST)/Carnegie Mellon University (CMU) submission to the Japanese-English translation track of the 2016 Workshop on Asian Translation was based on attentional neural machine translation (NMT) models. In addition to the standard NMT model, we make a number of improvements, most notably the use of discrete translation lexicons to improve probability estimates, and the use of minimum risk training to optimize the MT system for BLEU score. As a result, our system achieved the highest translation evaluation scores for the task.

URL: https://arxiv.org/abs/1610.06542

Notes:

Clinical Text Prediction with Numerically Grounded Conditional Language Models

Authors: Georgios P. Spithourakis, Steffen E. Petersen, Sebastian Riedel

Abstract: Assisted text input techniques can save time and effort and improve text quality. In this paper, we investigate how grounded and conditional extensions to standard neural language models can bring improvements in the tasks of word prediction and completion. These extensions incorporate a structured knowledge base and numerical values from the text into the context used to predict the next word. Our automated evaluation on a clinical dataset shows extended models significantly outperform standard models. Our best system uses both conditioning and grounding, because of their orthogonal benefits. For word prediction with a list of 5 suggestions, it improves recall from 25.03% to 71.28% and for word completion it improves keystroke savings from 34.35% to 44.81%, where theoretical bound for this dataset is 58.78%. We also perform a qualitative investigation of how models with lower perplexity occasionally fare better at the tasks. We found that at test time numbers have more influence on the document level than on individual word probabilities.

URL: https://arxiv.org/abs/1610.06370

Notes:

Using Fast Weights to Attend to the Recent Past

Authors: Jimmy Ba, Geoffrey Hinton, Volodymyr Mnih, Joel Z. Leibo, Catalin Ionescu

Abstract: Until recently, research on artificial neural networks was largely restricted to systems with only two types of variable: Neural activities that represent the current or recent input and weights that learn to capture regularities among inputs, outputs and payoffs. There is no good reason for this restriction. Synapses have dynamics at many different time-scales and this suggests that artificial neural networks might benefit from variables that change slower than activities but much faster than the standard weights. These "fast weights" can be used to store temporary memories of the recent past and they provide a neurally plausible way of implementing the type of attention to the past that has recently proved very helpful in sequence-to-sequence models. By using fast weights we can avoid the need to store copies of neural activity patterns.

URL: https://arxiv.org/abs/1610.06258

Notes:

Socratic Learning

Authors: Rose Yu, Paroma Varma, Dan Iter, Christopher De Sa, Christopher Ré

Abstract: Modern machine learning techniques, such as deep learning, often use discriminative models that require large amounts of labeled data. An alternative approach is to use a generative model, which leverages heuristics from domain experts to train on unlabeled data. Domain experts often prefer to use generative models because they "tell a story" about their data. Unfortunately, generative models are typically less accurate than discriminative models. Several recent approaches combine both types of model to exploit their strengths. In this setting, a misspecified generative model can hurt the performance of subsequent discriminative training. To address this issue, we propose a framework called Socratic learning that automatically uses information from the discriminative model to correct generative model misspecification. Furthermore, this process provides users with interpretable feedback about how to improve their generative model. We evaluate Socratic learning on real-world relation extraction tasks and observe an immediate improvement in classification accuracy that could otherwise require several weeks of effort by domain experts.

URL: https://arxiv.org/abs/1610.08123

Notes: Интересный подход к комбинации генеративных и дискриминативных моделей.

Distraction-Based Neural Networks for Document Summarization

Authors: Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, Hui Jiang

Abstract: Distributed representation learned with neural networks has recently shown to be effective in modeling natural languages at fine granularities such as words, phrases, and even sentences. Whether and how such an approach can be extended to help model larger spans of text, e.g., documents, is intriguing, and further investigation would still be desirable. This paper aims to enhance neural network models for such a purpose. A typical problem of document-level modeling is automatic summarization, which aims to model documents in order to generate summaries. In this paper, we propose neural models to train computers not just to pay attention to specific regions and content of input documents with attention models, but also distract them to traverse between different content of a document so as to better grasp the overall meaning for summarization. Without engineering any features, we train the models on two large datasets. The models achieve the state-of-the-art performance, and they significantly benefit from the distraction modeling, particularly when input documents are long.

URL: https://arxiv.org/abs/1610.08462

Notes: Авторы предлагают не просто склеивать кусочки текста оригинального документа, но добиваться плавного перехода между ними, засчет чего получать осмысленную суммаризацию.

Broad Context Language Modeling as Reading Comprehension

Authors: Zewei Chu, Hai Wang, Kevin Gimpel, David McAllester

Abstract: Progress in text understanding has been driven by the availability of large datasets that test particular capabilities, like recent datasets for assessing reading comprehension. We focus here on the LAMBADA dataset, a word prediction task requiring broader context than the immediate sentence. We view the LAMBADA task as a reading comprehension problem and apply off-the-shelf comprehension models based on neural networks. Though these models are constrained to choose a word from the context, they improve the state of the art on LAMBADA from 7.3% to 45.4%. We analyze 100 instances, finding that neural network readers perform well in cases that involve selecting a name from the context based on dialogue or discourse cues but struggle when coreference resolution or external knowledge is needed.

URL: https://arxiv.org/abs/1610.08431

Notes: Анализируют проблемы кореференции для нейромоделей.

Word Embeddings and Their Use In Sentence Classification Tasks

Authors: Amit Mandelbaum, Adi Shalev

Abstract: This paper have two parts. In the first part we discuss word embeddings. We discuss the need for them, some of the methods to create them, and some of their interesting properties. We also compare them to image embeddings and see how word embedding and image embedding can be combined to perform different tasks. In the second part we implement a convolutional neural network trained on top of pre-trained word vectors. The network is used for several sentence-level classification tasks, and achieves state-of-art (or comparable) results, demonstrating the great power of pre-trainted word embeddings over random ones.

URL: https://arxiv.org/abs/1610.08229

Notes: if pretrained word vectores are better that random in sentence classification (spoiler: YES)

A Deeper Look into Sarcastic Tweets Using Deep Convolutional Neural Networks

Authors: Soujanya Poria, Erik Cambria, Devamanyu Hazarika, Prateek Vij

Abstract: Sarcasm detection is a key task for many natural language processing tasks. In sentiment analysis, for example, sarcasm can flip the polarity of an "apparently positive" sentence and, hence, negatively affect polarity detection performance. To date, most approaches to sarcasm detection have treated the task primarily as a text categorization problem. Sarcasm, however, can be expressed in very subtle ways and requires a deeper understanding of natural language that standard text categorization techniques cannot grasp. In this work, we develop models based on a pre-trained convolutional neural network for extracting sentiment, emotion and personality features for sarcasm detection. Such features, along with the network's baseline features, allow the proposed models to outperform the state of the art on benchmark datasets. We also address the often ignored generalizability issue of classifying data that have not been seen by the models at learning phase.

URL: https://arxiv.org/abs/1610.08815

Notes: Больше про сарказм, теперь на сверточных сетях

Word Embeddings to Enhance Twitter Gang Member Profile Identification

Authors: Sanjaya Wijeratne, Lakshika Balasuriya, Derek Doran, Amit Sheth

Abstract: Gang affiliates have joined the masses who use social media to share thoughts and actions publicly. Interestingly, they use this public medium to express recent illegal actions, to intimidate others, and to share outrageous images and statements. Agencies able to unearth these profiles may thus be able to anticipate, stop, or hasten the investigation of gang-related crimes. This paper investigates the use of word embeddings to help identify gang members on Twitter. Building on our previous work, we generate word embeddings that translate what Twitter users post in their profile descriptions, tweets, profile images, and linked YouTube content to a real vector format amenable for machine learning classification. Our experimental results show that pre-trained word embeddings can boost the accuracy of supervised learning algorithms trained over gang members social media posts.

URL: https://arxiv.org/abs/1610.08597

Notes: Классификация пользователей твиттера по генерируемому контекту, видео, профилям c помощью эмбеддингов.

Professor Forcing: A New Algorithm for Training Recurrent Networks

Authors: Alex Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron Courville, Yoshua Bengio

Abstract: The Teacher Forcing algorithm trains recurrent networks by supplying observed sequence values as inputs during training and using the network's own one-step-ahead predictions to do multi-step sampling. We introduce the Professor Forcing algorithm, which uses adversarial domain adaptation to encourage the dynamics of the recurrent network to be the same when training the network and when sampling from the network over multiple time steps. We apply Professor Forcing to language modeling, vocal synthesis on raw waveforms, handwriting generation, and image generation. Empirically we find that Professor Forcing acts as a regularizer, improving test likelihood on character level Penn Treebank and sequential MNIST. We also find that the model qualitatively improves samples, especially when sampling for a large number of time steps. This is supported by human evaluation of sample quality. Trade-offs between Professor Forcing and Scheduled Sampling are discussed. We produce T-SNEs showing that Professor Forcing successfully makes the dynamics of the network during training and sampling more similar.

URL: https://arxiv.org/abs/1610.09038

Notes: Развитие идеи Scheduled Sampling

Scaling Memory-Augmented Neural Networks with Sparse Reads and Writes

Authors: Jack W Rae, Jonathan J Hunt, Tim Harley, Ivo Danihelka, Andrew Senior, Greg Wayne, Alex Graves, Timothy P Lillicrap

Abstract: Neural networks augmented with external memory have the ability to learn algorithmic solutions to complex tasks. These models appear promising for applications such as language modeling and machine translation. However, they scale poorly in both space and time as the amount of memory grows —- limiting their applicability to real-world domains. Here, we present an end-to-end differentiable memory access scheme, which we call Sparse Access Memory (SAM), that retains the representational power of the original approaches whilst training efficiently with very large memories. We show that SAM achieves asymptotic lower bounds in space and time complexity, and find that an implementation runs 1,000× faster and with 3,000× less physical memory than non-sparse models. SAM learns with comparable data efficiency to existing models on a range of synthetic tasks and one-shot Omniglot character recognition, and can scale to tasks requiring 100,000s of time steps and memories. As well, we show how our approach can be adapted for models that maintain temporal associations between memories, as with the recently introduced Differentiable Neural Computer.

URL: https://arxiv.org/abs/1610.09027

Notes: Новый подход к памяти, нужно разобраться дальше, но это похоже на т.н. рабочую память у человека.

Representation Learning Models for Entity Search

Authors: Shijia E, Yang Xiang, Mohan Zhang

Abstract: We focus on the problem of learning distributed representations for entity search queries, named entities, and their short descriptions. With our representation learning models, the entity search query, named entity and description can be represented as low-dimensional vectors. Our goal is to develop a simple but effective model that can make the distributed representations of query related entities similar to the query in the vector space. Hence, we propose three kinds of learning strategies, and the difference between them mainly lies in how to deal with the relationship between an entity and its description. We analyze the strengths and weaknesses of each learning strategy and validate our methods on public datasets which contain four kinds of named entities, i.e., movies, TV shows, restaurants and celebrities. The experimental results indicate that our proposed methods can adapt to different types of entity search queries, and outperform the current state-of-the-art methods based on keyword matching and vanilla word2vec models. Besides, the proposed methods can be trained fast and be easily extended to other similar tasks.

URL: https://arxiv.org/abs/1610.09091

Notes: Сравнение разных подходов к созданию векторых представлений для entities.

RNN Approaches to Text Normalization: A Challenge

Authors: Richard Sproat, Navdeep Jaitly

Abstract: This paper presents a challenge to the community: given a large corpus of written text aligned to its normalized spoken form, train an RNN to learn the correct normalization function. We present a data set of general text where the normalizations were generated using an existing text normalization component of a text-to-speech system. This data set will be released open-source in the near future. We also present our own experiments with this data set with a variety of different RNN architectures. While some of the architectures do in fact produce very good results when measured in terms of overall accuracy, the errors that are produced are problematic, since they would convey completely the wrong message if such a system were deployed in a speech application. On the other hand, we show that a simple FST-based filter can mitigate those errors, and achieve a level of accuracy not achievable by the RNN alone. Though our conclusions are largely negative on this point, we are actually not arguing that the text normalization problem is intractable using an pure RNN approach, merely that it is not going to be something that can be solved merely by having huge amounts of annotated text data and feeding that to a general RNN model. And when we open-source our data, we will be providing a novel data set for sequence-to-sequence modeling in the hopes that the the community can find better solutions.

URL: https://arxiv.org/abs/1611.00068

Notes: Может быть интересно в рамках нашей идеи о произношении и WaveNet для текстов

Recurrent Neural Network Language Model Adaptation Derived Document Vector

Authors: Wei Li, Brian Kan, Wing Mak

Abstract: In many natural language processing (NLP) tasks, a document is commonly modeled as a bag of words using the term frequency-inverse document frequency (TF-IDF) vector. One major shortcoming of the frequency-based TF-IDF feature vector is that it ignores word orders that carry syntactic and semantic relationships among the words in a document, and they can be important in some NLP tasks such as genre classification. This paper proposes a novel distributed vector representation of a document: a simple recurrent-neural-network language model (RNN-LM) or a long short-term memory RNN language model (LSTM-LM) is first created from all documents in a task; some of the LM parameters are then adapted by each document, and the adapted parameters are vectorized to represent the document. The new document vectors are labeled as DV-RNN and DV-LSTM respectively. We believe that our new document vectors can capture some high-level sequential information in the documents, which other current document representations fail to capture. The new document vectors were evaluated in the genre classification of documents in three corpora: the Brown Corpus, the BNC Baby Corpus and an artificially created Penn Treebank dataset. Their classification performances are compared with the performance of TF-IDF vector and the state-of-the-art distributed memory model of paragraph vector (PV-DM). The results show that DV-LSTM significantly outperforms TF-IDF and PV-DM in most cases, and combinations of the proposed document vectors with TF-IDF or PV-DM may further improve performance.

URL: https://arxiv.org/abs/1611.00196

Notes: Попытка уйти от BoW модели c помощью LSTM и делать хорошие эмбеддинги документов.

Dual Learning for Machine Translation

Authors: Yingce Xia, Di He, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, Wei-Ying Ma

Abstract: While neural machine translation (NMT) is making good progress in the past two years, tens of millions of bilingual sentence pairs are needed for its training. However, human labeling is very costly. To tackle this training data bottleneck, we develop a dual-learning mechanism, which can enable an NMT system to automatically learn from unlabeled data through a dual-learning game. This mechanism is inspired by the following observation: any machine translation task has a dual task, e.g., English-to-French translation (primal) versus French-to-English translation (dual); the primal and dual tasks can form a closed loop, and generate informative feedback signals to train the translation models, even if without the involvement of a human labeler. In the dual-learning mechanism, we use one agent to represent the model for the primal task and the other agent to represent the model for the dual task, then ask them to teach each other through a reinforcement learning process. Based on the feedback signals generated during this process (e.g., the language-model likelihood of the output of a model, and the reconstruction error of the original sentence after the primal and dual translations), we can iteratively update the two models until convergence (e.g., using the policy gradient methods). We call the corresponding approach to neural machine translation \emph{dual-NMT}. Experiments show that dual-NMT works very well on English↔French translation; especially, by learning from monolingual data (with 10% bilingual data for warm start), it achieves a comparable accuracy to NMT trained from the full bilingual data for the French-to-English translation task.

URL: https://arxiv.org/abs/1611.00179

Notes: Применение идеи самообучения к NMT.

Improving Twitter Sentiment Classification via Multi-Level Sentiment-Enriched Word Embeddings

Authors: Shufeng Xiong

Abstract: Most of existing work learn sentiment-specific word representation for improving Twitter sentiment classification, which encoded both n-gram and distant supervised tweet sentiment information in learning process. They assume all words within a tweet have the same sentiment polarity as the whole tweet, which ignores the word its own sentiment polarity. To address this problem, we propose to learn sentiment-specific word embedding by exploiting both lexicon resource and distant supervised information. We develop a multi-level sentiment-enriched word embedding learning method, which uses parallel asymmetric neural network to model n-gram, word level sentiment and tweet level sentiment in learning process. Experiments on standard benchmarks show our approach outperforms state-of-the-art methods.

URL: https://arxiv.org/abs/1611.00126

Notes:

Deep Model Compression: Distilling Knowledge from Noisy Teachers

Authors: Bharat Bhusan Sau, Vineeth N. Balasubramanian

Abstract: The remarkable successes of deep learning models across various applications have resulted in the design of deeper networks that can solve complex problems. However, the increasing depth of such models also results in a higher storage and runtime complexity, which restricts the deployability of such very deep models on mobile and portable devices, which have limited storage and battery capacity. While many methods have been proposed for deep model compression in recent years, almost all of them have focused on reducing storage complexity. In this work, we extend the teacher-student framework for deep model compression, since it has the potential to address runtime and train time complexity too. We propose a simple methodology to include a noise-based regularizer while training the student from the teacher, which provides a healthy improvement in the performance of the student network. Our experiments on the CIFAR-10, SVHN and MNIST datasets show promising improvement, with the best performance on the CIFAR-10 dataset. We also conduct a comprehensive empirical evaluation of the proposed method under related settings on the CIFAR-10 dataset to show the promise of the proposed approach.

URL: https://arxiv.org/abs/1610.09650

Notes: Добавление шума в обучение c дистилляцией знаний. ME: Обучение c учителем и шумный environment.

Neural Machine Translation in Linear Time

Authors: Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, Koray Kavukcuoglu

Abstract: We present a neural architecture for sequence processing. The ByteNet is a stack of two dilated convolutional neural networks, one to encode the source sequence and one to decode the target sequence, where the target network unfolds dynamically to generate variable length outputs. The ByteNet has two core properties: it runs in time that is linear in the length of the sequences and it preserves the sequences' temporal resolution. The ByteNet decoder attains state-of-the-art performance on character-level language modelling and outperforms the previous best results obtained with recurrent neural networks. The ByteNet also achieves a performance on raw character-level machine translation that approaches that of the best neural translation models that run in quadratic time. The implicit structure learnt by the ByteNet mirrors the expected alignments between the sequences.

URL: https://arxiv.org/abs/1610.10099

Notes: Ускорение засчет того, что оптимизируем до байтов. Последователи WaveNet.

Sequence-to-sequence neural network models for transliteration

Authors: Mihaela Rosca, Thomas Breuel

Abstract: Transliteration is a key component of machine translation systems and software internationalization. This paper demonstrates that neural sequence-to-sequence models obtain state of the art or close to state of the art results on existing datasets. In an effort to make machine transliteration accessible, we open source a new Arabic to English transliteration dataset and our trained models.

URL: https://arxiv.org/abs/1610.09565

Notes: Армяне делали что-то подобное.

2016-11

Detecting Context Dependent Messages in a Conversational Environment

Authors: Chaozhuo Li, Yu Wu, Wei Wu, Chen Xing, Zhoujun Li, Ming Zhou

Abstract: While automatic response generation for building chatbot systems has drawn a lot of attention recently, there is limited understanding on when we need to consider the linguistic context of an input text in the generation process. The task is challenging, as messages in a conversational environment are short and informal, and evidence that can indicate a message is context dependent is scarce. After a study of social conversation data crawled from the web, we observed that some characteristics estimated from the responses of messages are discriminative for identifying context dependent messages. With the characteristics as weak supervision, we propose using a Long Short Term Memory (LSTM) network to learn a classifier. Our method carries out text representation and classifier learning in a unified framework. Experimental results show that the proposed method can significantly outperform baseline methods on accuracy of classification.

URL: https://arxiv.org/abs/1611.00483

Notes: Используют контекст для классификации сообщений при генерации ответов.

Ordinal Common-sense Inference

Authors: Sheng Zhang, Rachel Rudinger, Kevin Duh, Benjamin Van Durme

Abstract: Humans have the capacity to draw common-sense inferences from natural language: various things that are likely but not certain to hold based on established discourse, and are rarely stated explicitly. We propose an evaluation of automated common-sense inference based on an extension of recognizing textual entailment: predicting ordinal human responses of subjective likelihood of an inference holding in a given context. We describe a framework for extracting common-sense knowledge for corpora, which is then used to construct a dataset for this ordinal entailment task, which we then use to train and evaluate a sequence to sequence neural network model. Further, we annotate subsets of previously established datasets via our ordinal annotation protocol in order to then analyze the distinctions between these and what we have constructed.

URL: https://arxiv.org/abs/1611.00601

Notes: Люди пытаются учиться делать выводы и проверять себя, все на естественном языке.

Unsupervised Learning of Sentence Representations using Convolutional Neural Networks

Authors: Zhe Gan, Yunchen Pu, Ricardo Henao, Chunyuan Li, Xiaodong He, Lawrence Carin

Abstract: We propose a new encoder-decoder approach to learn distributed sentence representations from unlabeled sentences. The word-to-vector representation is used, and convolutional neural networks are employed as sentence encoders, mapping an input sentence into a fixed-length vector. This representation is decoded using long short-term memory recurrent neural networks, considering several tasks, such as reconstructing the input sentence, or predicting the future sentence. We further describe a hierarchical encoder-decoder model to encode a sentence to predict multiple future sentences. By training our models on a large collection of novels, we obtain a highly generic convolutional sentence encoder that performs well in practice. Experimental results on several benchmark datasets, and across a broad range of applications, demonstrate the superiority of the proposed model over competing methods.

URL: https://arxiv.org/abs/1611.07897

Notes: Надо попробовать, возможно в виде вариации архитектуры Дениса

Learning a Natural Language Interface with Neural Programmer

Authors: Arvind Neelakantan, Quoc V. Le, Martin Abadi, Andrew McCallum, Dario Amodei

Abstract: Learning a natural language interface for database tables is a challenging task that involves deep language understanding and multi-step reasoning. The task is often approached by mapping natural language queries to logical forms or programs that provide the desired response when executed on the database. To our knowledge, this paper presents the first weakly supervised, end-to-end neural network model to induce such programs on a real-world dataset. We enhance the objective function of Neural Programmer, a neural network with built-in discrete operations, and apply it on WikiTableQuestions, a natural language question-answering dataset. The model is trained end-to-end with weak supervision of question-answer pairs, and does not require domain-specific grammars, rules, or annotations that are key elements in previous approaches to program induction. The main experimental result in this paper is that a single Neural Programmer model achieves 34.2% accuracy using only 10,000 examples with weak supervision. An ensemble of 15 models, with a trivial combination technique, achieves 37.2% accuracy, which is competitive to the current state-of-the-art accuracy of 37.1% obtained by a traditional natural language semantic parser.

URL: https://arxiv.org/abs/1611.08945

Notes: На замену слотовым диалоговым системам

Attention-based Memory Selection Recurrent Network for Language Modeling

Authors: Da-Rong Liu, Shun-Po Chuang, Hung-yi Lee

Abstract: Recurrent neural networks (RNNs) have achieved great success in language modeling. However, since the RNNs have fixed size of memory, their memory cannot store all the information about the words it have seen before in the sentence, and thus the useful long-term information may be ignored when predicting the next words. In this paper, we propose Attention-based Memory Selection Recurrent Network (AMSRN), in which the model can review the information stored in the memory at each previous time step and select the relevant information to help generate the outputs. In AMSRN, the attention mechanism finds the time steps storing the relevant information in the memory, and memory selection determines which dimensions of the memory are involved in computing the attention weights and from which the information is extracted.In the experiments, AMSRN outperformed long short-term memory (LSTM) based language models on both English and Chinese corpora. Moreover, we investigate using entropy as a regularizer for attention weights and visualize how the attention mechanism helps language modeling.

URL: https://arxiv.org/abs/1611.08656

Notes: Language modeling c явной памятью

Improving Multi-Document Summarization via Text Classification

Authors: Ziqiang Cao, Wenjie Li, Sujian Li, Furu Wei

Abstract: Developed so far, multi-document summarization has reached its bottleneck due to the lack of sufficient training data and diverse categories of documents. Text classification just makes up for these deficiencies. In this paper, we propose a novel summarization system called TCSum, which leverages plentiful text classification data to improve the performance of multi-document summarization. TCSum projects documents onto distributed representations which act as a bridge between text classification and summarization. It also utilizes the classification results to produce summaries of different styles. Extensive experiments on DUC generic multi-document summarization datasets show that, TCSum can achieve the state-of-the-art performance without using any hand-crafted features and has the capability to catch the variations of summary styles with respect to different text categories.

URL: https://arxiv.org/abs/1611.09238

Notes: multi-task система, которая умеет учить эмбеддинги текстов для классификации и одновременно использовать их для суммаризации

Joint Copying and Restricted Generation for Paraphrase

Authors: Ziqiang Cao, Chuwei Luo, Wenjie Li, Sujian Li

Abstract: Many natural language generation tasks, such as abstractive summarization and text simplification, are paraphrase-orientated. In these tasks, copying and rewriting are two main writing modes. Most previous sequence-to-sequence (Seq2Seq) models use a single decoder and neglect this fact. In this paper, we develop a novel Seq2Seq model to fuse a copying decoder and a restricted generative decoder. The copying decoder finds the position to be copied based on a typical attention model. The generative decoder produces words limited in the source-specific vocabulary. To combine the two decoders and determine the final output, we develop a predictor to predict the mode of copying or rewriting. This predictor can be guided by the actual writing mode in the training data. We conduct extensive experiments on two different paraphrase datasets. The result shows that our model outperforms the state-of-the-art approaches in terms of both informativeness and language quality.

URL: https://arxiv.org/abs/1611.09235

Notes: новый подзод к генерации парафраз, копировать и генерировать совместно

Exploiting Unlabeled Data for Neural Grammatical Error Detection

Authors: Zhuoran Liu, Yang Liu

Abstract: Identifying and correcting grammatical errors in the text written by non-native writers has received increasing attention in recent years. Although a number of annotated corpora have been established to facilitate data-driven grammatical error detection and correction approaches, they are still limited in terms of quantity and coverage because human annotation is labor-intensive, time-consuming, and expensive. In this work, we propose to utilize unlabeled data to train neural network based grammatical error detection models. The basic idea is to cast error detection as a binary classification problem and derive positive and negative training examples from unlabeled data. We introduce an attention-based neural network to capture long-distance dependencies that influence the word being detected. Experiments show that the proposed approach significantly outperforms SVMs and convolutional networks with fixed-size context window.

URL: https://arxiv.org/abs/1611.08987

Notes: использование неразмеченных данных для поиска ошибок через генерацию данных для бинарной классификации

Deep Reinforcement Learning for Multi-Domain Dialogue Systems

Authors: Heriberto Cuayáhuitl, Seunghak Yu, Ashley Williamson, Jacob Carse

Abstract: Standard deep reinforcement learning methods such as Deep Q-Networks (DQN) for multiple tasks (domains) face scalability problems. We propose a method for multi-domain dialogue policy learning---termed NDQN, and apply it to an information-seeking spoken dialogue system in the domains of restaurants and hotels. Experimental results comparing DQN (baseline) versus NDQN (proposed) using simulations report that our proposed method exhibits better scalability and is promising for optimising the behaviour of multi-domain dialogue systems.

URL: https://arxiv.org/abs/1611.08675

Notes: надо смотреть, как люди применяют reinforcement learning к диалоговым системам

Learning to Compose Words into Sentences with Reinforcement Learning

Authors: Dani Yogatama, Phil Blunsom, Chris Dyer, Edward Grefenstette, Wang Ling

Abstract: We use reinforcement learning to learn tree-structured neural networks for computing representations of natural language sentences. In contrast with prior work on tree-structured models in which the trees are either provided as input or predicted using supervision from explicit treebank annotations, the tree structures in this work are optimized to improve performance on a downstream task. Experiments demonstrate the benefit of learning task-specific composition orders, outperforming both sequential encoders and recursive encoders based on treebank annotations. We analyze the induced trees and show that while they discover some linguistically intuitive structures (e.g., noun phrases, simple verb phrases), they are different than conventional English syntactic structures.

URL: https://arxiv.org/abs/1611.09100

Notes: строить предложения c помощью RL!!

A Simple, Fast Diverse Decoding Algorithm for Neural Generation

Authors: Jiwei Li, Will Monroe, Dan Jurafsky

Abstract: In this paper, we propose a simple, fast decoding algorithm that fosters diversity in neural generation. The algorithm modifies the standard beam search algorithm by adding an inter-sibling ranking penalty, favoring choosing hypotheses from diverse parents. We evaluate the proposed model on the tasks of dialogue response generation, abstractive summarization and machine translation. We find that diverse decoding helps across all tasks, especially those for which reranking is needed. We further propose a variation that is capable of automatically adjusting its diversity decoding rates for different inputs using reinforcement learning (RL). We observe a further performance boost from this RL technique. This paper includes material from the unpublished script "Mutual Information and Diverse Decoding Improve Neural Machine Translation" (Li and Jurafsky, 2016).

URL: https://arxiv.org/abs/1611.08562

Notes: люди научились поощрять разнообразие в генерации предложений

Scalable Bayesian Learning of Recurrent Neural Networks for Language Modeling

Authors: Zhe Gan, Chunyuan Li, Changyou Chen, Yunchen Pu, Qinliang Su, Lawrence Carin

Abstract: Recurrent neural networks (RNNs) have shown promising performance for language modeling. However, traditional training of RNNs using back-propagation through time often suffers from overfitting. One reason for this is that stochastic optimization (used for large training sets) does not provide good estimates of model uncertainty. This paper leverages recent advances in stochastic gradient Markov Chain Monte Carlo (also appropriate for large training sets) to learn weight uncertainty in RNNs. It yields a principled Bayesian learning algorithm, adding gradient noise during training (enhancing exploration of the model-parameter space) and model averaging when testing. Extensive experiments on various RNN models and across a broad range of applications demonstrate the superiority of the proposed approach over stochastic optimization.

URL: https://arxiv.org/abs/1611.08034

Notes: байесовский подход для language moding

Geometric deep learning: going beyond Euclidean data

Authors: Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, Pierre Vandergheynst

Abstract: Many signal processing problems involve data whose underlying structure is non-Euclidean, but may be modeled as a manifold or (combinatorial) graph. For instance, in social networks, the characteristics of users can be modeled as signals on the vertices of the social graph. Sensor networks are graph models of distributed interconnected sensors, whose readings are modelled as time-dependent signals on the vertices. In genetics, gene expression data are modeled as signals defined on the regulatory network. In neuroscience, graph models are used to represent anatomical and functional structures of the brain. Modeling data given as points in a high-dimensional Euclidean space using nearest neighbor graphs is an increasingly popular trend in data science, allowing practitioners access to the intrinsic structure of the data. In computer graphics and vision, 3D objects are modeled as Riemannian manifolds (surfaces) endowed with properties such as color texture. Even more complex examples include networks of operators, e.g., functional correspondences or difference operators in a collection of 3D shapes, or orientations of overlapping cameras in multi-view vision ("structure from motion") problems. The complexity of geometric data and the availability of very large datasets (in the case of social networks, on the scale of billions) suggest the use of machine learning techniques. In particular, deep learning has recently proven to be a powerful tool for problems with large datasets with underlying Euclidean structure. The purpose of this paper is to overview the problems arising in relation to geometric deep learning and present solutions existing today for this class of problems, as well as key difficulties and future research directions.

URL: https://arxiv.org/abs/1611.08097

Notes: надо посмотреть, как это ложится на онтологии

Dialogue Learning With Human-In-The-Loop

Authors: Jiwei Li, Alexander H. Miller, Sumit Chopra, Marc'Aurelio Ranzato, Jason Weston

Abstract: An important aspect of developing conversational agents is to give a bot the ability to improve through communicating with humans and to learn from the mistakes that it makes. Most research has focused on learning from fixed training sets of labeled data rather than interacting with a dialogue partner in an online fashion. In this paper we explore this direction in a reinforcement learning setting where the bot improves its question-answering ability from feedback a teacher gives following its generated responses. We build a simulator that tests various aspects of such learning in a synthetic environment, and introduce models that work in this regime. Finally, real experiments with Mechanical Turk validate the approach.

URL: https://arxiv.org/abs/1611.09823

Notes: Свежая статья от FAIR по горячей теме RL в диалоговых системах

Intelligible Language Modeling with Input Switched Affine Networks

Authors: Jakob N. Foerster, Justin Gilmer, Jan Chorowski, Jascha Sohl-Dickstein, David Sussillo

Abstract: The computational mechanisms by which nonlinear recurrent neural networks (RNNs) achieve their goals remains an open question. There exist many problem domains where intelligibility of the network model is crucial for deployment. Here we introduce a recurrent architecture composed of input-switched affine transformations, in other words an RNN without any nonlinearity and with one set of weights per input. We show that this architecture achieves near identical performance to traditional architectures on language modeling of Wikipedia text, for the same number of model parameters. It can obtain this performance with the potential for computational speedup compared to existing methods, by precomputing the composed affine transformations corresponding to longer input sequences. As our architecture is affine, we are able to understand the mechanisms by which it functions using linear methods. For example, we show how the network linearly combines contributions from the past to make predictions at the current time step. We show how representations for words can be combined in order to understand how context is transferred across word boundaries. Finally, we demonstrate how the system can be executed and analyzed in arbitrary bases to aid understanding.

URL: https://arxiv.org/abs/1611.09434

Notes: RNN без нелинейностей

NewsQA: A Machine Comprehension Dataset

Authors: Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, Kaheer Suleman

Abstract: We present NewsQA, a challenging machine comprehension dataset of over 100,000 question-answer pairs. Crowdworkers supply questions and answers based on a set of over 10,000 news articles from CNN, with answers consisting in spans of text from the corresponding articles. We collect this dataset through a four-stage process designed to solicit exploratory questions that require reasoning. A thorough analysis confirms that NewsQA demands abilities beyond simple word matching and recognizing entailment. We measure human performance on the dataset and compare it to several strong neural models. The performance gap between humans and machines (25.3% F1) indicates that significant progress can be made on NewsQA through future research. The dataset is freely available at datasets.maluuba.com/NewsQA.

URL: https://arxiv.org/abs/1611.09830

Notes: новый датасет на понимание текста

Identity-sensitive Word Embedding through Heterogeneous Networks

Authors: Jian Tang, Meng Qu, Qiaozhu Mei

Abstract: Most existing word embedding approaches do not distinguish the same words in different contexts, therefore ignoring their contextual meanings. As a result, the learned embeddings of these words are usually a mixture of multiple meanings. In this paper, we acknowledge multiple identities of the same word in different contexts and learn the \textbf{identity-sensitive} word embeddings. Based on an identity-labeled text corpora, a heterogeneous network of words and word identities is constructed to model different-levels of word co-occurrences. The heterogeneous network is further embedded into a low-dimensional space through a principled network embedding approach, through which we are able to obtain the embeddings of words and the embeddings of word identities. We study three different types of word identities including topics, sentiments and categories. Experimental results on real-world data sets show that the identity-sensitive word embeddings learned by our approach indeed capture different meanings of words and outperforms competitive methods on tasks including text classification and word similarity computation.

URL: https://arxiv.org/abs/1611.09878

Notes: Статья о том, как учить разные эмбеддинги для слов значащих разное в разных контекстах.

GANs for Sequences of Discrete Elements with the Gumbel-softmax Distribution

Authors: Matt J. Kusner, José Miguel Hernández-Lobato

Abstract: Generative Adversarial Networks (GAN) have limitations when the goal is to generate sequences of discrete elements. The reason for this is that samples from a distribution on discrete objects such as the multinomial are not differentiable with respect to the distribution parameters. This problem can be avoided by using the Gumbel-softmax distribution, which is a continuous approximation to a multinomial distribution parameterized in terms of the softmax function. In this work, we evaluate the performance of GANs based on recurrent neural networks with Gumbel-softmax output distributions in the task of generating sequences of discrete elements.

URL: https://arxiv.org/abs/1611.04051

Notes: a paper from NIPS, thnx to Alex; Gumbel intro for me

PGQ: Combining policy gradient and Q-learning

Authors: Brendan O'Donoghue, Remi Munos, Koray Kavukcuoglu, Volodymyr Mnih

Abstract: Policy gradient is an efficient technique for improving a policy in a reinforcement learning setting. However, vanilla online variants are on-policy only and not able to take advantage of off-policy data. In this paper we describe a new technique that combines policy gradient with off-policy Q-learning, drawing experience from a replay buffer. This is motivated by making a connection between the fixed points of the regularized policy gradient algorithm and the Q-values. This connection allows us to estimate the Q-values from the action preferences of the policy, to which we apply Q-learning updates. We refer to the new technique as 'PGQ', for policy gradient and Q-learning. We also establish an equivalency between action-value fitting techniques and actor-critic algorithms, showing that regularized policy gradient techniques can be interpreted as advantage function learning algorithms. We conclude with some numerical examples that demonstrate improved data efficiency and stability of PGQ. In particular, we tested PGQ on the full suite of Atari games and achieved performance exceeding that of both asynchronous advantage actor-critic (A3C) and Q-learning.

URL: https://arxiv.org/abs/1611.01626

Notes: updated version of this paper has been released; PGQ is one of today's SOTA in RL

A Convolutional Encoder Model for Neural Machine Translation

Authors: Jonas Gehring, Michael Auli, David Grangier, Yann N. Dauphin

Abstract: The prevalent approach to neural machine translation relies on bi-directional LSTMs to encode the source sentence. In this paper we present a faster and simpler architecture based on a succession of convolutional layers. This allows to encode the entire source sentence simultaneously compared to recurrent networks for which computation is constrained by temporal dependencies. On WMT'16 English-Romanian translation we achieve competitive accuracy to the state-of-the-art and we outperform several recently published results on the WMT'15 English-German task. Our models obtain almost the same accuracy as a very deep LSTM setup on WMT'14 English-French translation. Our convolutional encoder speeds up CPU decoding by more than two times at the same or higher accuracy as a strong bi-directional LSTM baseline.

URL: https://arxiv.org/abs/1611.02344

Notes: great in its simplicity idea to use convolutional encoder improved SOTA on NMT

Reparameterization trick for discrete variables

Authors: Seiya Tokui, Issei Sato

Abstract: Low-variance gradient estimation is crucial for learning directed graphical models parameterized by neural networks, where the reparameterization trick is widely used for those with continuous variables. While this technique gives low-variance gradient estimates, it has not been directly applicable to discrete variables, the sampling of which inherently requires discontinuous operations. We argue that the discontinuity can be bypassed by marginalizing out the variable of interest, which results in a new reparameterization trick for discrete variables. This reparameterization greatly reduces the variance, which is understood by regarding the method as an application of common random numbers to the estimation. The resulting estimator is theoretically guaranteed to have a variance not larger than that of the likelihood-ratio method with the optimal input-dependent baseline. We give empirical results for variational learning of sigmoid belief networks.

URL: https://arxiv.org/abs/1611.01239

Notes: good explanation of reparam trick, here it is obvious to be close to REINFORCE algo

Categorical Reparameterization with Gumbel-Softmax

Authors: Eric Jang, Shixiang Gu, Ben Poole

Abstract: Categorical variables are a natural choice for representing discrete structure in the world. However, stochastic neural networks rarely use categorical latent variables due to the inability to backpropagate through samples. In this work, we present an efficient gradient estimator that replaces the non-differentiable sample from a categorical distribution with a differentiable sample from a novel Gumbel-Softmax distribution. This distribution has the essential property that it can be smoothly annealed into a categorical distribution. We show that our Gumbel-Softmax estimator outperforms state-of-the-art gradient estimators on structured output prediction and unsupervised generative modeling tasks with categorical latent variables, and enables large speedups on semi-supervised classification.

URL: https://arxiv.org/abs/1611.01144

Notes: Gumbel-softmax is introduced here

2016-12

Overcoming catastrophic forgetting in neural networks

Authors: James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, Raia Hadsell

Abstract: The ability to learn tasks in a sequential fashion is crucial to the development of artificial intelligence. Neural networks are not, in general, capable of this and it has been widely thought that catastrophic forgetting is an inevitable feature of connectionist models. We show that it is possible to overcome this limitation and train networks that can maintain expertise on tasks which they have not experienced for a long time. Our approach remembers old tasks by selectively slowing down learning on the weights important for those tasks. We demonstrate our approach is scalable and effective by solving a set of classification tasks based on the MNIST hand written digit dataset and by learning several Atari 2600 games sequentially.

URL: https://arxiv.org/abs/1612.00796

Notes: свежая статья от DeepMind о том, как избегать забывания в процессе обучения

Bootstrapping incremental dialogue systems: using linguistic knowledge to learn from minimal data

Authors: JDimitrios Kalatzis, Arash Eshghi, Oliver Lemon

Abstract: We present a method for inducing new dialogue systems from very small amounts of unannotated dialogue data, showing how word-level exploration using Reinforcement Learning (RL), combined with an incremental and semantic grammar - Dynamic Syntax (DS) - allows systems to discover, generate, and understand many new dialogue variants. The method avoids the use of expensive and time-consuming dialogue act annotations, and supports more natural (incremental) dialogues than turn-based systems. Here, language generation and dialogue management are treated as a joint decision/optimisation problem, and the MDP model for RL is constructed automatically. With an implemented system, we show that this method enables a wide range of dialogue variations to be automatically captured, even when the system is trained from only a single dialogue. The variants include question-answer pairs, over- and under-answering, self- and other-corrections, clarification interaction, split-utterances, and ellipsis. This generalisation property results from the structural knowledge and constraints present within the DS grammar, and highlights some limitations of recent systems built using machine learning techniques only.

URL: https://arxiv.org/abs/1612.00347

Notes: люди добавляют знания из лингвистики для обучения диалоговых систем c RL.

Temporal Attention-Gated Model for Robust Sequence Classification

Authors: JWenjie Pei, Tadas Baltrušaitis, David M.J. Tax, Louis-Philippe Morency

Abstract: Typical techniques for sequence classification are designed for well-segmented sequences which has been edited to remove noisy or irrelevant parts. Therefore, such methods cannot be easily applied on noisy sequences which are expected in real-world applications. We present the Temporal Attention-Gated Model (TAGM) which is able to deal with noisy sequences. Our model assimilates ideas from attention models and gated recurrent networks. Specifically, we employ an attention model to measure the relevance of each time step of a sequence to the final decision. We then use the relevant segments based on their attention scores in a novel gated recurrent network to learn the hidden representation for the classification. More importantly, our attention weights provide a physically meaningful interpretation for the salience of each time step in the sequence. We demonstrate the merits of our model in both interpretability and classification performance on a variety of tasks, including speech recognition, textual sentiment analysis and event recognition.

URL: https://arxiv.org/abs/1612.00385

Notes: attention, which is devived in temporal manner from the inputs along with actual recurrent NN

End-to-End Joint Learning of Natural Language Understanding and Dialogue Manager

Authors: Xuesong Yang, Yun-Nung Chen, Dilek Hakkani-Tur, Paul Crook, Xiujun Li, Jianfeng Gao, Li Deng

Abstract: Natural language understanding and dialogue policy learning are both essential in conversational systems that predict the next system actions in response to a current user utterance. Conventional approaches aggregate separate models of natural language understanding (NLU) and system action prediction (SAP) as a pipeline that is sensitive to noisy outputs of error-prone NLU. To address the issues, we propose an end-to-end deep recurrent neural network with limited contextual dialogue memory by jointly training NLU and SAP on DSTC4 multi-domain human-human dialogues. Experiments show that our proposed model significantly outperforms the state-of-the-art pipeline models for both NLU and SAP, which indicates that our joint model is capable of mitigating the affects of noisy NLU outputs, and NLU model can be refined by error flows backpropagating from the extra supervised signals of system actions.

URL: https://arxiv.org/abs/1612.00913

Notes: свежая статья от MSR по dialog state stacking

Reading Comprehension using Entity-based Memory Network

Authors: Xun Wang, Katsuhito Sudoh, Masaaki Nagata, Tomohide Shibata, Kawahara Daisuke, Kurohashi Sadao

Abstract: This paper introduces a novel neural network model for question answering, the \emph{entity-based memory network}. It enhances neural networks' ability of representing and calculating information over a long period by keeping records of entities contained in text. The core component is a memory pool which comprises entities' states. These entities' states are continuously updated according to the input text. Questions with regard to the input text are used to search the memory pool for related entities and answers are further predicted based on the states of retrieved entities. Compared with previous memory network models, the proposed model is capable of handling fine-grained information and more sophisticated relations based on entities. We formulated several different tasks as question answering problems and tested the proposed model. Experiments reported satisfying results.

URL: https://arxiv.org/abs/1612.03551

Notes:

FastText.zip: Compressing text classification models

Authors: Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hérve Jégou, Tomas Mikolov

Abstract: We consider the problem of producing compact architectures for text classification, such that the full model fits in a limited amount of memory. After considering different solutions inspired by the hashing literature, we propose a method built upon product quantization to store word embeddings. While the original technique leads to a loss in accuracy, we adapt this method to circumvent quantization artefacts. Our experiments carried out on several benchmarks show that our approach typically requires two orders of magnitude less memory than fastText while being only slightly inferior with respect to accuracy. As a result, it outperforms the state of the art by a good margin in terms of the compromise between memory usage and accuracy.

URL: https://arxiv.org/abs/1612.03651

Notes:

Neural Machine Translation by Minimising the Bayes-risk with Respect to Syntactic Translation Lattices

Authors: Felix Stahlberg, Adrià de Gispert, Eva Hasler, Bill Byrne

Abstract: We present a novel scheme to combine neural machine translation (NMT) with traditional statistical machine translation (SMT). Our approach borrows ideas from linearised lattice minimum Bayes-risk decoding for SMT. The NMT score is combined with the Bayes-risk of the translation according the SMT lattice. This makes our approach much more flexible than n-best list or lattice rescoring as the neural decoder is not restricted to the SMT search space. We show an efficient and simple way to integrate risk estimation into the NMT decoder. We test our method on English-German and Japanese-English and report significant gains over lattice rescoring on several data sets for both single and ensembled NMT.

URL: https://arxiv.org/abs/1612.03791

Notes:

Context-aware Sentiment Word Identification: sentiword2vec

Authors: Yushi Yao, Guangjian Li

Abstract: Traditional sentiment analysis often uses sentiment dictionary to extract sentiment information in text and classify documents. However, emerging informal words and phrases in user generated content call for analysis aware to the context. Usually, they have special meanings in a particular context. Because of its great performance in representing inter-word relation, we use sentiment word vectors to identify the special words. Based on the distributed language model word2vec, in this paper we represent a novel method about sentiment representation of word under particular context, to be detailed, to identify the words with abnormal sentiment polarity in long answers. Result shows the improved model shows better performance in representing the words with special meaning, while keep doing well in representing special idiomatic pattern. Finally, we will discuss the meaning of vectors representing in the field of sentiment, which may be different from general object-based conditions.

URL: https://arxiv.org/abs/1612.03769

Notes:

A Character-Word Compositional Neural Language Model for Finnish

Authors: Matti Lankinen, Hannes Heikinheimo, Pyry Takala, Tapani Raiko, Juha Karhunen

Abstract: Inspired by recent research, we explore ways to model the highly morphological Finnish language at the level of characters while maintaining the performance of word-level models. We propose a new Character-to-Word-to-Character (C2W2C) compositional language model that uses characters as input and output while still internally processing word level embeddings. Our preliminary experiments, using the Finnish Europarl V7 corpus, indicate that C2W2C can respond well to the challenges of morphologically rich languages such as high out of vocabulary rates, the prediction of novel words, and growing vocabulary size. Notably, the model is able to correctly score inflectional forms that are not present in the training data and sample grammatically and semantically correct Finnish sentences character by character.

URL: https://arxiv.org/abs/1612.03266

Notes:

#HashtagWars: Learning a Sense of Humor

Authors: Peter Potash, Alexey Romanov, Anna Rumshisky

Abstract: In this work, we present a new dataset for computational humor, specifically comparative humor ranking, which attempts to eschew the ubiquitous binary approach to humor detection. The dataset consists of tweets that are humorous responses to a given hashtag. We describe the motivation for this new dataset, as well as the collection process, which includes a description of our semi-automated system for data collection. We also present initial experiments for this dataset using both unsupervised and supervised approaches. Our best supervised system achieved 63.7% accuracy, suggesting that this task is much more difficult than comparable humor detection tasks. Initial experiments indicate that a character-level model is more suitable for this task than a token-level model, likely due to a large amount of puns that can be captured by a character-level model.

URL: https://arxiv.org/abs/1612.03216

Notes:

Evaluating Creative Language Generation: The Case of Rap Lyric Ghostwriting

Authors: Peter Potash, Alexey Romanov, Anna Rumshisky

Abstract: Language generation tasks that seek to mimic human ability to use language creatively are difficult to evaluate, since one must consider creativity, style, and other non-trivial aspects of the generated text. The goal of this paper is to develop evaluation methods for one such task, ghostwriting of rap lyrics, and to provide an explicit, quantifiable foundation for the goals and future directions of this task. Ghostwriting must produce text that is similar in style to the emulated artist, yet distinct in content. We develop a novel evaluation methodology that addresses several complementary aspects of this task, and illustrate how such evaluation can be used to meaningfully analyze system performance. We provide a corpus of lyrics for 13 rap artists, annotated for stylistic similarity, which allows us to assess the feasibility of manual evaluation for generated verse.

URL: https://arxiv.org/abs/1612.03205

Notes:

Generalizable Features From Unsupervised Learning

Authors: Mehdi Mirza, Aaron Courville, Yoshua Bengio

Abstract: Humans learn a predictive model of the world and use this model to reason about future events and the consequences of actions. In contrast to most machine predictors, we exhibit an impressive ability to generalize to unseen scenarios and reason intelligently in these settings. One important aspect of this ability is physical intuition(Lake et al., 2016). In this work, we explore the potential of unsupervised learning to find features that promote better generalization to settings outside the supervised training distribution. Our task is predicting the stability of towers of square blocks. We demonstrate that an unsupervised model, trained to predict future frames of a video sequence of stable and unstable block configurations, can yield features that support extrapolating stability prediction to blocks configurations outside the training set distribution

URL: https://arxiv.org/abs/1612.03809

Notes: свежая статья от Бенжио про извелечение abx из unsupervised pre-train

StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks

Authors: Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, Dimitris Metaxas

Abstract: Synthesizing photo-realistic images from text descriptions is a challenging problem in computer vision and has many practical applications. Samples generated by existing text-to-image approaches can roughly reflect the meaning of the given descriptions, but they fail to contain necessary details and vivid object parts. In this paper, we propose stacked Generative Adversarial Networks (StackGAN) to generate photo-realistic images conditioned on text descriptions. The Stage-I GAN sketches the primitive shape and basic colors of the object based on the given text description, yielding Stage-I low resolution images. The Stage-II GAN takes Stage-I results and text descriptions as inputs, and generates high resolution images with photo-realistic details. The Stage-II GAN is able to rectify defects and add compelling details with the refinement process. Samples generated by StackGAN are more plausible than those generated by existing approaches. Importantly, our StackGAN for the first time generates realistic 256 x 256 images conditioned on only text descriptions, while state-of-the-art methods can generate at most 128 x 128 images. To demonstrate the effectiveness of the proposed StackGAN, extensive experiments are conducted on CUB and Oxford-102 datasets, which contain enough object appearance variations and are widely-used for text-to-image generation analysis.

URL: https://arxiv.org/abs/1612.03242

Notes: GAN рядом c текстами

Large-Margin Softmax Loss for Convolutional Neural Networks

Authors: Weiyang Liu, Yandong Wen, Zhiding Yu, Meng Yang

Abstract: SCross-entropy loss together with softmax is arguably one of the most common used supervision components in convolutional neural networks (CNNs). Despite its simplicity, popularity and excellent performance, the component does not explicitly encourage discriminative learning of features. In this paper, we propose a generalized large-margin softmax (L-Softmax) loss which explicitly encourages intra-class compactness and inter-class separability between learned features. Moreover, L-Softmax not only can adjust the desired margin but also can avoid overfitting. We also show that the L-Softmax loss can be optimized by typical stochastic gradient descent. Extensive experiments on four benchmark datasets demonstrate that the deeply-learned features with L-softmax loss become more discriminative, hence significantly boosting the performance on a variety of visual classification and verification tasks.

URL: https://arxiv.org/abs/1612.02295

Notes: Допилка для софтмакса, чтобы были компактные кластера при классификации

Tracking the World State with Recurrent Entity Networks

Authors: Mikael Henaff, Jason Weston, Arthur Szlam, Antoine Bordes, Yann LeCun

Abstract: We introduce a new model, the Recurrent Entity Network (EntNet). It is equipped with a dynamic long-term memory which allows it to maintain and update a representation of the state of the world as it receives new data. For language understanding tasks, it can reason on-the-fly as it reads text, not just when it is required to answer a question or respond as is the case for a Memory Network (Sukhbaatar et al., 2015). Like a Neural Turing Machine or Differentiable Neural Computer (Graves et al., 2014; 2016) it maintains a fixed size memory and can learn to perform location and content-based read and write operations. However, unlike those models it has a simple parallel architecture in which several memory locations can be updated simultaneously. The EntNet sets a new state-of-the-art on the bAbI tasks, and is the first method to solve all the tasks in the 10k training examples setting. We also demonstrate that it can solve a reasoning task which requires a large number of supporting facts, which other methods are not able to solve, and can generalize past its training horizon. It can also be practically used on large scale datasets such as Children's Book Test, where it obtains competitive performance, reading the story in a single pass.

URL: https://arxiv.org/abs/1612.03969

Notes: EntNet - next step of Memory Networks architecture

Online Sequence-to-Sequence Reinforcement Learning for Open-Domain Conversational Agents

Authors: Nabiha Asghar, Pascal Poupart, Jiang Xin, Hang Li

Abstract: We propose an online, end-to-end, deep reinforcement learning technique to develop generative conversational agents for open-domain dialogue. We use a unique combination of offline two-phase supervised learning and online reinforcement learning with human users to train our agent. While most existing research proposes hand-crafted and develop-defined reward functions for reinforcement, we devise a novel reward mechanism based on a variant of Beam Search and one-character user-feedback at each step. Experiments show that our model, when trained on a small and shallow Seq2Seq network, successfully promotes the generation of meaningful, diverse and interesting responses, and can be used to train agents with customized personas and conversational styles.

URL: https://arxiv.org/abs/1612.03929

Notes: seq2seq RL!!

Building Large Machine Reading-Comprehension Datasets using Paragraph Vectors

Authors: Radu Soricut, Nan Ding

Abstract: We present a dual contribution to the task of machine reading-comprehension: a technique for creating large-sized machine-comprehension (MC) datasets using paragraph-vector models; and a novel, hybrid neural-network architecture that combines the representation power of recurrent neural networks with the discriminative power of fully-connected multi-layered networks. We use the MC-dataset generation technique to build a dataset of around 2 million examples, for which we empirically determine the high-ceiling of human performance (around 91% accuracy), as well as the performance of a variety of computer models. Among all the models we have experimented with, our hybrid neural-network architecture achieves the highest performance (83.2% accuracy). The remaining gap to the human-performance ceiling provides enough room for future model improvements.

URL: https://arxiv.org/abs/1612.04342

Notes: новые датасеты по задаче machine comprehension

Multi-Perspective Context Matching for Machine Comprehension

Authors: Zhiguo Wang, Haitao Mi, Wael Hamza, Radu Florian

Abstract: Previous machine comprehension (MC) datasets are either too small to train end-to-end deep learning models, or not difficult enough to evaluate the ability of current MC techniques. The newly released SQuAD dataset alleviates these limitations, and gives us a chance to develop more realistic MC models. Based on this dataset, we propose a Multi-Perspective Context Matching (MPCM) model, which is an end-to-end system that directly predicts the answer beginning and ending points in a passage. Our model first adjusts each word-embedding vector in the passage by multiplying a relevancy weight computed against the question. Then, we encode the question and weighted passage by using bi-directional LSTMs. For each point in the passage, our model matches the context of this point against the encoded question from multiple perspectives and produces a matching vector. Given those matched vectors, we employ another bi-directional LSTM to aggregate all the information and predict the beginning and ending points. Experimental result on the test set of SQuAD shows that our model achieves a competitive result on the leaderboard.

URL: https://arxiv.org/abs/1612.04211

Notes: new model for reading comprehension, based on SQuAD

Information Extraction with Character-level Neural Networks and Noisy Supervision

Authors: Philipp Meerkamp (Bloomberg LP), Zhengyi Zhou (AT&T Labs Research)

Abstract: We present an architecture for information extraction from text that augments an existing parser with a character-level neural network. To train the neural network, we compute a measure of consistency of extracted data with existing databases, and use it as a form of noisy supervision. Our architecture combines the ability of constraint-based information extraction system to easily incorporate domain knowledge and constraints with the ability of deep neural networks to leverage large amounts of data to learn complex features. The system led to large improvements over a mature and highly tuned constraint-based information extraction system used at Bloomberg for financial language text. At the same time, the new system massively reduces the development effort, allowing rule-writers to write high-recall constraints while relying on the deep neural network to remove false positives and boost precision.

URL: https://arxiv.org/abs/1612.04118

Notes: noisy supervision - it could be helpful for us

Structured Sequence Modeling with Graph Convolutional Recurrent Networks

Authors: Youngjoo Seo, Michaël Defferrard, Pierre Vandergheynst, Xavier Bresson

Abstract: This paper introduces Graph Convolutional Recurrent Network (GCRN), a deep learning model able to predict structured sequences of data. Precisely, GCRN is a generalization of classical recurrent neural networks (RNN) to data structured by an arbitrary graph. Such structured sequences can represent series of frames in videos, spatio-temporal measurements on a network of sensors, or random walks on a vocabulary graph for natural language modeling. The proposed model combines convolutional neural networks (CNN) on graphs to identify spatial structures and RNN to find dynamic patterns. We study two possible architectures of GCRN, and apply the models to two practical problems: predicting moving MNIST data, and modeling natural language with the Penn Treebank dataset. Experiments show that exploiting simultaneously graph spatial and dynamic information about data can improve both precision and learning speed.

URL: https://arxiv.org/abs/1612.07659

Notes: graph models for sequence learning

Highway and Residual Networks learn Unrolled Iterative Estimation

Authors: Klaus Greff, Rupesh K. Srivastava, Jürgen Schmidhuber

Abstract: The past year saw the introduction of new architectures such as Highway networks and Residual networks which, for the first time, enabled the training of feedforward networks with dozens to hundreds of layers using simple gradient descent. While depth of representation has been posited as a primary reason for their success, there are indications that these architectures defy a popular view of deep learning as a hierarchical computation of increasingly abstract features at each layer. In this report, we argue that this view is incomplete and does not adequately explain several recent findings. We propose an alternative viewpoint based on unrolled iterative estimation---a group of successive layers iteratively refine their estimates of the same features instead of computing an entirely new representation. We demonstrate that this viewpoint directly leads to the construction of Highway and Residual networks. Finally we provide preliminary experiments to discuss the similarities and differences between the two architectures.

URL: https://arxiv.org/abs/1612.07771

Notes: fresh paper from Schmidhuber, more on residual networks

Continuous multilinguality with language vectors

Authors: Robert Östling, Jörg Tiedemann

Abstract: Most existing models for multilingual natural language processing (NLP) treat language as a discrete category, and make predictions for either one language or the other. In contrast, we propose using continuous vector representations of language. We show that these can be learned efficiently with a character-based neural language model, and used to improve inference about language varieties not seen during training. In experiments with 1303 Bible translations into 990 different languages, we empirically explore the capacity of multilingual language models, and also show that the language vectors capture genetic relationships between languages.

URL: https://arxiv.org/abs/1612.07486

Notes: fresh paper for joint multilingual word vector representations

Language Modeling with Gated Convolutional Networks

Authors: Yann N. Dauphin, Angela Fan, Michael Auli, David Grangier

Abstract: The pre-dominant approach to language modeling to date is based on recurrent neural networks. In this paper we present a convolutional approach to language modeling. We introduce a novel gating mechanism that eases gradient propagation and which performs better than the LSTM-style gating of (Oord et al, 2016) despite being simpler. We achieve a new state of the art on WikiText-103 as well as a new best single-GPU result on the Google Billion Word benchmark. In settings where latency is important, our model achieves an order of magnitude speed-up compared to a recurrent baseline since computation can be parallelized over time. To our knowledge, this is the first time a non-recurrent approach outperforms strong recurrent models on these tasks.

URL: https://arxiv.org/abs/1612.08083

Notes: the next step in architectures from RNN to CNN, after Socher's semi-RNNs; these guys performed testings on tanh & sigmoid over the convolution, the sigmoid is better; they used up to 13 such layers (conv and gate on top of)

A Context-aware Attention Network for Interactive Question Answering

Authors: Huayu Li, Martin Renqiang Min, Yong Ge, Asim Kadav

Abstract: We develop a new model for Interactive Question Answering (IQA), using Gated-Recurrent-Unit recurrent networks (GRUs) as encoders for statements and questions, and another GRU as a decoder for outputs. Distinct from previous work, our approach employs context-dependent word-level attention for more accurate statement representations and question-guided sentence-level attention for better context modeling. Employing these mechanisms, our model accurately understands when it can output an answer or when it requires generating a supplementary question for additional input. When available, user's feedback is encoded and directly applied to update sentence-level attention to infer the answer. Extensive experiments on QA and IQA datasets demonstrate quantitatively the effectiveness of our model with significant improvement over conventional QA models.

URL: https://arxiv.org/abs/1612.07411

Notes: interactive user-interaction, like Human-in-the-Loop paper from FAIR, and more generally active learning approach

Understanding Neural Networks through Representation Erasure

Authors: Jiwei Li, Will Monroe, Dan Jurafsky

Abstract: While neural networks have been successfully applied to many natural language processing tasks, they come at the cost of interpretability. In this paper, we propose a general methodology to analyze and interpret decisions from a neural model by observing the effects on the model of erasing various parts of the representation, such as input word-vector dimensions, intermediate hidden units, or input words. We present several approaches to analyzing the effects of such erasure, from computing the relative difference in evaluation metrics, to using reinforcement learning to erase the minimum set of input words in order to flip a neural model's decision. In a comprehensive analysis of multiple NLP tasks, including linguistic feature classification, sentence-level sentiment analysis, and document level sentiment aspect prediction, we show that the proposed methodology not only offers clear explanations about neural model decisions, but also provides a way to conduct error analysis on neural models.

URL: https://arxiv.org/abs/1612.08220

Notes: fresh paper from Dan Jurafsky's group about RL application to NLP, they promise error analisys

Abstractive Headline Generation for Spoken Content by Attentive Recurrent Neural Networks with ASR Error Modeling

Authors: Lang-Chi Yu, Hung-yi Lee, Lin-shan Lee

Abstract: Headline generation for spoken content is important since spoken content is difficult to be shown on the screen and browsed by the user. It is a special type of abstractive summarization, for which the summaries are generated word by word from scratch without using any part of the original content. Many deep learning approaches for headline generation from text document have been proposed recently, all requiring huge quantities of training data, which is difficult for spoken document summarization. In this paper, we propose an ASR error modeling approach to learn the underlying structure of ASR error patterns and incorporate this model in an Attentive Recurrent Neural Network (ARNN) architecture. In this way, the model for abstractive headline generation for spoken content can be learned from abundant text data and the ASR data for some recognizers. Experiments showed very encouraging results and verified that the proposed ASR error model works well even when the input spoken content is recognized by a recognizer very different from the one the model learned from.

URL: https://arxiv.org/abs/1612.08375

Notes: interesting by ASR error modeling, could be useful for me

Text Summarization using Deep Learning and Ridge Regression

Authors: Karthik Bangalore Mani

Abstract: We develop models and extract relevant features for automatic text summarization and investigate the performance of different models on the DUC 2001 dataset. Two different models were developed, one being a ridge regressor and the other one was a multi-layer perceptron. The hyperparameters were varied and their performance were noted. We segregated the summarization task into 2 main steps, the first being sentence ranking and the second step being sentence selection. In the first step, given a document, we sort the sentences based on their Importance, and in the second step, in order to obtain non-redundant sentences, we weed out the sentences that are have high similarity with the previously selected sentences.

URL: https://arxiv.org/abs/1612.08333

Notes: ranking-selection for text summarization, little bit outdated approah as for me, but could be interesting to compare with

Here's My Point: Argumentation Mining with Pointer Networks

Authors: Peter Potash, Alexey Romanov, Anna Rumshisky

Abstract: One of the major goals in automated argumentation mining is to uncover the argument structure present in argumentative text. In order to determine this structure, one must understand how different individual components of the overall argument are linked. General consensus in this field dictates that the argument components form a hierarchy of persuasion, which manifests itself in a tree structure. This work provides the first neural network-based approach to argumentation mining, focusing on the two tasks of extracting links between argument components, and classifying types of argument components. In order to solve this problem, we propose to use a joint model that is based on a Pointer Network architecture. A Pointer Network is appealing for this task for the following reasons: 1) It takes into account the sequential nature of argument components; 2) By construction, it enforces certain properties of the tree structure present in argument relations; 3) The hidden representations can be applied to auxiliary tasks. In order to extend the contribution of the original Pointer Network model, we construct a joint model that simultaneously attempts to learn the type of argument component, as well as continuing to predict links between argument components. The proposed joint model achieves state-of-the-art results on two separate evaluation corpora, achieving far superior performance than a regular Pointer Network model. Our results show that optimizing for both tasks, and adding a fully-connected layer prior to recurrent neural network input, is crucial for high performance.

URL: https://arxiv.org/abs/1612.08994

Notes: pointer networks, augmentation, seems to be interesting

Deep Semi-Supervised Learning with Linguistically Motivated Sequence Labeling Task Hierarchies

Authors: Jonathan Godwin, Pontus Stenetorp, Sebastian Riedel

Abstract: In this paper we present a novel Neural Network algorithm for conducting semi-supervised learning for sequence labeling tasks arranged in a linguistically motivated hierarchy. This relationship is exploited to regularise the representations of supervised tasks by backpropagating the error of the unsupervised task through the supervised tasks. We introduce a neural network where lower layers are supervised by junior downstream tasks and the final layer task is an auxiliary unsupervised task. The architecture shows improvements of up to two percentage points F1 for Chunking compared to a plausible baseline.

URL: https://arxiv.org/abs/1612.09113

Notes: semi-supervised learning in natural language sequences

Modeling documents with Generative Adversarial Networks

Authors: John Glover

Abstract: This paper describes a method for using Generative Adversarial Networks to learn distributed representations of natural language documents. We propose a model that is based on the recently proposed Energy-Based GAN, but instead uses a Denoising Autoencoder as the discriminator network. Document representations are extracted from the hidden layer of the discriminator and evaluated both quantitatively and qualitatively.

URL: https://arxiv.org/abs/1612.09122

Notes: negative result in generating doc embeddings with GANs

The Predictron: End-To-End Learning and Planning

Authors: David Silver, Hado van Hasselt, Matteo Hessel, Tom Schaul, Arthur Guez, Tim Harley, Gabriel Dulac-Arnold, David Reichert, Neil Rabinowitz, Andre Barreto, Thomas Degris

Abstract: One of the key challenges of artificial intelligence is to learn models that are effective in the context of planning. In this document we introduce the predictron architecture. The predictron consists of a fully abstract model, represented by a Markov reward process, that can be rolled forward multiple "imagined" planning steps. Each forward pass of the predictron accumulates internal rewards and values over multiple planning depths. The predictron is trained end-to-end so as to make these accumulated values accurately approximate the true value function. We applied the predictron to procedurally generated random mazes and a simulator for the game of pool. The predictron yielded significantly more accurate predictions than conventional deep neural network architectures.

URL: https://arxiv.org/abs/1612.08810

Notes: fresh paper from Silver, end-to-end rl

Efficient iterative policy optimization

Authors: Nicolas Le Roux

Abstract: We tackle the issue of finding a good policy when the number of policy updates is limited. This is done by approximating the expected policy reward as a sequence of concave lower bounds which can be efficiently maximized, drastically reducing the number of policy updates required to achieve good performance. We also extend existing methods to negative rewards, enabling the use of control variates.

URL: https://arxiv.org/abs/1612.08967

Notes: new policy iteration method

Deep Learning and Hierarchal Generative Models

Authors: Elchanan Mossel

Abstract: In this paper we propose a new prism for studying deep learning motivated by connections between deep learning and evolution. Our main contributions are: 1, We introduce of a sequence of increasingly complex hierarchal generative models which interpolate between standard Markov models on trees (phylogenetic models) and deep learning models. 2. Formal definitions of classes of algorithms that are not deep. 3. Rigorous proofs showing that such classes are information theoretically much weaker than optimal "deep" learning algorithms. In our models, deep learning is performed efficiently and proven to classify correctly with high probability. All of the models and results are in the semi-supervised setting. Many open problems and future directions are presented.

URL: https://arxiv.org/abs/1612.09057

Notes: some proofs for hierarchical generative models

A Basic Recurrent Neural Network Model

Authors: Fathi M. Salem

Abstract: We present a model of a basic recurrent neural network (or bRNN) that includes a separate linear term with a slightly "stable" fixed matrix to guarantee bounded solutions and fast dynamic response. We formulate a state space viewpoint and adapt the constrained optimization Lagrange Multiplier (CLM) technique and the vector Calculus of Variations (CoV) to derive the (stochastic) gradient descent. In this process, one avoids the commonly used re-application of the circular chain-rule and identifies the error back-propagation with the co-state backward dynamic equations. We assert that this bRNN can successfully perform regression tracking of time-series. Moreover, the "vanishing and exploding" gradients are explicitly quantified and explained through the co-state dynamics and the update laws. The adapted CoV framework, in addition, can correctly and principally integrate new loss functions in the network on any variable and for varied goals, e.g., for supervised learning on the outputs and unsupervised learning on the internal (hidden) states.

URL: https://arxiv.org/abs/1612.09022

Notes: new theoretical results for RNNs

A Joint Speaker-Listener-Reinforcer Model for Referring Expressions

Authors: Licheng Yu, Hao Tan, Mohit Bansal, Tamara L. Berg

Abstract: Referring expressions are natural language constructions used to identify particular objects within a scene. In this paper, we propose a unified framework for the tasks of referring expression comprehension and generation. Our model is composed of three modules: speaker, listener, and reinforcer. The speaker generates referring expressions, the listener comprehends referring expressions, and the reinforcer introduces a reward function to guide sampling of more discriminative expressions. The listener-speaker modules are trained jointly in an end-to-end learning framework, allowing the modules to be aware of one another during learning while also benefiting from the discriminative reinforcer's feedback. We demonstrate that this unified framework and training achieves state-of-the-art results for both comprehension and generation on three referring expression datasets. Project and demo page: this https URL

URL: https://arxiv.org/abs/1612.09542

Notes: RL framework for teaching natural language

Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses

Authors: Ryan Lowe, Michael Noseworthy, Iulian V. Serban, Nicolas Angelard-Gontier, Yoshua Bengio, Joelle Pineau

Abstract: Automatically evaluating the quality of dialogue responses for unstructured domains is a challenging problem. Unfortunately, existing automatic evaluation metrics are biased and correlate very poorly with human judgements of response quality (Liu et al., 2016). Yet having an accurate automatic evaluation procedure is crucial for dialogue research, as it allows rapid prototyping and testing of new models with fewer expensive human evaluations. In response to this challenge, we formulate automatic dialogue evaluation as a learning problem. We present an evaluation model (ADEM) that learns to predict human-like scores to input responses, using a new dataset of human response scores. We show that the ADEM model’s predictions correlate significantly, and at level much higher than word-overlap metrics such as BLEU, with human judgements at both the utterance and system-level. We also show that ADEM can generalize to evaluating dialogue models unseen during training, an important step for automatic dialogue evaluation.

URL: https://openreview.net/pdf?id=HJ5PIaseg

Notes: new dialog quality metric, little bit poor as for me: RNN trained to embed resporses, and its output vectors are compared, but after all it is correlated a little with human judgement

Neural Networks for Joint Sentence Classification in Medical Paper Abstracts

Authors: Franck Dernoncourt, Ji Young Lee, Peter Szolovits

Abstract: Existing models based on artificial neural networks (ANNs) for sentence classification often do not incorporate the context in which sentences appear, and classify sentences individually. However, traditional sentence classification approaches have been shown to greatly benefit from jointly classifying subsequent sentences, such as with conditional random fields. In this work, we present an ANN architecture that combines the effectiveness of typical ANN models to classify sentences in isolation, with the strength of structured prediction. Our model achieves state-of-the-art results on two different datasets for sequential sentence classification in medical abstracts.

URL: https://arxiv.org/abs/1612.05251

Notes: basic architecture to start with to work with texts today, as for me; some results in classification also

Spatially Adaptive Computation Time for Residual Networks

Authors: Michael Figurnov, Maxwell D. Collins, Yukun Zhu, Li Zhang, Jonathan Huang, Dmitry Vetrov, Ruslan Salakhutdinov

Abstract: This paper proposes a deep learning architecture based on Residual Network that dynamically adjusts the number of executed layers for the regions of the image. This architecture is end-to-end trainable, deterministic and problem-agnostic. It is therefore applicable without any modifications to a wide range of computer vision problems such as image classification, object detection and image segmentation. We present experimental results showing that this model improves the computational efficiency of Residual Networks on the challenging ImageNet classification and COCO object detection datasets. Additionally, we evaluate the computation time maps on the visual saliency dataset cat2000 and find that they correlate surprisingly well with human eye fixation positions.

URL: https://arxiv.org/abs/1612.02297

Notes: adaptive computational time for resnet: we can exclude some blocks from computation for certain regions of an image

Tunable Efficient Unitary Neural Networks (EUNN) and their application to RNNs

Authors: Li Jing, Yichen Shen, Tena Dubček, John Peurifoy, Scott Skirlo, Yann LeCun, Max Tegmark, Marin Soljačić

Abstract: Using unitary (instead of general) matrices in artificial neural networks (ANNs) is a promising way to solve the gradient explosion/vanishing problem, as well as to enable ANNs to learn long-term correlations in the data. This approach appears particularly promising for Recurrent Neural Networks (RNNs). In this work, we present a new architecture for implementing an Efficient Unitary Neural Network (EUNNs); its main advantages can be summarized as follows. Firstly, the representation capacity of the unitary space in an EUNN is fully tunable, ranging from a subspace of SU(N) to the entire unitary space. Secondly, the computational complexity for training an EUNN is merely $\mathcal{O}(1)$ per parameter. Finally, we test the performance of EUNNs on the standard copying task, the pixel-permuted MNIST digit recognition benchmark as well as the Speech Prediction Test (TIMIT). We find that our architecture significantly outperforms both other state-of-the-art unitary RNNs and the LSTM architecture, in terms of the final performance and/or the wall-clock training speed. EUNNs are thus promising alternatives to RNNs and LSTMs for a wide variety of applications.

URL: https://arxiv.org/abs/1612.05231

Notes: efficent unitary RNN from LeCun; physics-inspired complex matrix factorization bringing O(1) computation complexity

Files

PAPERS.md

Latest commit

History

PAPERS.md

File metadata and controls

Table of Contents

Articles

1992

Simple Statistical Gradient-Following for Connectionist Reinforcement Learning

2009

A Scalable Hierarchical Distributed Language Model

2011

On optimization methods for deep learning

Natural Language Processing (almost) from Scratch

2012

A Fast and Simple Algorithm for Training Neural Probabilistic Language Models

2013

Concurrent Reinforcement Learning from Customer Interactions.

Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

Learning word embeddings efficiently with noise-contrastive estimation

Exploiting Similarities among Languages for Machine Translation

2014

A Clockwork RNN

A Convolutional Neural Network for Modelling Sentences

Neural Machine Translation by Jointly Learning to Align and Translate

2015

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Cyclical Learning Rates for Training Neural Networks

A Neural Attention Model for Abstractive Sentence Summarization

Unitary Evolution Recurrent Neural Networks

Adversarial Autoencoders

A Theoretically Grounded Application of Dropout in Recurrent Neural Networks

Network Representation Learning with Rich Text Information

Language Understanding for Text-based Games Using Deep Reinforcement Learning

Neural Machine Translation of Rare Words with Subword Units

2016-01

Incorporating Structural Alignment Biases into an Attentional Neural Translation Model

2016-02

Exploring the Limits of Language Modeling

Massively Multilingual Word Embeddings

2016-03

A Persona-Based Neural Conversation Model

Bayesian Neural Word Embedding

Sentence Pair Scoring: Towards Unified Framework for Text Comprehension

End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF

How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation

Incorporating Copying Mechanism in Sequence-to-Sequence Learning

Zero-Shot Learning of Intent Embeddings for Expansion by Convolutional Deep Structured Semantic Models

2016-05

Learning End-to-End Goal-Oriented Dialog

2016-06

End-to-end LSTM-based dialog control optimized with supervised and reinforcement learning

Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations

A Decomposable Attention Model for Natural Language Inference

2016-07

Representation learning for very short texts using weighted word embedding aggregation

Recurrent Neural Machine Translation

Convolutional Neural Networks Analyzed via Convolutional Sparse Coding

Tweet2Vec: Learning Tweet Embeddings Using Character-level CNN-LSTM Encoder-Decoder

Machine Learned Resume-Job Matching Solution

Automatic Attribute Discovery with Neural Activations

Deep nets for local manifold learning

CFGs-2-NLU: Sequence-to-Sequence Learning for Mapping Utterances to Semantics and Pragmatics

Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings

Layer Normalization

Sequence to sequence learning for unconstrained scene text recognition

Dataset and Neural Recurrent Sequence Labeling Model for Open-Domain Factoid Question Answering

Compositional Sequence Labeling Models for Error Detection in Learner Writing

Training Skinny Deep Neural Networks with Iterative Hard Thresholding Methods

Stochastic Backpropagation through Mixture Density Distributions

Neural Contextual Conversation Learning with Labeled Question-Answering Pairs

Constructing a Natural Language Inference Dataset using Generative Neural Networks

An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation

Neural Machine Translation with Recurrent Attention Modeling

An Empirical Evaluation of various Deep Learning Architectures for Bi-Sequence Classification Tasks

DSD: Regularizing Deep Neural Networks with Dense-Sparse-Dense Training Flow

Neural Semantic Encoders

Neural Discourse Modeling of Conversations

Neural Tree Indexers for Text Understanding