Deep Learning for NLP resources

State of the art resources for NLP sequence modeling tasks such as machine translation, image captioning, and dialog.

Deep Learning for NLP

Stanford Natural Language Processing
Intro NLP course with videos. This has no deep learning. But it is a good primer for traditional nlp. Covers topics such as sentence segmentation, word tokenizing, word normalization, n-grams, named entity recognition, part of speech tagging. Currently not available

Stanford CS 224D: Deep Learning for NLP class
Richard Socher. (2016) Class with syllabus, and slides.
Videos: 2015 lectures / 2016 lectures

A Primer on Neural Network Models for Natural Language Processing
Yoav Goldberg. October 2015. No new info, 75 page summary of state of the art.

Oxford Deep Learning for NLP class
Phil Blunsom. (2017) Class by Deep Mind NLP Group.
Lecture slides, videos, and practicals: Github Repository
Currently ongoing

Word Vectors

Resources about word vectors, aka word embeddings, and distributed representations for words.
Word vectors are numeric representations of words where similar words have similar vectors. Word vectors are often used as input to deep learning systems. This process is sometimes called pretraining.

A neural probabilistic language model.
Bengio 2003. Seminal paper on word vectors.

Efficient Estimation of Word Representations in Vector Space
Mikolov et al. 2013. Word2Vec generates word vectors in an unsupervised way by attempting to predict words from a corpus. Describes Continuous Bag-of-Words (CBOW) and Continuous Skip-gram models for learning word vectors.
Skip-gram takes center word and predict outside words. Skip-gram is better for large datasets.
CBOW - takes outside words and predict the center word. CBOW is better for smaller datasets.

Distributed Representations of Words and Phrases and their Compositionality
Mikolov et al. 2013. Learns vectors for phrases such as "New York Times." Includes optimizations for skip-gram: heirachical softmax, and negative sampling. Subsampling frequent words. (i.e. frequent words like "the" are skipped periodically to speed things up and improve vector for less frequently used words)

Linguistic Regularities in Continuous Space Word Representations
Mikolov et al. 2013. Performs well on word similarity and analogy task. Expands on famous example: King – Man + Woman = Queen
Word2Vec source code
Word2Vec tutorial in TensorFlow

word2vec Parameter Learning Explained
Rong 2014

Articles explaining word2vec: Deep Learning, NLP, and Representations and The amazing power of word vectors

GloVe: Global vectors for word representation
Pennington, Socher, Manning. 2014. Creates word vectors and relates word2vec to matrix factorizations. Evalutaion section led to controversy by Yoav Goldberg
Glove source code and training data

Enriching Word Vectors with Subword Information
Bojanowski, Grave, Joulin, Mikolov 2016
FastText Code

Sentiment Analysis

Thought vectors are numeric representations for sentences, paragraphs, and documents. This concept is used for many text classification tasks such as sentiment analysis.

Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
Socher et al. 2013. Introduces Recursive Neural Tensor Network and dataset: "sentiment treebank." Includes demo site. Uses a parse tree.

Distributed Representations of Sentences and Documents
Le, Mikolov. 2014. Introduces Paragraph Vector. Concatenates and averages pretrained, fixed word vectors to create vectors for sentences, paragraphs and documents. Also known as paragraph2vec. Doesn't use a parse tree.
Implemented in gensim. See doc2vec tutorial

Deep Recursive Neural Networks for Compositionality in Language
Irsoy & Cardie. 2014. Uses Deep Recursive Neural Networks. Uses a parse tree.

Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks
Tai et al. 2015 Introduces Tree LSTM. Uses a parse tree.

Semi-supervised Sequence Learning
Dai, Le 2015
Approach: "We present two approaches that use unlabeled data to improve sequence learning with recurrent networks. The first approach is to predict what comes next in a sequence, which is a conventional language model in natural language processing. The second approach is to use a sequence autoencoder..."
Result: "With pretraining, we are able to train long short term memory recurrent networks up to a few hundred timesteps, thereby achieving strong performance in many text classification tasks, such as IMDB, DBpedia and 20 Newsgroups."

Bag of Tricks for Efficient Text Classification
Joulin, Grave, Bojanowski, Mikolov 2016 Facebook AI Research.
"Our experiments show that our fast text classifier fastText is often on par with deep learning classifiers in terms of accuracy, and many orders of magnitude faster for training and evaluation."
FastText blog
FastText Code

Neural Machine Translation

In 2014, neural machine translation (NMT) performance became comprable to state of the art statistical machine translation(SMT).

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (abstract)
Cho et al. 2014 Breakthrough deep learning paper on machine translation. Introduces basic sequence to sequence model which includes two rnns, an encoder for input and a decoder for output.

Neural Machine Translation by jointly learning to align and translate (abstract)
Bahdanau, Cho, Bengio 2014.
Implements attention mechanism. "Each time the proposed model generates a word in a translation, it (soft-)searches for a set of positions in a source sentence where the most relevant information is concentrated"
Result: "comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation."
English to French Demo

On Using Very Large Target Vocabulary for Neural Machine Translation
Jean, Cho, Memisevic, Bengio 2014.
"we try replacing each [UNK] token with the aligned source word or its most likely translation determined by another word alignment model."
Result: English -> German bleu score = 21.59 (target vocabulary of 50,000)

Sequence to Sequence Learning with Neural Networks
Sutskever, Vinyals, Le 2014. (nips presentation). Uses seq2seq to generate translations.
Result: English -> French bleu score = 34.8 (WMT’14 dataset)
A key contribution is improvements from reversing the source sentences.
seq2seq tutorial in TensorFlow.

Addressing the Rare Word Problem in Neural Machine Translation (abstract)
Luong, Sutskever, Le, Vinyals, Zaremba 2014
Replace UNK words with dictionary lookup.
Result: English -> French BLEU score = 37.5.

Effective Approaches to Attention-based Neural Machine Translation
Luong, Pham, Manning. 2015
2 models of attention: global and local.
Result: English -> German 25.9 BLEU points

Context-Dependent Word Representation for Neural Machine Translation
Choi, Cho, Bengio 2016
"we propose to contextualize the word embedding vectors using a nonlinear bag-of-words representation of the source sentence."
"we propose to represent special tokens (such as numbers, proper nouns and acronyms) with typed symbols to facilitate translating those words that are not well-suited to be translated via continuous vectors."

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Wu et al. 2016
blog post
"WMT’14 English-to-French, our single model scores 38.95 BLEU"
"WMT’14 English-to-German, our single model scores 24.17 BLEU"

Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation
Johnson et al. 2016
blog post
Translations between untrained language pairs.

Google has started rolling out NMT to it's production system, and it's a significant improvement.

Image Captioning

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Xu et al. 2015 Creates captions by feeding image into a CNN which feeds into hidden state of an RNN that generates the caption. At each time step the RNN outputs next word and the next location to pay attention to via a probability over grid locations. Uses 2 types of attention soft and hard. Soft attention uses gradient descent and backprop and is deterministic. Hard attention selects the element with highest probability. Hard attention uses reinforcement learning, rather than backprop and is stochastic.

Open source implementation in TensorFlow

Conversation modeling / Dialog

Neural Responding Machine for Short-Text Conversation
Shang et al. 2015 Uses Neural Responding Machine. Trained on Weibo dataset. Achieves one round conversations with 75% appropriate responses.

A Neural Network Approach to Context-Sensitive Generation of Conversational Responses
Sordoni et al. 2015. Generates responses to tweets.
Uses Recurrent Neural Network Language Model (RLM) architecture of (Mikolov et al., 2010). source code: RNNLM Toolkit

Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models
Serban, Sordoni, Bengio et al. 2015. Extends hierarchical recurrent encoder-decoder neural network (HRED).

Attention with Intention for a Neural Network Conversation Model
Yao et al. 2015 Architecture is three recurrent networks: an encoder, an intention network and a decoder.

A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues
Serban, Sordoni, Lowe, Charlin, Pineau, Courville, Bengio 2016
Proposes novel architecture: VHRED. Latent Variable Hierarchical Recurrent Encoder-Decoder
Compares favorably against LSTM and HRED.

A Neural Conversation Model
Vinyals, Le 2015. Uses LSTM RNNs to generate conversational responses. Uses seq2seq framework. Seq2Seq was originally designed for machine translation and it "translates" a single sentence, up to around 79 words, to a single sentence response, and has no memory of previous dialog exchanges. Used in Google Smart Reply feature for Inbox

Incorporating Copying Mechanism in Sequence-to-Sequence Learning
Gu et al. 2016 Proposes CopyNet, builds on seq2seq.

A Persona-Based Neural Conversation Model
Li et al. 2016 Proposes persona-based models for handling the issue of speaker consistency in neural response generation. Builds on seq2seq.

Deep Reinforcement Learning for Dialogue Generation
Li et al. 2016. Uses reinforcement learing to generate diverse responses. Trains 2 agents to chat with each other. Builds on seq2seq.

Deep learning for chatbots
Article summary of state of the art, and challenges for chatbots.
Deep learning for chatbots. part 2
Implements a retrieval based dialog agent using dual encoder lstm with TensorFlow, based on the Ubuntu dataset [paper] includes source code

Memory and Attention Models

Attention mechanisms allows the network to refer back to the input sequence, instead of forcing it to encode all information into one fixed-length vector. - Attention and Memory in Deep Learning and NLP

Memory Networks Weston et. al 2014, and End-To-End Memory Networks Sukhbaatar et. al 2015.
Memory networks are implemented in MemNN. Attempts to solve task of reason attention and memory.
Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks
Weston 2015. Classifies QA tasks like single factoid, yes/no etc. Extends memory networks.
Evaluating prerequisite qualities for learning end to end dialog systems
Dodge et. al 2015. Tests Memory Networks on 4 tasks including reddit dialog task.
See Jason Weston lecture on MemNN

Neural Turing Machines
Graves, Wayne, Danihelka 2014.
We extend the capabilities of neural networks by coupling them to external memory resources, which they can interact with by attentional processes. The combined system is analogous to a Turing Machine or Von Neumann architecture but is differentiable end-toend, allowing it to be efficiently trained with gradient descent. Preliminary results demonstrate that Neural Turing Machines can infer simple algorithms such as copying, sorting, and associative recall from input and output examples. Olah and Carter blog on NTM

Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets
Joulin, Mikolov 2015. Stack RNN source code and blog post

Reasoning, Attention and Memory RAM workshop at NIPS 2015. slides included

zhucanxiang/DL4NLP

Deep Learning for NLP resources