Reading List


NLP Papers

CS 533: Natural Language Processing (NLP)

Probing Tasks

Summarization Papers

Information Retrieval

Talks

Machine Learning, Deep Learning

Blogs

  • Pruning and Knowledge Distillation

  • XLNet and TransformerXL
    For a Transformer, this is impossible because Transformers take fixed-length sequences as input and have no notion of "memory". All of their computations are stateless (this was actually one of the major selling points of the Transformer: no state means computation can be parallelized), so there is an upper limit on the distance of relationships a vanilla Transformer can model. The Transformer XL is a simple extension of the Transformer that seeks to resolve this problem. The idea is simple: what if we added recurrence to the Transformer? Adding recurrence at the word level would just make it an RNN, but what if we added recurrence at a "segment" level, i.e. added state between consecutive sequences of computations? The Transformer XL accomplishes this by caching the hidden states of the previous segment and passing them in as keys/values when processing the current segment (see the sketch at the end of this item). In addition, the Transformer XL introduces relative positional embeddings: instead of having an embedding represent the absolute position of a word, it uses an embedding to encode the relative distance between words. This embedding is used while computing the attention score between any two words; in other words, it enables the model to learn how to compute the attention score for words that are n positions before or after the current word.
    The XLNet model is forced to model bidirectional dependencies with permutation language modeling. In expectation, the model should learn the dependencies between all combinations of inputs, in contrast to traditional language models that only learn dependencies in one direction. The conceptual difference between BERT and XLNet: XLNet learns to predict the words in an arbitrary order, but in an autoregressive, sequential manner (not necessarily left-to-right), whereas BERT predicts all masked words simultaneously. In permutation language modeling, we are not changing the actual order of words in the input sentence; we are just changing the order in which we predict them.
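    A minimal NumPy sketch of the segment-level recurrence idea described above (an illustration, not the actual Transformer XL code; names like attend_with_memory, mem, and seg_len are made up for this example): the hidden states of the previous segment are cached and concatenated to the current segment's keys/values, so attention can reach back beyond the current segment.
    ```python
    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attend_with_memory(h_curr, mem, Wq, Wk, Wv):
        """Single-head attention where the keys/values also cover the cached
        hidden states of the previous segment (mem); queries come only from
        the current segment. (Transformer XL also stops gradients into mem
        and uses relative positions, both omitted here for brevity.)"""
        context = np.concatenate([mem, h_curr], axis=0)   # (mem_len + seg_len, d)
        Q = h_curr @ Wq
        K = context @ Wk
        V = context @ Wv
        scores = Q @ K.T / np.sqrt(Q.shape[-1])           # (seg_len, mem_len + seg_len)
        return softmax(scores) @ V

    d, seg_len = 8, 4
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
    mem = rng.normal(size=(seg_len, d))   # cached hidden states from the previous segment
    h = rng.normal(size=(seg_len, d))     # hidden states of the current segment
    print(attend_with_memory(h, mem, Wq, Wk, Wv).shape)   # (4, 8)
    ```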

  • Transformer Family

  • Paper Dissected: “Attention is All You Need” Explained

  • Paper Dissected: “BERT” Explained

  • A quick summary of modern NLP methods

  • Complete Modern NLP Survey

  • NLP Pretraining

  • NLP Applications

  • When Not to Choose the Best NLP Model

  • Are Sixteen Heads Really Better than One?

  • NLP Year in Review - 2019

  • Language Models

  • RNN – Andrej Karpathy’s blog The Unreasonable Effectiveness of Recurrent Neural Networks

  • LSTM – Christopher Olah’s blog Understanding LSTM Networks and R2Rt.com Written Memories: Understanding, Deriving and Extending the LSTM
    Use of RNNs (sequential data): when we don't need any further context (e.g. predicting the last word in "the clouds are in the sky"), it's pretty obvious the next word is going to be "sky". In such cases, where the gap between the relevant information and the place it's needed is small, RNNs can learn to use the past information. Unfortunately, as that gap grows, RNNs become unable to learn to connect the information. LSTMs are a special kind of RNN capable of learning long-term dependencies. The cell state is kind of like a conveyor belt: it runs straight down the entire chain, with only some minor linear interactions, so it's very easy for information to just flow along it unchanged. The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates. This helps gradient flow, and the structured gates add more flexibility to the model (see the sketch below).
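    A minimal single-step LSTM cell in NumPy, sketching how the gates regulate the cell state (the parameter packing and names are illustrative, not from any particular library):
    ```python
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x, h_prev, c_prev, W, U, b):
        """One LSTM time step. W, U, b stack the parameters of the forget gate,
        input gate, candidate cell, and output gate."""
        z = W @ x + U @ h_prev + b
        f, i, g, o = np.split(z, 4)
        f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)
        g = np.tanh(g)
        c = f * c_prev + i * g        # gated update of the cell state ("conveyor belt")
        h = o * np.tanh(c)            # hidden state exposed to the next layer/step
        return h, c

    hidden, n_in = 3, 2
    rng = np.random.default_rng(0)
    W = rng.normal(size=(4 * hidden, n_in))
    U = rng.normal(size=(4 * hidden, hidden))
    b = np.zeros(4 * hidden)
    h, c = lstm_step(rng.normal(size=n_in), np.zeros(hidden), np.zeros(hidden), W, U, b)
    print(h.shape, c.shape)  # (3,) (3,)
    ```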

  • The Annotated Transformer

  • Attention? Attention!

  • Attention – Christopher Olah Attention and Augmented Recurrent Neural Networks
    Discusses the use of attention for various applications like translation, image captioning, and audio transcription.

  • Attention basics

  • Attention is not not Explanation

  • Seq2Seq - Nathan Lintz Sequence Modeling With Neural Networks
    Use of Seq2Seq: since the decoder model sees an encoded representation of the input sequence as well as the translation sequence, it can make more intelligent predictions about future words based on the current word. For example, in a standard language model, we might see the word "crane" and not be sure if the next word should be about the bird or heavy machinery. However, if we also pass an encoder context, the decoder might realize that the input sequence was about construction, not flying animals. Given the context, the decoder can choose the appropriate next word and provide more accurate translations.
    Without attention: compressing an entire input sequence into a single fixed vector tends to be quite challenging, and the context is biased towards the end of the encoder sequence, so it might miss important information at the start of the sequence.
    With attention: the mechanism holds onto all states from the encoder and gives the decoder a weighted average of the encoder states for each element of the decoder sequence. Now the decoder can take "glimpses" into the encoder sequence to figure out which element it should output next. The decoder network can use different portions of the encoder sequence as context while it is processing the decoder sequence, instead of a single fixed representation of the input sequence. This lets the network focus on the most important parts of the input sequence, producing smarter predictions for the next word in the decoder sequence, and it also helps gradients backpropagate to the different encoder states (see the sketch below).
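    A sketch of the attention step described above, in NumPy: for one decoder state, score every encoder state, softmax the scores, and take the weighted average as the context vector (dot-product scoring is just one common choice; the names are illustrative):
    ```python
    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def attention_context(decoder_state, encoder_states):
        """Weighted average of all encoder states ("glimpses"), recomputed
        for every decoder step."""
        scores = encoder_states @ decoder_state   # one score per source position
        weights = softmax(scores)                 # how much to look at each position
        return weights @ encoder_states, weights

    rng = np.random.default_rng(0)
    encoder_states = rng.normal(size=(5, 4))   # 5 source positions, hidden size 4
    decoder_state = rng.normal(size=4)
    context, weights = attention_context(decoder_state, encoder_states)
    print(weights.round(2), context.shape)     # weights sum to 1; context is (4,)
    ```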

  • The Transformer – Attention is all you need

  • Transformer Google Blog
    Need for the Transformer: recurrent models, due to their sequential nature (computation tied to the position of each symbol in the input and output), do not allow parallelization during training and have trouble learning long-term dependencies from memory.
    The constraint of sequential computation has also been attacked with CNN-based models. However, in those approaches the number of operations needed to relate an input position to an output position grows with the distance between them: O(n) for ConvS2S and O(log n) for ByteNet. This makes it harder to learn dependencies between distant positions.
    The Transformer reduces the number of sequential operations required to relate two symbols from the input/output sequences to a constant O(1). It achieves this with the multi-head attention mechanism, which allows modeling dependencies regardless of their distance in the input or output sentence.
    The novel approach of the Transformer, however, is to eliminate recurrence completely and replace it with attention to handle the dependencies between input and output. The Transformer moves the sweet spot of current ideas entirely toward attention: it eliminates not only recurrence but also convolution in favor of self-attention (a.k.a. intra-attention), and it leaves more room for parallelization. The authors claim it is the first model to rely entirely on self-attention to compute representations of its input and output. At each step the encoder-decoder model is auto-regressive, i.e. it uses previously generated symbols as extra input while generating the next symbol: x_i + y_{i-1} → y_i.
    In each step it applies a self-attention mechanism which directly models relationships between all words in a sentence, regardless of their respective positions. In the example "I arrived at the bank after crossing the river", to determine that the word "bank" refers to the shore of a river and not a financial institution, the Transformer can learn to immediately attend to the word "river" and make this decision in a single step.

    Positional embeddings: in the paper the authors decide on a fixed variant using sine and cosine functions, which gives the network information about each token's relative position in the sequence. The authors motivate the sinusoidal functions by the hope that they enable the model to generalize to sequences longer than those encountered during training.
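    A sketch of the fixed sinusoidal encoding, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), in NumPy:
    ```python
    import numpy as np

    def sinusoidal_positional_encoding(max_len, d_model):
        """Even dimensions use sin, odd dimensions use cos, with wavelengths
        forming a geometric progression from 2*pi to 10000*2*pi."""
        pos = np.arange(max_len)[:, None]        # (max_len, 1)
        dim = np.arange(0, d_model, 2)[None, :]  # even dimension indices 2i
        angles = pos / np.power(10000.0, dim / d_model)
        pe = np.zeros((max_len, d_model))
        pe[:, 0::2] = np.sin(angles)
        pe[:, 1::2] = np.cos(angles)
        return pe

    pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
    print(pe.shape)  # (50, 16); added to the token embeddings before the first layer
    ```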

    The Transformer reduces the number of operations required to relate (especially distant) positions in the input and output sequences to O(1). However, this comes at the cost of reduced effective resolution because of averaging over attention-weighted positions. To counteract this cost, the authors propose multi-head attention.

    Self-attention: in the encoder, the self-attention layers process queries, keys, and values that all come from the same place, i.e. the output of the previous encoder layer. Each position in the encoder can attend to all positions in the previous layer of the encoder.
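    A minimal NumPy sketch of the (single-head) scaled dot-product self-attention described above; multi-head attention runs several of these in parallel on lower-dimensional projections and concatenates the results:
    ```python
    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(X, Wq, Wk, Wv):
        """Queries, keys and values all come from the same place: the previous
        layer's output X. Every position attends to every position in one step."""
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len) pairwise scores
        return softmax(scores) @ V

    seq_len, d = 6, 8
    rng = np.random.default_rng(0)
    X = rng.normal(size=(seq_len, d))             # output of the previous encoder layer
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
    print(self_attention(X, Wq, Wk, Wv).shape)    # (6, 8)
    ```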

    In the encoder phase (Figure 1 in the blog post), the Transformer first generates an initial representation/embedding for each word in the input sentence (the empty circles). Next, for each word, self-attention aggregates information from all other words in the context of the sentence and creates a new representation (the filled circles). Successively building new representations on top of previous ones is repeated multiple times, in parallel for every word (the next layers of filled circles).

    The decoder acts similarly, generating one word at a time in a left-to-right pattern. It attends both to the previously generated decoder words and to the final representation of the encoder.

  • Reformer Google Blog

  • Reformer Blog: focuses primarily on how the self-attention operation scales with sequence length, and proposes an alternative attention mechanism for incorporating information from much longer contexts into language models.

  • BERT Google Blog: deeply bidirectional, vs. ELMo (a shallow way of plugging two representations together)
    Why does this matter? Pre-trained representations can either be context-free or contextual, and contextual representations can further be unidirectional or bidirectional. Context-free models such as word2vec or GloVe generate a single word embedding representation for each word in the vocabulary. For example, the word “bank” would have the same context-free representation in “bank account” and “bank of the river.” Contextual models instead generate a representation of each word that is based on the other words in the sentence. For example, in the sentence “I accessed the bank account,” a unidirectional contextual model would represent “bank” based on “I accessed the” but not “account.” However, BERT represents “bank” using both its previous and next context — “I accessed the ... account” — starting from the very bottom of a deep neural network, making it deeply bidirectional.
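    A toy NumPy illustration of the context-free vs. contextual distinction (not BERT itself; the vocabulary and mixing step are made up): a static lookup gives "bank" the same vector in every sentence, while even one round of self-attention-style mixing makes its representation depend on the surrounding words.
    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["i", "accessed", "the", "bank", "account", "of", "river"]
    emb = {w: rng.normal(size=4) for w in vocab}   # context-free lookup (word2vec-style)

    def contextual(sentence):
        """Each word's vector becomes an attention-weighted average over the
        sentence, so it now depends on its context."""
        X = np.stack([emb[w] for w in sentence])
        scores = X @ X.T / np.sqrt(X.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ X

    s1 = ["i", "accessed", "the", "bank", "account"]
    s2 = ["the", "bank", "of", "the", "river"]
    # emb["bank"] is the same vector in both sentences by construction, but:
    b1 = contextual(s1)[s1.index("bank")]
    b2 = contextual(s2)[s2.index("bank")]
    print(np.allclose(b1, b2))  # False: the contextual vectors for "bank" differ
    ```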

  • GPT-2 OpenAI Blog

  • GPT-2 Blog

  • ELECTRA Google Blog

  • Self-supervised learning

  • Autoregressive Models: at each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.
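    A sketch of the autoregressive loop in the abstract; next_token_distribution is a hypothetical stand-in for whatever model is being decoded from:
    ```python
    import numpy as np

    def greedy_decode(next_token_distribution, bos_id, eos_id, max_len=20):
        """Generate one symbol at a time, feeding everything generated so far
        back in as input when predicting the next symbol."""
        generated = [bos_id]
        for _ in range(max_len):
            probs = next_token_distribution(generated)   # conditioned on previous symbols
            next_id = int(np.argmax(probs))
            generated.append(next_id)
            if next_id == eos_id:
                break
        return generated

    # Toy stand-in model: always prefers the token after the last one, mod vocab size.
    vocab_size, eos_id = 10, 9
    toy_model = lambda prefix: np.eye(vocab_size)[(prefix[-1] + 1) % vocab_size]
    print(greedy_decode(toy_model, bos_id=0, eos_id=eos_id))  # [0, 1, 2, ..., 9]
    ```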

  • A Survey of Long-Term Context in Transformers

Resources

Topics

Self Supervised Learning
SVM, Kernels and Kernel Functions
K-means, PCA, SVD
Bagging Boosting
Feature Selection
Model Selection
Optimization Algorithms
HMM
Transformer
Active Learning
Dependency Parsing
POS tagging
AdaBoost, AdaGrad, Ensembles: check ML/NLP WhatsApp group
Random Forests