
Stanford CS224n Natural Language Processing with Deep Learning

The course notes about Stanford CS224n Winter 2019 (using PyTorch)

Some more general notes will go in my Deep Learning Practice repository

Course Related Links

Schedule

Week | Lectures | Assignments
2019/7/1~7/7 | Introduction and Word Vectors; Word Vectors 2 and Word Senses | Assignment 1
2019/7/8~7/14 | Word Window Classification, Neural Networks, and Matrix Calculus | -
2019/7/15~7/21 | Backpropagation and Computation Graphs | Assignment 2
2019/10/21~10/27 | Linguistic Structure: Dependency Parsing | -
2019/10/28~11/3 | Recurrent Neural Networks and Language Models | Assignment 3
2019/11/4~11/10 | Vanishing Gradients and Fancy RNNs; Machine Translation, Seq2Seq and Attention | Assignment 4
2019/11/11~11/17 | Transformers and Self-Attention For Generative Models; Modeling contexts of use: Contextual Representations and Pretraining | -
2019/11/18~11/24 | Practical Tips for Projects; Question Answering; ConvNets for NLP; Subword Models | Assignment 5
2019/11/25~12/1 | [Project: Question Answering]; Natural Language Generation | -
2019/12/2~12/8 | [Project: Question Answering] | -
2019/12/9~12/15 | Reference in Language and Coreference Resolution | -
2020/1/13~1/19 | Multitask Learning: A general model for NLP? | -

Lecture

  1. Introduction and Word Vectors
  2. Word Vectors 2 and Word Senses
  3. Word Window Classification, Neural Networks, and Matrix Calculus
  4. Backpropagation and Computation Graphs
  5. Linguistic Structure: Dependency Parsing
  6. The probability of a sentence? Recurrent Neural Networks and Language Models
  7. Vanishing Gradients and Fancy RNNs
  8. Machine Translation, Seq2Seq and Attention
  9. Practical Tips for Final Projects - Default Final Project
  10. Question Answering and the Default Final Project - Default Final Project
  11. ConvNets for NLP
  12. Information from parts of words: Subword Models - Assignment 5
  13. Modeling contexts of use: Contextual Representations and Pretraining - ELMo, BERT
  14. Transformers and Self-Attention For Generative Models - Self-attention, Transformer
  15. Natural Language Generation
  16. Reference in Language and Coreference Resolution
  17. Multitask Learning: A general model for NLP?
  18. Constituency Parsing and Tree Recursive Neural Networks - TODO
  19. Safety, Bias, and Fairness
  20. Future of NLP + Deep Learning

Assignment

  1. Exploring Word Vectors
  2. word2vec
    1. code
    2. written
  3. Dependency Parsing
    1. code
    2. written
  4. Neural Machine Translation
    1. code
    2. written
  5. Character-based Neural Machine Translation
    1. code
    2. written - TODO

Project

  1. Question Answering (Default)
  2. Summarization

Paper reading

  • word2vec
  • negative sampling
  • GloVe
  • improving distributional similarity
  • embedding evaluation methods
  • Transformer
  • ELMo
  • BERT
  • fastText

Derivation

  • backprop

Lectures

Lecture 1: Introduction and Word Vectors

Outline

  • Introduction to Word2vec
    • objective function
    • prediction function
    • how to train it
  • Optimization: Gradient Descent & Chain Rule
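
For quick reference, the objective and prediction function listed above, in the standard skip-gram notation (v_c for the center-word vector, u_o for an outside-word vector):

```latex
% Objective: average negative log-likelihood over a corpus of length T, window size m
J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta)

% Prediction function: softmax over the vocabulary V
P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}
```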

Lecture 2: Word Vectors 2 and Word Senses

Outline

  • More detail to Word2vec
    • Skip-grams (SG)
    • Continuous Bag of Words (CBOW)
  • Similarity visualization
  • Co-occurrence matrix + SVD (LSA) vs. Embedding
  • Evaluation on word vectors
    • Intrinsic
    • Extrinsic

CS 168 The Modern Algorithmic Toolbox - for SVD
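
A minimal sketch of intrinsic evaluation by cosine similarity on a toy embedding table (the words and random vectors here are purely illustrative):

```python
import numpy as np

# Toy embedding table; in practice the rows would be trained word vectors.
words = ["king", "queen", "man", "woman"]
E = np.random.default_rng(0).normal(size=(len(words), 50))
idx = {w: i for i, w in enumerate(words)}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Analogy test "man : king :: woman : ?": rank words by cosine similarity
# to the offset vector (king - man + woman).
query = E[idx["king"]] - E[idx["man"]] + E[idx["woman"]]
print(sorted(words, key=lambda w: -cosine(E[idx[w]], query)))
```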

Lecture 3: Word Window Classification, Neural Networks, and Matrix Calculus

Outline

  • Some basic ideas of NLP tasks
  • Matrix Calculus
    • Jacobian Matrix
    • Shape convention
  • Loss
    • Softmax
    • Cross-entropy
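
A minimal numpy sketch of the softmax + cross-entropy loss from the outline (toy logits, not the course code):

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise max for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, y):
    # probs: (batch, classes) predicted distribution; y: (batch,) gold class indices.
    return -np.log(probs[np.arange(len(y)), y]).mean()

logits = np.array([[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]])
gold = np.array([0, 1])
print(cross_entropy(softmax(logits), gold))
```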

Lecture 4: Backpropagation and Computation Graphs

Outline

  • Computational Graph
  • Backprop & Forwardprop
  • Introducing regularization to prevent overfitting
  • Non-linearity: activation functions
  • Practical Tips
    • Parameter Initialization
    • Optimizers
      • plain SGD
      • more sophisticated adaptive optimizers
    • Learning Rates
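
A minimal PyTorch sketch of forward/backward on a computation graph with one plain SGD step (toy shapes, not the assignment code):

```python
import torch
import torch.nn.functional as F

W = torch.randn(3, 2, requires_grad=True)   # parameters (leaf node of the graph)
x = torch.randn(2)                          # input
y = torch.tensor(1)                         # gold class

logits = W @ x                                              # forward pass builds the graph
loss = F.cross_entropy(logits.unsqueeze(0), y.unsqueeze(0))
loss.backward()                                             # backprop: fills W.grad

with torch.no_grad():
    W -= 0.1 * W.grad                                       # plain SGD update
    W.grad.zero_()
```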

Lecture 5: Linguistic Structure: Dependency Parsing

Outline

  • Methods of Dependency Parsing
    • Dynamic Programming
      • complexity O(n³)
    • Graph Algorithm
      • create a minimum spanning tree for a sentence
    • Constraint Satisfaction
      • edges are eliminated that don't satisfy hard constraints
    • Transition-based Parsing / Deterministic Dependency Parsing
      • greedy choice of attachments guided by a machine learning classifier
      • complexity O(n)
  • Operations of the Shift-reduce Parser
    • Shift
    • Left-Arc
    • Right-Arc
  • Attachment Errors
    • Prepositional Phrase Attachment Errors
    • Verb Phrase Attachment Errors
    • Modifier Attachment Errors
    • Coordination Attachment Errors

mentioned CS103, CS228
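
A toy sketch of the Shift / Left-Arc / Right-Arc operations listed above (the function and the transition encoding are mine, not the assignment's interface):

```python
def parse(sentence, transitions):
    # Arc-standard style transitions: stack starts with ROOT, buffer with the sentence.
    stack, buffer, arcs = ["ROOT"], list(sentence), []
    for t in transitions:
        if t == "S":                      # Shift: move next buffer word onto the stack
            stack.append(buffer.pop(0))
        elif t == "LA":                   # Left-Arc: second-to-top depends on the top
            dependent = stack.pop(-2)
            arcs.append((stack[-1], dependent))
        elif t == "RA":                   # Right-Arc: top depends on the second-to-top
            dependent = stack.pop()
            arcs.append((stack[-1], dependent))
    return arcs                           # list of (head, dependent) pairs

print(parse(["I", "ate", "fish"], ["S", "S", "LA", "S", "RA", "RA"]))
# [('ate', 'I'), ('ate', 'fish'), ('ROOT', 'ate')]
```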

Lecture 6: The probability of a sentence? Recurrent Neural Networks and Language Models

  • N-gram Language Model
  • Fixed-window Neural Language Model
  • vanilla RNN
  • Language Modeling: the task of predicting the next word, given the words so far
  • Language Model: a system that produces the probability distribution for the next candidate word
  • Conditional Language Modeling: the task of predicting the next word, given the words so far, and also some other input x
    • Machine Translation (x=source sentence, y=target sentence)
    • Summarization (x=input text, y=summarized text)
    • Dialogue (x=dialogue history, y=next utterance)
    • ...
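
For reference, the n-gram language model above reduces to a Markov assumption plus counting (shown here for a trigram model):

```latex
P(w_1, \dots, w_T) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-n+1}, \dots, w_{t-1})

P(w_t \mid w_{t-2}, w_{t-1}) = \frac{\operatorname{count}(w_{t-2}, w_{t-1}, w_t)}{\operatorname{count}(w_{t-2}, w_{t-1})}
```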

Lecture 7: Vanishing Gradients and Fancy RNNs

Vanishing gradient =>

  • LSTM and GRU
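
For reference, the standard LSTM cell equations; the additive update of the cell state c_t is what makes it easier for gradients to flow across many time steps:

```latex
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \qquad
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \qquad
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)

\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c) \qquad
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \qquad
h_t = o_t \odot \tanh(c_t)
```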

Lecture 8: Machine Translation, Seq2Seq and Attention

  • Training method: Teacher Forcing
    • During training, we feed the gold (aka reference) target sentence into the decoder, regardless of what the decoder predicts.
  • During testing (decoding): Beam Search vs. Greedy Decoding
    • Decoding Algorithm: an algorithm you use to generate text from your language model
      • Greedy Decoding => lack of backtracking
        • on each step take the most probable word (i.e. argmax)
        • use that as the next word, and feed it as input on the next step
        • keep going until you produce <END> or reach some max length
      • Beam Search: aims to find a high-probability sequence by tracking multiple possible sequences at once
        • on each step of the decoder, keep track of the k (beam size) most probable partial sequences (hypotheses)
        • after reaching a stopping criterion (e.g. n complete hypotheses, where each hypothesis ends when it produces <END> or hits the max depth), choose the sequence with the highest probability (with score normalization)
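
A toy sketch of beam search as described above; next_log_probs is a hypothetical stand-in for one decoder step (this is not the assignment's beam_search interface):

```python
import heapq

def beam_search(next_log_probs, beam_size=3, max_len=20, end_token="<END>"):
    # next_log_probs(prefix) -> {token: log_prob} for the next step given the prefix.
    beams = [(0.0, ["<START>"])]                # (cumulative log-prob, partial sequence)
    completed = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            for tok, lp in next_log_probs(seq).items():
                candidates.append((score + lp, seq + [tok]))
        beams = []
        for score, seq in heapq.nlargest(beam_size, candidates):
            (completed if seq[-1] == end_token else beams).append((score, seq))
        if not beams:                           # all k hypotheses finished early
            break
    # Length-normalize the scores so longer hypotheses are not unfairly penalized.
    return max(completed + beams, key=lambda s: s[0] / (len(s[1]) - 1))
```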

Lecture 13: Modeling contexts of use: Contextual Representations and Pretraining

ELMo, BERT

Lecture 14: Transformers and Self-Attention For Generative Models

guest lecture

Self-attention, Transformer

Lecture 9: Practical Tips for Final Projects

Vanishing Gradient, LSTM, GRU (again)

Lecture 10: Question Answering and the Default Final Project

some more Attention, mentioned CS 276: Information Retrieval and Web Search

Quick notes about QA:

  • QA types
    • Factoid QA: answer is an NER (some clear semantic type entity)
    • Extractive QA: answer must be a span (a sub-sequence of words) in the passage
      • e.g. SQuAD 1.X
      • defect: all questions have an answer in the paragraph => it turns into a kind of ranking task
    • Extractive QA + NoAnswer: some question might have no answer in the paragraph
      • e.g. SQuAD 2.0
      • limitation:
        • only span-based answers (no yes/no, counting, implicit why)
        • ...
    • Open-domain QA
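
For the extractive setting, a common span-prediction formulation (used by SQuAD-style models; details vary by architecture) scores every start and end position over the passage representations h_1, ..., h_N:

```latex
p_{\text{start}}(i) = \operatorname{softmax}_i\!\left(w_{\text{start}}^{\top} h_i\right) \qquad
p_{\text{end}}(j) = \operatorname{softmax}_j\!\left(w_{\text{end}}^{\top} h_j\right)

(\hat{i}, \hat{j}) = \arg\max_{i \le j} \; p_{\text{start}}(i)\, p_{\text{end}}(j)
```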

Lecture 11: ConvNets for NLP

mentioned CS231n: Convolutional Neural Networks for Visual Recognition

Lots of commonly used techniques (nowadays)

  • Model Comparison
    • Bag of Vectors: take the word vectors and average them
      • good baseline
      • works better when followed by a few ReLU layers
    • Window Model
      • good for single word classification (for problems that don't need wide context e.g. POS, NER)
    • CNNs
      • good for classification
      • need zero padding for shorter phrases
      • easy to parallelize
    • RNNs
      • cognitively plausible (reading from left to right)
      • not best for classification (if just use last state)
      • much slower than CNNs
      • good for sequence tagging
      • great for language models and can be amazing with attention mechanism
  • Dropout
    • for regularization => prevent overfitting
    • gives 2~4% accuracy improvement
  • Gated units used vertically: shortcut connections (needed for very deep networks to work)
    • Residual block
    • Highway block
  • BatchNorm
    • Z-transform: zero mean and unit variance
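
A minimal PyTorch sketch of the CNN-for-text idea compared above: a 1D convolution over word vectors followed by max-over-time pooling (all shapes are toy values):

```python
import torch
import torch.nn as nn

batch, seq_len, emb_dim, num_filters, kernel = 2, 7, 50, 100, 3

x = torch.randn(batch, seq_len, emb_dim)      # (batch, words, embedding dim)
conv = nn.Conv1d(emb_dim, num_filters, kernel, padding=1)

h = torch.relu(conv(x.transpose(1, 2)))       # Conv1d expects (batch, channels, words)
pooled = h.max(dim=2).values                  # max-over-time pooling -> (batch, filters)
print(pooled.shape)                           # torch.Size([2, 100])
```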

Lecture 12: Information from parts of words: Subword Models

fastText

Lecture 15: Natural Language Generation

Outline

  • Decoding methods
    • Greedy decoding
    • Beam search
    • Sampling-based decoding: good for open-ended/creative generation (poetry, stories)
      • Pure sampling: like greedy decoding, but sample instead of argmax
      • Top-n sampling: like pure sampling, but truncate the probability distribution

Softmax temperature: another way to control diversity
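
A minimal sketch that combines softmax temperature with top-n truncation in one sampling step (the helper and the toy logits are mine, not from the lecture code):

```python
import torch

def sample_next(logits, temperature=1.0, top_n=None):
    # Temperature > 1 flattens the distribution (more diverse); < 1 sharpens it.
    logits = logits / temperature
    if top_n is not None:
        # Top-n sampling: keep only the n most probable tokens, mask out the rest.
        top = torch.topk(logits, top_n)
        logits = torch.full_like(logits, float("-inf")).scatter(0, top.indices, top.values)
    probs = torch.softmax(logits, dim=0)
    return torch.multinomial(probs, 1).item()   # sampled token id

vocab_logits = torch.randn(10_000)              # pretend these came from a language model
print(sample_next(vocab_logits, temperature=0.8, top_n=40))
```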

  • NLG Tasks
    • Machine Translation
    • (Abstractive) Summarization
      • Evaluation: ROUGE
    • Dialogue
      • chit-chat
      • task-based
    • Creative writing
      • Storytelling
      • Poetry-generation
    • Freeform Question Answering
    • Image captioning
    • ...
  • NLG Evaluation Metrics
    • Word overlap based metrics
      • BLEU
      • ROUGE
      • METEOR
      • F1
      • ...
    • (Perplexity) doesn't tell you anything about generation
    • Word embedding based metrics
    • Human evaluation

Lecture 16: Reference in Language and Coreference Resolution

Outline

  • Coreference Resolution: identify all mentions that refer to the same real world entity
    • Application
      • Full text understanding
      • Machine translation
      • Dialogue systems
    • Steps (pipelined system)
      1. Detect the mentions => using other NLP systems
      2. Cluster the mentions
    • End-to-end system
    • Model
      • Rule-based (pronominal anaphora resolution)
        • can't solve sentences which have identical syntactic structure
      • Mention Pair
        • binary classifier: coreferent or not (for every pair of mentions)
        • clustering
          1. pick a threshold and add coreference links when above
          2. take the transitive closure to get the clustering
      • Mention Ranking
        1. assign each mention its highest scoring candidate antecedent
        2. add a dummy NA mention at the front (to allow declining to link)
      • Clustering
        • Agglomerative clustering
          1. start with each mention in its own singleton cluster
          2. merge a pair of clusters at each step
  • Mention: span of text referring to some entity
    1. pronouns
      • captured using a part-of-speech tagger
    2. named entities
      • captured using an NER system
    3. noun phrases
      • captured using a parser (especially a constituency parser)
  • Linguistics stuff
    • Coreference: two mentions refer to the same entity in the world
    • Anaphora: when a term (anaphor) refers to another term (antecedent)
      • Pronominal Anaphora (Coreferential one)
      • Bridging Anaphora (Not Coreferential)
    • Cataphora: when the antecedent comes after the anaphor (normally it comes before)

Lecture 17: Multitask Learning: A general model for NLP

Outline

  • Natural Language Decathlon (decaNLP)
  • 3 equivalent supertasks of NLP
    • Language Modeling
      • predict next word
      • embedding...
    • Question Answering Formalism (Multitask Learning as QA) => train a single question answering model for multiple NLP tasks (all framed as questions)
      • Question Answering
      • Machine Translation
      • Summarization
      • Natural Language Inference
      • Sentiment Classification
      • Semantic Role Labeling
      • Relation Extraction
      • Dialogue
      • Semantic Parsing
      • Commonsense Reasoning
    • Dialogue
  • Framework for tackling
    • more general language understanding
    • multitask learning
    • domain adaptation
    • transfer learning
    • weight sharing, pre-training, fine-tuning (towards ImageNet-CNN of NLP)
    • zero-shot learning

Assignments

Assignment 1: Exploring Word Vectors

Outline

  • co-occurrence matrix + Truncated SVD
  • pre-trained word2vec

Assignment 2: word2vec

  • handout
  • directory
    • written
    • code
      • python3 word2vec.py - check the correctness of word2vec
      • python3 sgd.py - check the correctness of SGD
      • ./get_datasets.sh; python3 run.py - training took 9480 seconds

Outline

  • Train word2vec with skip-gram model and negative sampling using stochastic gradient descent
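
For reference, the negative-sampling loss that replaces the full softmax in this assignment (standard formulation: c is the center word, o the observed outside word, and w_1, ..., w_K are K negative samples drawn from a noise distribution):

```latex
J_{\text{neg-sample}}(v_c, o, U) =
  -\log \sigma\!\left(u_o^{\top} v_c\right)
  - \sum_{k=1}^{K} \log \sigma\!\left(-u_{w_k}^{\top} v_c\right)
```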

Related

Others' Answer

Assignment 3: Dependency Parsing

A Fast and Accurate Dependency Parser using Neural Networks

  • handout
  • directory
    • written
    • code
      • python3 parser_transitions.py part_c - check the correctness of the transition mechanics
      • python3 parser_transitions.py part_d - check the correctness of minibatch parse
      • python3 run.py
        • set debug=True to test the process (debug_out.log)
        • set debug=False to train on the entire dataset (train_out.log)
          • best UAS on the dev set: 88.79 (epoch 9/10)
          • best UAS on the test set: 89.27

Outline

  • Adam Optimizer
  • Dropout
  • Neural Transition-based Dependency Parser (a shift-reduce parser)

Others' Answer

Assignment 4: Neural Machine Translation

  • handout
  • Azure Guide (Google Drive), Practical Guide to VMs (Google Drive)
  • directory
    • written - BLEU Verify
    • code
      • python3 sanity_check.py 1d check the correctness of encode procedure (including utils.pad_sents)
      • python3 sanity_check.py 1e check the correctness of decode procedure (including step function)
      • Preprocess the training data by sh run.sh vocab to get the necessary vocabulary
      • Test the functionality on CPU: train sh run.sh train_local; test sh run.sh test_local
        • (speed about 100 words/sec on Macbook Air 1.8GHz i5 CPU)
      • Train and Test with GPU: train sh run.sh train; test sh run.sh test
        • (speed about 5000 words/sec on Nvidia GeForce GTX 1080 GPU)
        • (this will generate model image model.bin and optimizers' state model.bin.optim)
        • early stop on epoch 13, iter 86000, cum. loss 28.94, cum. ppl 5.13 cum. examples 64000 => Corpus BLEU: 22.36579929869114
      • Compare output with references vim -dO outputs/test_outputs.txt en_es_data/test.en
      • Open three of them at the same time vim -o outputs/test_outputs.txt en_es_data/test.en en_es_data/test.es

Others' Answer

Assignment 5: Character-based Neural Machine Translation

Build a character-level ConvNet

  • handout
  • directory
    • written
    • code
      • Create the correct vocab files sh run.sh vocab
        • vocab_tiny_q1.json: generated vocabulary, source 132 words, target 132 words
          • source: number of word types: 128, number of word types w/ frequency >= 1: 128
          • target: number of word types: 130, number of word types w/ frequency >= 1: 130
        • vocab_tiny_q2.json: generated vocabulary, source 26 words, target 32 words
          • source: number of word types: 128, number of word types w/ frequency >= 2: 22
          • target: number of word types: 130, number of word types w/ frequency >= 2: 30
        • vocab.json: generated vocabulary, source 50004 words, target 50002 words
          • source: number of word types: 172418, number of word types w/ frequency >= 2: 80623
          • target: number of word types: 128873, number of word types w/ frequency >= 2: 64215
      • Sanity Checks python3 sanity_check.py [part]
        • pre-defined: (1e, 1f, 1j, 2a, 2b, 2c, 2d)
        • customized: (1g, 1h, 1i, 1j)
      • Test the first part of the code locally
        • sh run.sh train_local_q1 - this will run 100 epochs
          • epoch 100, iter 500, cum. loss 0.31, cum. ppl 1.02 cum. examples 200
          • validation: iter 500, dev. ppl 1.003381
        • sh run.sh test_local_q1 - the model should overfit => Corpus BLEU: 99.29792465574434 (> 99)
          • this will generate outputs/test_outputs_local_q1.txt
      • Test the second part of the code locally
        • sh run.sh train_local_q2
          • epoch 200, iter 1000, cum. loss 0.26, cum. ppl 1.01 cum. examples 200
          • validation: iter 1000, dev. ppl 1.003469
        • sh run.sh test_local_q2 - the model should overfit => Corpus BLEU: 99.29792465574434
          • this will generate outputs/test_outputs_local_q2.txt
      • Train the model with sh run.sh train and test the performance with sh run.sh test
        • epoch 29, iter 196330, avg. loss 90.37, avg. ppl 147.15 cum. examples 10537, speed 3512.25 words/sec, time elapsed 29845.45 sec
        • reached maximum number of epochs! => Corpus BLEU: 24.20035238301319

TODO:

  • Enrich the sanity check of the Highway
  • Enrich the sanity check of the CNN
  • Compare the output with Assignment 4 (especially the <unk> words)
  • Written part

Projects

Question Answering on SQuAD

SQuAD is NOT a Natural Language Generation task (since the answer is extracted from the text).

Default final project

Summarization

  • Dataset
  • Metrics
    • ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
    • with small scale human eval
  • Baseline
    • Simplest model
      • Logistic Regression on unigrams and bigrams
      • Averaging word vectors
    • Lede-3 baseline

Book

O'Reilly Natural Language Processing with PyTorch

Recommended in Lecture 11


PyTorch notes