
🔠 Deep Learning for Natural Language Processing


📱 Applications
📋 Pipeline
📏 Scores
👨🏻‍🏫 Transfer Learning
🤖 Transformers theory
🔮 DL Models
📦 Python Packages

📱 Applications

| Application | Description | Type |
|---|---|---|
| 🏷️ Part-of-speech tagging (POS) | Identify whether each word is a noun, verb, adjective, etc. (aka parsing). | 🔤 |
| 📍 Named entity recognition (NER) | Identify names, organizations, locations, medical codes, times, etc. | 🔤 |
| 👦🏻❓ Coreference Resolution | Identify multiple occurrences of the same person/object, e.g. "he", "she". | 🔤 |
| 🔍 Text categorization | Identify the topics present in a text (sports, politics, etc.). | 🔤 |
| Question answering | Answer questions about a given text (SQuAD, DROP datasets). | 💭 |
| 👍🏼👎🏼 Sentiment analysis | Classify a comment/review as positive or negative. | 💭 |
| 🔮 Language Modeling (LM) | Predict the next word. Unsupervised. | 💭 |
| 🔮 Masked Language Modeling (MLM) | Predict the omitted words. Unsupervised. | 💭 |
| 🔮 Next Sentence Prediction (NSP) | Predict whether the second sentence follows the first. Unsupervised. | 💭 |
| 📗→📄 Summarization | Create a short version of a text. | 💭 |
| 🈯→🆗 Translation | Translate into a different language. | 💭 |
| 🆓→🆒 Dialogue bot | Interact in a conversation. | 💭 |
| 💁🏻→🔠 Speech recognition | Speech to text. See the AUDIO cheatsheet. | 🗣️ |
| 🔠→💁🏻 Speech generation | Text to speech. See the AUDIO cheatsheet. | 🗣️ |
  • 🔤: Natural Language Processing (NLP)
  • 💭: Natural Language Understanding (NLU)
  • 🗣️: Speech and sound (speak and listen)

📋 Pipeline

  1. Preprocess
    • Tokenization: Split the text into sentences and the sentences into words.
    • Lowercasing: Usually done during tokenization.
    • Punctuation removal: Remove tokens like ".", ",", ":". Usually done during tokenization.
    • Stopwords removal: Remove very common words like "and", "the", "him". Mostly done in older pipelines.
    • Lemmatization: Reduce words to their dictionary form: organizes, will organize, organizing → organize. This is more accurate.
    • Stemming: Crudely chop words down to a root: democratic, democratization → democrat. This is faster.
    • Subword tokenization: Split rare words into smaller units (e.g. BPE). Used in transformers. ⭐
  2. Extract features
    • Document features
      • Bag of Words (BoW): Counts how many times each word appears in a text (it can be normalized by text length).
      • TF-IDF: Measures each word's relevance to a document rather than its raw frequency like BoW (see the sketch after this list).
      • N-gram: Probability of N words together.
      • Sentence and document vectors. paper2014, paper2017
    • Word features
      • Word Vectors: Unique representation for every word (independent of its context).
        • Word2Vec: By Google in 2013
        • GloVe: By Stanford
        • FastText: By Facebook
      • Contextualized Word Vectors: Good for polysemic words (their meaning depends on context).
        • CoVe: In 2017
        • ELMo: Done with bidirectional LSTMs. By the Allen Institute in 2018
        • Transformer encoder: Done with self-attention. ⭐
  3. Build model
    • Bag of Embeddings
    • Linear algebra/matrix decomposition
      • Latent Semantic Analysis (LSA) that uses Singular Value Decomposition (SVD).
      • Non-negative Matrix Factorization (NMF)
      • Latent Dirichlet Allocation (LDA): Good for BoW
    • Neural nets
      • Recurrent NNs decoder (LSTM, GRU)
      • Transformer decoder (GPT, BERT, ...) ⭐
    • Hidden Markov Models
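
A minimal sketch of the document-feature step above, using scikit-learn's CountVectorizer and TfidfVectorizer (scikit-learn is an assumption here; it is not one of the packages listed later in this cheatsheet):

# Minimal sketch: Bag of Words vs TF-IDF document features (assumes scikit-learn)
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog ate my homework",
        "the cat ate the fish"]

bow   = CountVectorizer().fit(docs)     # Bag of Words: raw per-document word counts
tfidf = TfidfVectorizer().fit(docs)     # TF-IDF: counts re-weighted by how rare each word is

print(sorted(bow.vocabulary_))          # learned vocabulary (one column per word)
print(bow.transform(docs).toarray())    # (n_docs, vocab_size) count matrix
print(tfidf.transform(docs).toarray())  # same shape; words shared by every document get a lower idf

Both return a (documents × vocabulary) matrix that can feed the models in step 3.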

Others

  • Regular expressions (Regex): Find patterns in raw text (see the example below).
  • Parse trees: Represent the syntax of a sentence.
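
A tiny illustration of the regex point, using Python's built-in re module (the date pattern is just a made-up example):

import re

text = "The paper came out on 2018-10-11 and was updated on 2019-05-24."
print(re.findall(r"\d{4}-\d{2}-\d{2}", text))  # ['2018-10-11', '2019-05-24']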

Seq2seq

  • Recurrent nets
    • GRU
    • LSTM
  • Tricks
    • Teacher forcing: Feed the decoder the correct previous word instead of its own predicted previous word (mostly at the beginning of training). See the sketch after this list.
    • Attention: Learns weights to perform a weighted average of the word embeddings.
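
A sketch of teacher forcing inside a seq2seq decoding loop. This is illustrative PyTorch: `decoder` is a hypothetical module that takes (previous token, hidden state) and returns (logits, new hidden state), not an API from any package above.

import random
import torch

def decode_with_teacher_forcing(decoder, hidden, target, sos_id, teacher_forcing_ratio=0.5):
    """Generate target.size(1) steps, feeding either the ground-truth previous word
    (teacher forcing) or the model's own previous prediction."""
    prev = torch.full((target.size(0), 1), sos_id, dtype=torch.long)  # start with <sos>
    outputs = []
    for t in range(target.size(1)):
        logits, hidden = decoder(prev, hidden)   # one step: (batch, 1, vocab)
        outputs.append(logits)
        if random.random() < teacher_forcing_ratio:
            prev = target[:, t:t + 1]            # teacher forcing: correct previous word
        else:
            prev = logits.argmax(dim=-1)         # free running: predicted previous word
    return torch.cat(outputs, dim=1)             # (batch, seq_len, vocab)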

🤖 Transformers

Transformer input

  1. Tokenizer: Create subword tokens. Methods: BPE...
  2. Embedding: Create vectors for each token. Sum of:
    • Token Embedding
    • Positional Encoding: Information about token order (e.g. a sinusoidal function; sketched after this list).
  3. Dropout
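
A sketch of the sinusoidal positional encoding (the formulation from the original Transformer paper), written with PyTorch tensors as an assumption about the reader's framework:

import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)),  PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))"""
    pos    = torch.arange(seq_len).float().unsqueeze(1)    # (seq_len, 1)
    two_i  = torch.arange(0, d_model, 2).float()           # even dimension indices
    angles = pos / torch.pow(10000.0, two_i / d_model)     # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

# Transformer input = dropout(token_embedding + positional_encoding)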

Transformer blocks (6, 12, 24,...)

  1. Normalization
  2. Multi-head attention layer (with a left-to-right attention mask)
    • Each attention head uses self-attention to process each input token conditioned on the other input tokens.
    • The left-to-right attention mask ensures that each position attends only to the positions that precede it (see the PyTorch sketch after this list).
  3. Normalization
  4. Feed forward layers:
    1. Linear H→4H
    2. GeLU activation func
    3. Linear 4H→H
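
The block above put together as a PyTorch sketch (pre-norm, GPT-2 style, with the residual connections the list leaves implicit; layer sizes are placeholders, and it assumes a recent PyTorch with batch_first attention):

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Norm -> masked multi-head self-attention -> Norm -> feed-forward (H -> 4H -> GELU -> H)."""
    def __init__(self, h=768, n_heads=12):
        super().__init__()
        self.ln1  = nn.LayerNorm(h)
        self.attn = nn.MultiheadAttention(h, n_heads, batch_first=True)
        self.ln2  = nn.LayerNorm(h)
        self.ff   = nn.Sequential(nn.Linear(h, 4 * h), nn.GELU(), nn.Linear(4 * h, h))

    def forward(self, x):                                  # x: (batch, seq_len, h)
        t = x.size(1)
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)  # hide future positions
        a = self.ln1(x)
        a, _ = self.attn(a, a, a, attn_mask=mask)          # left-to-right self-attention
        x = x + a                                          # residual connection
        x = x + self.ff(self.ln2(x))                       # residual connection
        return x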

Transformer output

  1. Normalization
  2. Output embedding
  3. Softmax
  4. Label smoothing: Soften the ground truth, e.g. 90% on the correct word and the remaining 10% spread over the other words (sketched below).
  • Lowest layers: morphology
  • Middle layers: syntax
  • Highest layers: task-specific semantics
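
A sketch of the label smoothing step above: 90% of the probability mass on the correct word and the remaining 10% split evenly over the rest of the vocabulary (PyTorch assumed):

import torch

def smooth_labels(targets, vocab_size, smoothing=0.1):
    """Replace one-hot targets with: (1 - smoothing) on the correct word,
    smoothing / (vocab_size - 1) on every other word."""
    dist = torch.full((targets.size(0), vocab_size), smoothing / (vocab_size - 1))
    dist.scatter_(1, targets.unsqueeze(1), 1.0 - smoothing)
    return dist

print(smooth_labels(torch.tensor([2, 0]), vocab_size=5))
# Each row sums to 1, with 0.9 on the correct word and 0.025 on each of the others.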

📏 Scores

| Score | For what? | Description | Interpretation |
|---|---|---|---|
| Perplexity | LM | Exponentiated average negative log-likelihood of the text. | The lower the better. |
| GLUE | NLU | An average of different NLU task scores. | The higher the better. |
| BLEU | Translation | Compares generated sentences with reference sentences (N-gram overlap). | The higher the better. |

BLEU limitation

"He ate the apple" & "He ate the potato" has the same BLEU score.

👨🏻‍🏫 Transfer Learning

| Step | Task | Data | Who does this? |
|---|---|---|---|
| 1 | [Masked] Language Model pretraining | 📚 A large text corpus (e.g. Wikipedia) | 🏭 Google or Facebook |
| 2 | [Masked] Language Model fine-tuning | 📗 Your domain text corpus | 💻 You |
| 3 | Your supervised task (classification, etc.) | 📗🏷️ Your labeled domain text | 💻 You |
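
A minimal sketch of steps 2-3 with the pytorch-transformers package listed below (model name, data and hyper-parameters are illustrative; the loop is reduced to one training step and the attention mask is omitted for brevity):

import torch
from pytorch_transformers import BertTokenizer, BertForSequenceClassification

# Step 1 was done for us: load the pretrained masked-language-model weights.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model     = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Step 3: your labeled domain text (two toy examples).
texts  = ["great product, works perfectly", "awful support, never again"]
labels = torch.tensor([1, 0])

ids = [tokenizer.encode("[CLS] " + t + " [SEP]") for t in texts]
max_len = max(len(x) for x in ids)
input_ids = torch.tensor([x + [0] * (max_len - len(x)) for x in ids])  # pad with [PAD] = 0

optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
loss, logits = model(input_ids, labels=labels)[:2]  # pytorch-transformers returns a tuple
loss.backward()
optimizer.step()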

📦 Python Packages

| Package | Description | Type |
|---|---|---|
| SpaCy | Parse trees, excellent tokenizer (8 languages). | 🔤 |
| Gensim | Semantic analysis, topic modeling and similarity detection. | 🔤 |
| NLTK | Very broad NLP library. Not SotA. | 🔤 |
| SentencePiece | Unsupervised text tokenizer by Google. | 🔤 |
| Fast.ai NLP | ULMFiT fine-tuning. | 🔤 |
| TorchText | PyTorch NLP subpackage. | 🔤 |
| fastText | Word vector representations and sentence classification (157 languages). | 🔤 |
| pytorch-transformers | 8 pretrained PyTorch transformers. | 🔤 |
| spacy-pytorch-transformers | SpaCy + pytorch-transformers. | 🔤 |
| fast-bert | Super easy library for BERT based models. | 🔤 |
| StanfordNLP | Pretrained models for 53 languages. | 🔤 |
| PyText | NLP modeling framework built on PyTorch, by Facebook. | 🔤 |
| AllenNLP | An open-source NLP research library, built on PyTorch. | 🔤 |
| FARM | Fast & easy NLP transfer learning for the industry. | 🔤 |
|  | NLP library designed for reproducible experimentation management. | 🔤 |
| Flair | A very simple framework for state-of-the-art NLP. | 🔤 |
| NLP Architect | SotA NLP deep learning topologies and techniques. | 🔤 |
| Finetune | Scikit-learn style model finetuning for NLP. | 🔤 |

Installation

pip install spacy
python -m spacy download en_core_web_sm
python -m spacy download es_core_news_sm
python -m spacy download es_core_news_md

Usage

import spacy

nlp = spacy.load("en_core_web_sm")  # Load English small model
nlp = spacy.load("es_core_news_sm") # Load Spanish small model without Word2Vec
nlp = spacy.load('es_core_news_md') # Load Spanish medium model with Word2Vec


text = nlp("Hola, me llamo Javi")   # Text from string
text = nlp(open("file.txt").read()) # Text from file


spacy.displacy.render(text, style='ent', jupyter=True)  # Display text entities
spacy.displacy.render(text, style='dep', jupyter=True)  # Display word dependencies

Word2Vec

es_core_news_md has 534k keys, 20k unique vectors (50 dimensions)

coche = nlp("coche")
moto  = nlp("moto")
print(coche.similarity(moto)) # Cosine similarity between the word vectors

coche[0].vector      # Show vector

🔮 Deep learning models


🤗 means a pretrained PyTorch implementation is available in the pytorch-transformers package developed by Hugging Face.

| Model | Creator | Date | Brief description | Data | 🤗 |
|---|---|---|---|---|---|
| 1st Transformer | Google | Jun. 2017 | Encoder & decoder transformer with attention | | |
| ULMFiT | Fast.ai | Jan. 2018 | Regular LSTM | | |
| ELMo | AllenNLP | Feb. 2018 | Bidirectional LSTM | | |
| GPT | OpenAI | Jun. 2018 | Transformer on LM | | |
| BERT | Google | Oct. 2018 | Transformer on MLM (& NSP) | 16GB | |
| Transformer-XL | Google/CMU | Jan. 2019 | | | |
| XLM/mBERT | Facebook | Jan. 2019 | Multilingual LM | | |
| Transf. ELMo | AllenNLP | Jan. 2019 | | | |
| GPT-2 | OpenAI | Feb. 2019 | Good text generation | | |
| ERNIE | Baidu research | Apr. 2019 | | | |
| XLNet | Google/CMU | Jun. 2019 | BERT + Transformer-XL | 130GB | |
| RoBERTa | Facebook | Jul. 2019 | BERT without NSP | 160GB | |
| MegatronLM | Nvidia | Aug. 2019 | Big models with parallel training | | |
| DistilBERT | Hugging Face | Aug. 2019 | Compressed BERT | 16GB | |
| MiniBERT | Google | Aug. 2019 | Compressed BERT | | |
| ALBERT | Google | Sep. 2019 | Parameter reduction on BERT | | |

https://huggingface.co/pytorch-transformers/pretrained_models.html

| Model | 2L | 3L | 6L | 12L | 18L | 24L | 36L | 48L | 54L | 72L |
|---|---|---|---|---|---|---|---|---|---|---|
| 1st Transformer | | | yes | | | | | | | |
| ULMFiT | | yes | | | | | | | | |
| ELMo | yes | | | | | | | | | |
| GPT | | | | 110M | | | | | | |
| BERT | | | | 110M | | 340M | | | | |
| Transformer-XL | | | | | 257M | | | | | |
| XLM/mBERT | | | | Yes | | Yes | | | | |
| Transf. ELMo | | | | | | | | | | |
| GPT-2 | | | | 117M | | 345M | 762M | 1542M | | |
| ERNIE | | | | Yes | | | | | | |
| XLNet | | | | 110M | | 340M | | | | |
| RoBERTa | | | | 125M | | 355M | | | | |
| MegatronLM | | | | | | 355M | | | 2500M | 8300M |
| DistilBERT | | | 66M | | | | | | | |
| MiniBERT | | Yes | | | | | | | | |
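
For example, loading one of the pretrained checkpoints above from pytorch-transformers and predicting the next token with its LM head (the "gpt2" checkpoint is the 117M model; treat this as an illustrative minimum rather than a full generation loop):

import torch
from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # 117M-parameter GPT-2
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

input_ids = torch.tensor([tokenizer.encode("Deep learning for natural language")])
with torch.no_grad():
    logits = model(input_ids)[0]                    # (1, seq_len, vocab_size)
next_token_id = int(logits[0, -1].argmax())
print(tokenizer.decode([next_token_id]))            # most likely next token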

References

Fast.ai NLP Videos

  1. What is NLP?
  2. Topic Modeling with SVD & NMF
  3. Topic Modeling & SVD revisited
  4. Sentiment Classification with Naive Bayes
  5. Sentiment Classification with Naive Bayes & Logistic Regression, contd.
  6. Derivation of Naive Bayes & Numerical Stability
  7. Revisiting Naive Bayes, and Regex
  8. Intro to Language Modeling
  9. Transfer learning
  10. ULMFit for non-English Languages
  11. Understanding RNNs
  12. Seq2Seq Translation
  13. Word embeddings quantify 100 years of gender & ethnic stereotypes
  14. Text generation algorithms
  15. Implementing a GRU
  16. Algorithmic Bias
  17. Introduction to the Transformer
  18. The Transformer for language translation
  19. What you need to know about Disinformation