Natural Language Processing
| Application | Description | Type |
|---|---|---|
| 🏷️ Part-of-speech tagging (POS) | Identify whether each word is a noun, verb, adjective, etc. | 🔤 |
| 📍 Named entity recognition (NER) | Identify names, organizations, locations, medical codes, times, etc. | 🔤 |
| 👦🏻❓ Coreference resolution | Identify multiple mentions of the same person/object, e.g. he, she. | 🔤 |
| 🔍 Text categorization | Identify topics present in a text (sports, politics, etc.). | 🔤 |
| ❓ Question answering | Answer questions about a given text (SQuAD, DROP datasets). | 💭 |
| 👍🏼👎🏼 Sentiment analysis | Positive or negative comment/review classification. | 💭 |
| 🔮 Language Modeling (LM) | Predict the next word. Unsupervised. | 💭 |
| 🔮 Masked Language Modeling (MLM) | Predict the omitted words. Unsupervised. | 💭 |
| 🔮 Next Sentence Prediction (NSP) | Predict whether the second sentence follows the first. Unsupervised. | 💭 |
| 📗→📄 Summarization | Create a short version of a text. | 💭 |
| 🈯→🆗 Translation | Translate into a different language. | 💭 |
| 🆓→🆒 Dialogue bot | Interact in a conversation. | 💭 |
| 💁🏻→🔠 Speech recognition | Speech to text. See AUDIO cheatsheet. | 🗣️ |
| 🔠→💁🏻 Speech generation | Text to speech. See AUDIO cheatsheet. | 🗣️ |

🔤: Natural Language Processing (NLP)
💭: Natural Language Understanding (NLU)
🗣️: Speech and sound (speak and listen)
Preprocess
Tokenization: Split the text into sentences and the sentences into words (sketch below).
Lowercasing: Usually done during tokenization.
Punctuation removal: Remove tokens like ".", ",", ":". Usually done during tokenization.
Stopword removal: Remove words like "and", "the", "him". Done in the past (modern models usually keep them).
Lemmatization: Reduce words to their dictionary form: organizes, will organize, organizing → organize. More accurate.
Stemming: Crudely chop word endings to a root: democratic, democratization → democrat. Faster.
Subword tokenization: Used in transformers. ⭐
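A minimal sketch of these preprocessing steps with spaCy (assumes the en_core_web_sm model from the spaCy section below; the example sentence is illustrative):

```python
# Tokenization, lowercasing, punctuation/stopword removal and lemmatization with spaCy
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cats were organizing the meetings, and he joined them.")

tokens = [t.text.lower() for t in doc]                         # tokenization + lowercasing
clean  = [t for t in doc if not t.is_punct and not t.is_stop]  # punctuation & stopword removal
lemmas = [t.lemma_ for t in clean]                             # lemmatization to root forms
print(tokens)
print(lemmas)
```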
Extract features
Document features
Bag of Words (BoW): Counts how many times a word appears in a text (can be normalized by text length). See the sketch below.
TF-IDF: Measures the relevance of each word in a document, not just its frequency like BoW.
N-gram: Probability of N words appearing together.
Sentence and document vectors: paper2014, paper2017.
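A minimal sketch of BoW and TF-IDF document features with scikit-learn (toy corpus; `get_feature_names_out` assumes scikit-learn ≥ 1.0):

```python
# Bag of Words (raw counts) and TF-IDF (relevance-weighted) document vectors
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["the cat sat on the mat", "the dog ate my homework"]

bow = CountVectorizer()                      # Bag of Words: raw term counts per document
X_bow = bow.fit_transform(corpus)            # shape: (n_documents, vocabulary_size)

tfidf = TfidfVectorizer(ngram_range=(1, 2))  # unigrams + bigrams, weighted by relevance
X_tfidf = tfidf.fit_transform(corpus)

print(bow.get_feature_names_out())           # learned vocabulary
print(X_tfidf.toarray())                     # each row is a document vector
```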
Word features
Word Vectors: A unique representation for every word, independent of its context (sketch below).
Word2Vec: By Google in 2013.
GloVe: By Stanford.
FastText: By Facebook.
Contextualized Word Vectors: Good for polysemic words (the meaning depends on the context).
CoVe: In 2017.
ELMo: Done with bidirectional LSTMs. By the Allen Institute in 2018.
Transformer encoder: Done with self-attention. ⭐
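A minimal sketch of training static word vectors with gensim's Word2Vec (toy corpus; gensim ≥ 4.0 API assumed; contextualized vectors come from the transformer models covered below):

```python
# Skip-gram Word2Vec on a tiny corpus; real vectors are trained on large corpora
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "chased", "the", "cat"]]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)  # sg=1 -> skip-gram
print(model.wv["cat"])                     # 50-dimensional vector for "cat"
print(model.wv.similarity("cat", "dog"))   # cosine similarity between the two vectors
```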
Build model
Bag of Embeddings
Linear algebra/matrix decomposition
Latent Semantic Analysis (LSA) that uses Singular Value Decomposition (SVD).
Non-negative Matrix Factorization (NMF)
Latent Dirichlet Allocation (LDA): Good for BoW
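A minimal sketch of LDA and NMF topic modeling with scikit-learn (toy corpus and two topics, purely illustrative):

```python
# LDA on BoW counts and NMF on TF-IDF, both producing per-document topic mixtures
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

corpus = ["soccer match goal team win",
          "election vote government law",
          "tennis player match win"]

counts = CountVectorizer().fit_transform(corpus)            # LDA works on raw BoW counts
lda = LatentDirichletAllocation(n_components=2).fit(counts)

tfidf = TfidfVectorizer().fit_transform(corpus)             # NMF is usually applied to TF-IDF
nmf = NMF(n_components=2).fit(tfidf)

print(lda.transform(counts))                                # per-document topic mixture
```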
Neural nets
Recurrent NNs decoder (LSTM, GRU)
Transformer decoder (GPT, BERT, ...) ⭐
Hidden Markov Models
Regular expressions : (Regex) Find patterns.
Parse trees : Syntax of a sentence.
Recurrent nets
Tricks
Teacher forcing: Feed the decoder the correct previous word instead of the predicted previous word (at the beginning of training).
Attention: Learns weights to perform a weighted average of the words embeddings.
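A minimal sketch of that weighted average in PyTorch (scaled dot-product form; random tensors stand in for real embeddings and the query is illustrative):

```python
# Attention = softmax(similarity scores) used as weights over the word embeddings
import torch
import torch.nn.functional as F

embeddings = torch.randn(5, 64)              # 5 tokens, 64-dim embeddings
query = torch.randn(1, 64)                   # what the decoder is "looking for"

scores = query @ embeddings.T / 64 ** 0.5    # similarity of the query with each token
weights = F.softmax(scores, dim=-1)          # attention weights, sum to 1
context = weights @ embeddings               # weighted average of the embeddings
print(weights)
print(context.shape)                         # torch.Size([1, 64])
```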
Tokenizer : Create subword tokens. Methods: BPE...
Embedding : Create vectors for each token. Sum of:
Token Embedding
Positional Encoding : Information about tokens order (e.g. sinusoidal function).
Dropout
Transformer blocks (6, 12, 24,...)
Normalization
Multi-head attention layer (with a left-to-right attention mask )
Each attention head uses self attention to process each token input conditioned on the other input tokens.
Left-to-right attention mask ensures that each token only attends to the positions that precede it (to its left). See the sketch after this list.
Normalization
Feed forward layers:
Linear H→4H
GeLU activation func
Linear 4H→H
Normalization
Output embedding
Softmax
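A minimal sketch of one such decoder block in PyTorch (hidden size, head count and the pre-norm placement are assumptions, as these details vary by model; `batch_first` requires PyTorch ≥ 1.9):

```python
# One GPT-style block: norm -> masked multi-head self-attention -> norm -> feed forward,
# with residual connections around both sublayers
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, H=768, n_heads=12, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(H)
        self.attn = nn.MultiheadAttention(H, n_heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(H)
        self.ff = nn.Sequential(nn.Linear(H, 4 * H), nn.GELU(), nn.Linear(4 * H, H))

    def forward(self, x):
        T = x.size(1)
        # Left-to-right mask: True entries are blocked, so token i only attends to j <= i
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out                    # residual connection around attention
        x = x + self.ff(self.norm2(x))      # feed forward: Linear H->4H, GeLU, Linear 4H->H
        return x

x = torch.randn(2, 10, 768)                 # (batch, tokens, H)
print(DecoderBlock()(x).shape)              # torch.Size([2, 10, 768])
```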
Label smoothing: Ground truth → 90% for the correct word, with the remaining 10% spread over the other words.
Lowest layers: morphology
Middle layers: syntax
Highest layers: Task-specific semantics
| Score | For what? | Description | Interpretation |
|---|---|---|---|
| Perplexity | LM | How well the model predicts the text (exponential of the cross-entropy). | The lower the better. |
| GLUE | NLU | An average of different scores. | |
| BLEU | Translation | Compare generated sentences with reference sentences (N-gram overlap). | The higher the better. Caveat: against the same reference, "He ate the apple" and "He ate the potato" get the same BLEU score. |
| Step | Task | Data | Who does this? |
|---|---|---|---|
| 1 | [Masked] Language Model pretraining | 📚 Large text corpus (e.g. Wikipedia) | 🏭 Google or Facebook |
| 2 | [Masked] Language Model fine-tuning | 📗 Your domain text corpus only | 💻 You |
| 3 | Your supervised task (classification, etc.) | 📗🏷️ Your labeled domain text | 💻 You |
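A minimal sketch of step 3 with the pytorch-transformers package (the model name, example text, label and learning rate are illustrative; a real setup would batch, pad and iterate over many labeled examples):

```python
# Supervised fine-tuning: a pretrained BERT encoder plus a new classification head
import torch
from pytorch_transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

tokens = ["[CLS]"] + tokenizer.tokenize("great movie, loved it") + ["[SEP]"]
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])  # batch of one example
label = torch.tensor([1])                                            # 1 = positive

optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
model.train()
loss, logits = model(input_ids, labels=label)[:2]  # loss against your labeled domain text
loss.backward()
optimizer.step()
```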
Packages
| Description | Type |
|---|---|
| Parse trees, excellent tokenizer (8 languages). | 🔤 |
| Semantic analysis, topic modeling and similarity detection. | 🔤 |
| Very broad NLP library. Not SotA. | 🔤 |
| Unsupervised text tokenizer by Google. | 🔤 |
| Fast.ai NLP: ULMFiT fine-tuning. | 🔤 |
| TorchText (PyTorch subpackage). | 🔤 |
| Word vector representations and sentence classification (157 languages). | 🔤 |
| pytorch-transformers: 8 pretrained PyTorch transformers. | 🔤 |
| SpaCy + pytorch-transformers. | 🔤 |
| Super easy library for BERT-based models. | 🔤 |
| Pretrained models for 53 languages. | 🔤 |
| An open-source NLP research library, built on PyTorch. | 🔤 |
| Fast & easy NLP transfer learning for the industry. | 🔤 |
| NLP library designed for reproducible experimentation management. | 🔤 |
| A very simple framework for state-of-the-art NLP. | 🔤 |
| SotA NLP deep learning topologies and techniques. | 🔤 |
| Scikit-learn style model finetuning for NLP. | 🔤 |
```bash
pip install spacy
python -m spacy download en_core_web_sm
python -m spacy download es_core_news_sm
python -m spacy download es_core_news_md
```

```python
import spacy

nlp = spacy.load("en_core_web_sm")    # Load English small model
nlp = spacy.load("es_core_news_sm")   # Load Spanish small model without Word2Vec
nlp = spacy.load("es_core_news_md")   # Load Spanish medium model with Word2Vec

text = nlp("Hola, me llamo Javi")     # Text from string
text = nlp(open("file.txt").read())   # Text from file

spacy.displacy.render(text, style='ent', jupyter=True)  # Display text entities
spacy.displacy.render(text, style='dep', jupyter=True)  # Display word dependencies
```

es_core_news_md has 534k keys and 20k unique vectors (50 dimensions):

```python
coche = nlp("coche")
moto = nlp("moto")
print(coche.similarity(moto))  # Similarity based on cosine distance
coche[0].vector                # Show the word vector
```
🤗 means availability (pretrained PyTorch implementation) in the pytorch-transformers package developed by Hugging Face.
| Model | Creator | Date | Brief description | Data | 🤗 |
|---|---|---|---|---|---|
| 1st Transformer | Google | Jun. 2017 | Encoder & decoder transformer with attention | | |
| ULMFiT | Fast.ai | Jan. 2018 | Regular LSTM | | |
| ELMo | AllenNLP | Feb. 2018 | Bidirectional LSTM | | |
| GPT | OpenAI | Jun. 2018 | Transformer on LM | | ✔ |
| BERT | Google | Oct. 2018 | Transformer on MLM (& NSP) | 16GB | ✔ |
| Transformer-XL | Google/CMU | Jan. 2019 | | | ✔ |
| XLM/mBERT | Facebook | Jan. 2019 | Multilingual LM | | ✔ |
| Transf. ELMo | AllenNLP | Jan. 2019 | | | |
| GPT-2 | OpenAI | Feb. 2019 | Good text generation | | ✔ |
| ERNIE | Baidu research | Apr. 2019 | | | |
| XLNet | Google/CMU | Jun. 2019 | BERT + Transformer-XL | 130GB | ✔ |
| RoBERTa | Facebook | Jul. 2019 | BERT without NSP | 160GB | ✔ |
| MegatronLM | Nvidia | Aug. 2019 | Big models with parallel training | | |
| DistilBERT | Hugging Face | Aug. 2019 | Compressed BERT | 16GB | ✔ |
| MiniBERT | Google | Aug. 2019 | Compressed BERT | | |
| ALBERT | Google | Sep. 2019 | Parameter reduction on BERT | | |

Pretrained models: https://huggingface.co/pytorch-transformers/pretrained_models.html
| Model | 2L | 3L | 6L | 12L | 18L | 24L | 36L | 48L | 54L | 72L |
|---|---|---|---|---|---|---|---|---|---|---|
| 1st Transformer | | | yes | | | | | | | |
| ULMFiT | | yes | | | | | | | | |
| ELMo | yes | | | | | | | | | |
| GPT | | | | 110M | | | | | | |
| BERT | | | | 110M | | 340M | | | | |
| Transformer-XL | | | | | 257M | | | | | |
| XLM/mBERT | | | Yes | Yes | | | | | | |
| Transf. ELMo | | | | | | | | | | |
| GPT-2 | | | | 117M | | 345M | 762M | 1542M | | |
| ERNIE | | | | Yes | | | | | | |
| XLNet | | | | 110M | | 340M | | | | |
| RoBERTa | | | | 125M | | 355M | | | | |
| MegatronLM | | | | | | 355M | | | 2500M | 8300M |
| DistilBERT | | | 66M | | | | | | | |
| MiniBERT | | Yes | | | | | | | | |
Attention: (Aug. 2015)
1st Transformer: (Google AI, Jun. 2017)
Introduces the transformer architecture: encoder with self-attention and decoder with attention.
Surpassed the RNN state of the art.
Paper: Attention Is All You Need
Blog.
ULMFiT: (Fast.ai, Jan. 2018)
Regular LSTM encoder-decoder architecture with no attention.
Introduces the idea of transfer learning in NLP:
Take a trained language model: predict which word comes next. Trained for example on a Wikipedia corpus (WikiText-103).
Retrain it with your corpus data.
Train your task (classification, etc.).
Paper: Universal Language Model Fine-tuning for Text Classification
ELMo: (AllenNLP, Feb. 2018)
Context-aware embeddings = better representations. Useful for synonyms.
Made with bidirectional LSTMs trained on a language modeling (LM) objective.
Parameters: 94 million
Paper: Deep contextualized word representations
Site.
GPT: (OpenAI, Jun. 2018)
Made with a transformer trained on a language modeling (LM) objective.
Same as the transformer, but with transfer learning for other NLP tasks.
First train the decoder for language modeling on unsupervised text, then train it on the other NLP task.
Parameters: 110 million
Paper: Improving Language Understanding by Generative Pre-Training
Site, code.
BERT: (Google AI, Oct. 2018)
Transformer-XL: (Google/CMU, Jan. 2019)
XLM/mBERT: (Facebook, Jan. 2019)
Transformer ELMo: (AllenNLP, Jan. 2019)
GPT-2: (OpenAI, Feb. 2019)
ERNIE: (Baidu research, Apr. 2019)
XLNet: (Google/CMU, Jun. 2019)
RoBERTa: (Facebook, Jul. 2019)
MegatronLM: (Nvidia, Aug. 2019)
Too big.
Parameters: 8300 million
DistilBERT: (Hugging Face, Aug. 2019)
Compression of BERT with knowledge distillation (teacher-student learning).
A small model (DistilBERT) is trained to reproduce the output of a larger model (BERT).
Comparable results to BERT using fewer parameters.
Parameters: 66 million
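A minimal sketch of the distillation loss idea (random logits stand in for the teacher's and student's outputs; the temperature and weighting are illustrative):

```python
# Teacher-student learning: the student is trained to match the teacher's softened outputs
import torch
import torch.nn.functional as F

teacher_logits = torch.randn(8, 30000)                     # e.g. BERT's outputs over the vocabulary
student_logits = torch.randn(8, 30000, requires_grad=True) # e.g. DistilBERT's outputs
T = 2.0                                                    # temperature softens both distributions

soft_targets = F.softmax(teacher_logits / T, dim=-1)
log_student = F.log_softmax(student_logits / T, dim=-1)
distill_loss = F.kl_div(log_student, soft_targets, reduction="batchmean") * T ** 2

distill_loss.backward()   # gradients push the student toward the teacher's behavior
print(distill_loss.item())
```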
What is NLP? ✔
Topic Modeling with SVD & NMF
Topic Modeling & SVD revisited
Sentiment Classification with Naive Bayes
Sentiment Classification with Naive Bayes & Logistic Regression, contd.
Derivation of Naive Bayes & Numerical Stability
Revisiting Naive Bayes, and Regex
Intro to Language Modeling
Transfer learning
ULMFit for non-English Languages
Understanding RNNs
Seq2Seq Translation
Word embeddings quantify 100 years of gender & ethnic stereotypes
Text generation algorithms
Implementing a GRU
Algorithmic Bias
Introduction to the Transformer ✔
The Transformer for language translation ✔
What you need to know about Disinformation