# Subword Tokenizers

This repo explores different subword tokenization algorithms.

## Subword tokenizers
| Algorithm | Base unit | Implementations | Paper |
|---|---|---|---|
| Byte-pair encoding (BPE) | Unicode characters | original implementation, fastBPE, SentencePiece repo | Neural Machine Translation of Rare Words with Subword Units |
| Byte-level BPE | Bytes | HuggingFace repo, GPT2 repo | Language Models are Unsupervised Multitask Learners (GPT2) |
| WordPiece | Unicode characters | BERT repo | Google's Neural Machine Translation System |
| Unigram language model | Unicode characters | SentencePiece repo | Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates |
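To make the first row concrete, here is a minimal sketch of BPE merge learning in the style of Sennrich et al. (2016): words are split into characters with an end-of-word marker, and the most frequent adjacent pair is merged repeatedly. Function names and the toy corpus are illustrative, not from any of the implementations listed above.

```python
import re
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Merge every standalone occurrence of the pair into a single symbol."""
    # Lookarounds keep the match aligned to whole symbols, not substrings.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    joined = "".join(pair)
    return {pattern.sub(joined, word): freq for word, freq in words.items()}

def learn_bpe(corpus, num_merges):
    """Learn BPE merge rules from a {word: frequency} corpus.

    Words are represented as space-separated characters plus an
    end-of-word marker </w>."""
    words = {" ".join(word) + " </w>": freq for word, freq in corpus.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        words = merge_pair(best, words)
        merges.append(best)
    return merges
```

On the classic toy corpus `{"low": 5, "lower": 2, "newest": 6, "widest": 3}`, the first merges build up the frequent suffix `est</w>` one pair at a time.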
## Large pretrained language models and their tokenizers
| Model | Repo | Tokenizer |
|---|---|---|
| BERT (Google) | GitHub link | WordPiece |
| GPT2 (OpenAI) | GitHub link | byte-level BPE |
| RoBERTa (Facebook) | GitHub link | byte-level BPE |
| Transformer-XL (CMU) | GitHub link | words |
| XLM (Facebook) | GitHub link | BPE |
| XLNet (CMU) | GitHub link | BPE (from SentencePiece) |
| CTRL (Salesforce) | GitHub link | BPE (from fastBPE) |
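BERT's WordPiece entry above refers to a greedy longest-match-first segmenter at inference time: each word is covered by the longest vocabulary piece starting at the current position, with continuation pieces prefixed by `##`. A minimal sketch, with a hypothetical toy vocabulary:

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation in the style of BERT's WordPiece.

    Pieces that continue a word carry a '##' prefix. Returns [unk] if the
    word cannot be fully covered by the vocabulary."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]
        tokens.append(piece)
        start = end
    return tokens
```

For example, with the toy vocabulary `{"un", "##aff", "##able"}`, the word `unaffable` segments into `["un", "##aff", "##able"]`.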