# Subword Tokenizers

This repo explores the main subword tokenization algorithms.

## Subword tokenizers

| Algorithm | Base unit | Implementations | Paper |
|---|---|---|---|
| Byte-pair encoding (BPE) | Unicode code point | original implementation, fastBPE, SentencePiece repo | Neural Machine Translation of Rare Words with Subword Units |
| Byte-level BPE | byte | HuggingFace repo, GPT-2 repo | Language Models are Unsupervised Multitask Learners (GPT-2) |
| WordPiece | Unicode code point | BERT repo | Google's Neural Machine Translation System |
| Unigram Language Model | Unicode code point | SentencePiece repo | Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates |
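The BPE training loop from the Sennrich et al. paper above can be sketched as follows. This is a simplified illustration, not the original implementation: names are mine, and the end-of-word marker (`</w>`) used in the paper is omitted for brevity.

```python
# Minimal BPE merge-learning sketch (illustrative, not the reference code).
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """word_freqs: dict mapping word -> frequency; returns learned merges."""
    # Start with each word split into single characters.
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for syms, freq in vocab.items():
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Replace every occurrence of the best pair with its concatenation.
        new_vocab = {}
        for syms, freq in vocab.items():
            out, i = [], 0
            while i < len(syms):
                if i < len(syms) - 1 and (syms[i], syms[i + 1]) == best:
                    out.append(syms[i] + syms[i + 1])
                    i += 2
                else:
                    out.append(syms[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 4)
```

On this toy corpus the first merges are `e`+`s`, then `es`+`t`, producing the subword `est` shared by "newest" and "widest".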


## Large pretrained language models and their tokenizers

| Model | Repo | Tokenizer |
|---|---|---|
| BERT (Google) | GitHub link | WordPiece |
| GPT-2 (OpenAI) | GitHub link | byte-level BPE |
| RoBERTa (Facebook) | GitHub link | byte-level BPE |
| Transformer-XL (CMU) | GitHub link | words |
| XLM (Facebook) | GitHub link | BPE |
| XLNet (CMU) | GitHub link | BPE (from SentencePiece) |
| CTRL (Salesforce) | GitHub link | BPE (from fastBPE) |
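The "base unit" distinction between code-point BPE (BERT, XLM) and byte-level BPE (GPT-2, RoBERTa) is easy to see in Python: a byte-level tokenizer starts from the UTF-8 bytes of the text, so any input is covered by a fixed 256-symbol base alphabet, at the cost of splitting non-ASCII characters into several units.

```python
# Base-unit comparison: Unicode code points vs UTF-8 bytes.
s = "café"

code_points = list(s)                  # one unit per Unicode character
byte_units = list(s.encode("utf-8"))   # one unit per UTF-8 byte; 'é' -> 2 bytes

print(code_points)  # 4 units
print(byte_units)   # 5 units
```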