# Subword Tokenizers

This repo explores the main subword tokenization algorithms.

## Subword tokenizers

| Algorithm | Base unit | Implementations | Paper |
|---|---|---|---|
| Byte-pair encoding (BPE) | Unicode code point | original implementation, fastBPE, SentencePiece repo | Neural Machine Translation of Rare Words with Subword Units |
| Byte-level BPE | byte | HuggingFace repo, GPT-2 repo | Language Models are Unsupervised Multitask Learners (GPT-2) |
| WordPiece | Unicode code point | BERT repo | Google's Neural Machine Translation System |
| Unigram Language Model | Unicode code point | SentencePiece repo | Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates |
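The BPE training loop from the Sennrich et al. paper above can be sketched as follows. This is a simplified illustration, not the original implementation: names are mine, and the end-of-word marker (`</w>`) used in the paper is omitted for brevity.

```python
# Minimal BPE merge-learning sketch (illustrative, not the reference code).
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """word_freqs: dict mapping word -> frequency; returns learned merges."""
    # Start with each word split into single characters.
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for syms, freq in vocab.items():
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Replace every occurrence of the best pair with its concatenation.
        new_vocab = {}
        for syms, freq in vocab.items():
            out, i = [], 0
            while i < len(syms):
                if i < len(syms) - 1 and (syms[i], syms[i + 1]) == best:
                    out.append(syms[i] + syms[i + 1])
                    i += 2
                else:
                    out.append(syms[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 4)
```

On this toy corpus the first merges are `e`+`s`, then `es`+`t`, producing the subword `est` shared by "newest" and "widest".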


## Large pretrained language models and their tokenizers

| Model | Repo | Tokenizer |
|---|---|---|
| BERT (Google) | GitHub link | WordPiece |
| GPT-2 (OpenAI) | GitHub link | byte-level BPE |
| RoBERTa (Facebook) | GitHub link | byte-level BPE |
| Transformer-XL (CMU) | GitHub link | words |
| XLM (Facebook) | GitHub link | BPE |
| XLNet (CMU) | GitHub link | BPE (from SentencePiece) |
| CTRL (Salesforce) | GitHub link | BPE (from fastBPE) |
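The "base unit" distinction between code-point BPE (BERT, XLM) and byte-level BPE (GPT-2, RoBERTa) is easy to see in Python: a byte-level tokenizer starts from the UTF-8 bytes of the text, so any input is covered by a fixed 256-symbol base alphabet, at the cost of splitting non-ASCII characters into several units.

```python
# Base-unit comparison: Unicode code points vs UTF-8 bytes.
s = "café"

code_points = list(s)                  # one unit per Unicode character
byte_units = list(s.encode("utf-8"))   # one unit per UTF-8 byte; 'é' -> 2 bytes

print(code_points)  # 4 units
print(byte_units)   # 5 units
```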