Train, evaluate and analyze BPE tokenizers.
- source code: https://github.com/flxst/gpt-sw3-tokenizer
- documentation: https://flxst.github.io/gpt-sw3-tokenizer
- paper: https://arxiv.org/abs/2304.14780
Install from source:

```bash
git clone https://github.com/flxst/gpt-sw3-tokenizer.git
cd gpt-sw3-tokenizer
pip install -r requirements.txt
```
This repository provides easy-to-use tools to sample (weighted) data and subsequently train, evaluate and analyze a tokenizer.
The workflow consists of four steps: Sampling, Training, Evaluation and Analysis.

Sampling
- customizable amount of disjoint sampled data for training and evaluation
- weighting of different categories and languages (illustrated in the sketch below)
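The repository's actual sampling interface is configured elsewhere; purely for illustration, here is a minimal sketch of weighted sampling across languages, where the corpora, weights and seed are hypothetical:

```python
# Hedged sketch of weighted sampling across languages (NOT the
# repository's actual API): each language gets a weight, and documents
# are drawn in proportion to it. All names and values are illustrative.
import random

corpora = {
    "en": ["english doc 1", "english doc 2", "english doc 3"],
    "sv": ["svenskt dokument 1", "svenskt dokument 2"],
}
weights = {"en": 0.6, "sv": 0.4}  # hypothetical per-language weights

def sample_documents(n, seed=42):
    rng = random.Random(seed)  # fixed seed keeps samples reproducible
    langs = list(corpora)
    probs = [weights[lang] for lang in langs]
    for _ in range(n):
        lang = rng.choices(langs, weights=probs, k=1)[0]
        yield lang, rng.choice(corpora[lang])

for lang, doc in sample_documents(4):
    print(lang, doc)
```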
Training
- support for SentencePiece and HuggingFace (see the sketch below)
- customizable tokenizer features (vocabulary size, handling of whitespace and numbers, ...)
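To make the two supported backends concrete, here is a minimal sketch of training a BPE tokenizer with each of them through their public APIs; the corpus file `train.txt` and the vocabulary size are placeholders, and the repository's own training scripts may wire these up differently:

```python
# Train a BPE tokenizer with SentencePiece and with HuggingFace
# `tokenizers`; corpus file and vocabulary size are placeholders.
import sentencepiece as spm
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# SentencePiece: writes tok.model and tok.vocab
spm.SentencePieceTrainer.train(
    input="train.txt", model_prefix="tok", vocab_size=64000, model_type="bpe"
)

# HuggingFace tokenizers: writes tokenizer.json
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=64000, special_tokens=["[UNK]"])
tokenizer.train(files=["train.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")
```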
Evaluation
- computation of common tokenizer metrics (unknown rate, fertility, proportion of continued words, ...), sketched below
- example tokenization
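Under common definitions (fertility: average number of tokens per word; unknown rate: share of unknown tokens; proportion of continued words: share of words split into more than one token), these metrics can be computed as in the following sketch. Here `encode` is a stand-in for any tokenizer returning a list of token strings, and the exact definitions used by the repository may differ:

```python
# Word-level tokenizer metrics over whitespace-separated words.
# `encode` maps a word to a list of tokens; the definitions are the
# common ones and may differ from the repository's exact choices.
def metrics(text, encode, unk_token="[UNK]"):
    words = text.split()
    token_lists = [encode(word) for word in words]
    tokens = [t for ts in token_lists for t in ts]
    return {
        "fertility": len(tokens) / len(words),              # avg tokens per word
        "unk_rate": tokens.count(unk_token) / len(tokens),  # share of unknown tokens
        "continued_words": sum(len(ts) > 1 for ts in token_lists) / len(words),
    }

# Toy usage: a fake tokenizer that splits words into 3-character chunks.
chunk = lambda word: [word[i:i + 3] for i in range(0, len(word), 3)]
print(metrics("tokenization is fun", chunk))
# {'fertility': 2.0, 'unk_rate': 0.0, 'continued_words': 0.333...}
```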
Analysis
- vocabulary overlap and performance comparison across languages (see the sketch below)
- effect of the vocabulary size
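As an illustration of the vocabulary overlap analysis, the following sketch compares two tokenizers stored in the HuggingFace JSON format; the file names are hypothetical, and the repository may define overlap differently (e.g. relative to each vocabulary's size rather than the union):

```python
# Jaccard overlap between the vocabularies of two tokenizers
# (file names are hypothetical placeholders).
from tokenizers import Tokenizer

tok_a = Tokenizer.from_file("tokenizer_en.json")
tok_b = Tokenizer.from_file("tokenizer_sv.json")

vocab_a = set(tok_a.get_vocab())
vocab_b = set(tok_b.get_vocab())

overlap = len(vocab_a & vocab_b) / len(vocab_a | vocab_b)
print(f"vocabulary overlap: {overlap:.2%}")
```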
If you find this repository useful, please cite the paper:

```bibtex
@misc{gpt-sw3-tokenizer,
    title  = {Training and Evaluation of a Multilingual Tokenizer for {GPT}-{SW3}},
    author = {Stollenwerk, Felix},
    url    = {https://arxiv.org/abs/2304.14780},
    year   = {2023},
}
```