gpt-sw3-tokenizer

Train, evaluate and analyze BPE tokenizers.


Resources

  • Source code: https://github.com/flxst/gpt-sw3-tokenizer
  • Paper: Training and Evaluation of a Multilingual Tokenizer for GPT-SW3 (http://arxiv.org/abs/2304.14780)

Installation

git clone https://github.com/flxst/gpt-sw3-tokenizer.git
cd gpt-sw3-tokenizer
pip install -r requirements.txt

About

This repository provides easy-to-use tools to sample (weighted) data and subsequently train, evaluate and analyze a tokenizer.

  [Pipeline: Sampling → Training → Evaluation → Analysis]

Features

  Sampling

  • customizable amount of (disjoint) sampled data for training and evaluation
  • weighting of different categories and languages (see the sketch below)
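
  A minimal sketch of weighted sampling, assuming a corpus of documents with
  hypothetical "category" and "language" fields and an illustrative weight
  table (the repository's actual configuration format may differ):

  import random

  # Hypothetical (category, language) -> weight map, values in [0, 1].
  WEIGHTS = {
      ("books", "sv"): 1.0,
      ("web", "en"): 0.2,
  }

  def sample_documents(corpus, weights=WEIGHTS, seed=42):
      """Keep each document with probability equal to its weight."""
      rng = random.Random(seed)
      for doc in corpus:  # doc: {"text": ..., "category": ..., "language": ...}
          if rng.random() < weights.get((doc["category"], doc["language"]), 0.0):
              yield doc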

  Training

  • support for SentencePiece and HuggingFace (see the sketch below)
  • customizable tokenizer features (vocabulary size, handling of whitespace and numbers, etc.)
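
  A rough sketch of training a BPE tokenizer with either backend; the file
  paths, vocabulary size, and option choices are illustrative assumptions,
  not the repository's defaults:

  import sentencepiece as spm
  from tokenizers import Tokenizer, models, pre_tokenizers, trainers

  # SentencePiece backend: BPE with explicit digit and whitespace handling.
  spm.SentencePieceTrainer.train(
      input="data/train.txt",          # one sentence per line (illustrative)
      model_prefix="sp_tokenizer",
      model_type="bpe",
      vocab_size=64000,
      split_digits=True,               # split numbers into single digits
      remove_extra_whitespaces=False,  # keep whitespace as-is
  )

  # HuggingFace backend: byte-level BPE trained on the same file.
  hf_tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
  hf_tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
  hf_tokenizer.train(
      files=["data/train.txt"],
      trainer=trainers.BpeTrainer(vocab_size=64000, special_tokens=["<unk>"]),
  )
  hf_tokenizer.save("hf_tokenizer.json")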

  Evaluation

  • computation of common tokenizer metrics (unknown rate, fertility, proportion of continued words, etc.); see the sketch below
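
  A minimal sketch of these metrics, assuming a HuggingFace tokenizers.Tokenizer
  and whitespace-delimited words (the repository's exact definitions may differ):

  def tokenizer_metrics(tokenizer, texts, unk_token="<unk>"):
      """Fertility, proportion of continued words, and unknown rate."""
      n_words = n_tokens = n_continued = n_unk = 0
      for text in texts:
          for word in text.split():
              tokens = tokenizer.encode(word).tokens
              n_words += 1
              n_tokens += len(tokens)
              n_continued += len(tokens) > 1   # word split into 2+ tokens
              n_unk += tokens.count(unk_token)
      return {
          "fertility": n_tokens / n_words,     # average tokens per word
          "proportion_continued_words": n_continued / n_words,
          "unknown_rate": n_unk / n_tokens,
      }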

  Analysis

  • example tokenization
  • vocabulary overlap and performance comparison across languages (see the sketch below)
  • effect of the vocabulary size
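
  Vocabulary overlap between two trained tokenizers can, for instance, be
  measured as a Jaccard index over their vocabularies; a sketch assuming
  HuggingFace tokenizer files with illustrative paths:

  from tokenizers import Tokenizer

  def vocab_overlap(path_a, path_b):
      """Jaccard overlap between two tokenizer vocabularies."""
      vocab_a = set(Tokenizer.from_file(path_a).get_vocab())
      vocab_b = set(Tokenizer.from_file(path_b).get_vocab())
      return len(vocab_a & vocab_b) / len(vocab_a | vocab_b)

  print(vocab_overlap("tokenizer_sv.json", "tokenizer_en.json"))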

Citation

@misc{gpt-sw3-tokenizer,
  title = {Training and Evaluation of a Multilingual Tokenizer for {GPT}-{SW3}},
  url = {http://arxiv.org/abs/2304.14780},
  author = {Stollenwerk, Felix},
  year = {2023},
}