gpt-sw3-tokenizer

Train, evaluate and analyze BPE tokenizers.


Resources

  • Source code: https://github.com/flxst/gpt-sw3-tokenizer
  • Paper: Training and Evaluation of a Multilingual Tokenizer for GPT-SW3 (http://arxiv.org/abs/2304.14780)

Installation

git clone https://github.com/flxst/gpt-sw3-tokenizer.git
cd gpt-sw3-tokenizer
pip install -r requirements.txt

About

This repository provides easy-to-use tools to sample (weighted) data and subsequently train, evaluate and analyze a tokenizer.

  [Pipeline: Sampling → Training → Evaluation → Analysis]

Features

  Sampling

  • customizable amount of (disjoint) sampled data for training and evaluation
  • weighting of different categories and languages (see the sketch below)
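
  A minimal sketch of weighted sampling, assuming a corpus of documents with
  hypothetical "category" and "language" fields and an illustrative weight
  table (the repository's actual configuration format may differ):

  import random

  # Hypothetical (category, language) -> weight map, values in [0, 1].
  WEIGHTS = {
      ("books", "sv"): 1.0,
      ("web", "en"): 0.2,
  }

  def sample_documents(corpus, weights=WEIGHTS, seed=42):
      """Keep each document with probability equal to its weight."""
      rng = random.Random(seed)
      for doc in corpus:  # doc: {"text": ..., "category": ..., "language": ...}
          if rng.random() < weights.get((doc["category"], doc["language"]), 0.0):
              yield doc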

  Training

  • support for SentencePiece and HuggingFace (see the sketch below)
  • customizable tokenizer features (vocabulary size, handling of whitespace and numbers, etc.)
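
  A rough sketch of training a BPE tokenizer with either backend; the file
  paths, vocabulary size, and option choices are illustrative assumptions,
  not the repository's defaults:

  import sentencepiece as spm
  from tokenizers import Tokenizer, models, pre_tokenizers, trainers

  # SentencePiece backend: BPE with explicit digit and whitespace handling.
  spm.SentencePieceTrainer.train(
      input="data/train.txt",          # one sentence per line (illustrative)
      model_prefix="sp_tokenizer",
      model_type="bpe",
      vocab_size=64000,
      split_digits=True,               # split numbers into single digits
      remove_extra_whitespaces=False,  # keep whitespace as-is
  )

  # HuggingFace backend: byte-level BPE trained on the same file.
  hf_tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
  hf_tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
  hf_tokenizer.train(
      files=["data/train.txt"],
      trainer=trainers.BpeTrainer(vocab_size=64000, special_tokens=["<unk>"]),
  )
  hf_tokenizer.save("hf_tokenizer.json")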

  Evaluation

  • computation of common tokenizer metrics (unknown rate, fertility, proportion of continued words, etc.); see the sketch below
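
  A minimal sketch of these metrics, assuming a HuggingFace tokenizers.Tokenizer
  and whitespace-delimited words (the repository's exact definitions may differ):

  def tokenizer_metrics(tokenizer, texts, unk_token="<unk>"):
      """Fertility, proportion of continued words, and unknown rate."""
      n_words = n_tokens = n_continued = n_unk = 0
      for text in texts:
          for word in text.split():
              tokens = tokenizer.encode(word).tokens
              n_words += 1
              n_tokens += len(tokens)
              n_continued += len(tokens) > 1   # word split into 2+ tokens
              n_unk += tokens.count(unk_token)
      return {
          "fertility": n_tokens / n_words,     # average tokens per word
          "proportion_continued_words": n_continued / n_words,
          "unknown_rate": n_unk / n_tokens,
      }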

  Analysis

  • example tokenization
  • vocabulary overlap and performance comparison across languages (see the sketch below)
  • effect of the vocabulary size
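
  Vocabulary overlap between two trained tokenizers can, for instance, be
  measured as a Jaccard index over their vocabularies; a sketch assuming
  HuggingFace tokenizer files with illustrative paths:

  from tokenizers import Tokenizer

  def vocab_overlap(path_a, path_b):
      """Jaccard overlap between two tokenizer vocabularies."""
      vocab_a = set(Tokenizer.from_file(path_a).get_vocab())
      vocab_b = set(Tokenizer.from_file(path_b).get_vocab())
      return len(vocab_a & vocab_b) / len(vocab_a | vocab_b)

  print(vocab_overlap("tokenizer_sv.json", "tokenizer_en.json"))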

Citation

@misc{gpt-sw3-tokenizer,
  title = {Training and Evaluation of a Multilingual Tokenizer for {GPT}-{SW3}},
  url = {http://arxiv.org/abs/2304.14780},
  author = {Stollenwerk, Felix},
  year = {2023},
}