
A comprehensive deep dive into the world of tokens


Everything About Tokenization

Tokenization is an oft-neglected part of natural language processing. With the recent surge of interest in language models, it's worth stepping back and really getting into the guts of what tokenization is. This repo is meant to serve as a deep dive into different aspects of tokenization. It's organized as bite-size chapters for easy navigation, with some code samples and (poorly designed) walkthrough notebooks. This is NOT meant to be a complete reference in itself; it's meant to accompany other excellent resources like HuggingFace's NLP course. The following topics are covered:

  1. Intro: A quick introduction on tokens and the different tokenization algorithms out there.
  2. BPE: A closer look at the Byte-Pair Encoding tokenization algorithm. We'll also go over a minimal implementation for training a BPE model (a sketch of the core training loop appears after this list).
  3. 🤗 Tokenizer: The internals of HuggingFace tokenizers! We look at state (what a tokenizer saves), data structures (how it stores what it saves), and methods (what functionality you get). We also implement a minimal (<200 lines) version of the 🤗 Tokenizer in Python for GPT-2 (a quick taste of the public API also appears below).
  4. Challenges with Tokenization: Problems with integer tokenization and with tokenizing non-English languages, and what it takes to go multilingual, with a focus on the recent No Language Left Behind (NLLB) effort from Meta.
  5. Puzzles: Some simple puzzles to get you thinking about pre-tokenization, vocabulary size, etc.
  6. PostProcessing and more: A look at special tokens and postprocessing, glitch tokens, and why you might want to shrink your tokenizer.
  7. Galactica: Thinking about tokenizer design by diving into the Galactica paper.
  8. Chat templates: Some tokenization tips and tricks while dealing with chat-templating for chat models.
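
To give a flavor of what the BPE chapter builds up to, here's a minimal sketch of the training loop: repeatedly count adjacent symbol pairs across the corpus and merge the most frequent one. The toy corpus and the number of merges are made up for illustration; the chapter's actual implementation differs in the details.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the (word -> frequency) map."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: word frequencies, each word split into characters.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(w): f for w, f in corpus.items()}

merges = []
for _ in range(10):  # learn 10 merges
    pairs = get_pair_counts(words)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    merges.append(best)
    words = merge_pair(words, best)

print(merges)  # the learned merge rules, in order
```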
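
And here's a quick taste of the things the 🤗 Tokenizer chapter pokes at, using the GPT-2 tokenizer: its state (vocabulary size, special tokens) and its core methods (the string → tokens → ids → string round trip). The values in the comments are what I'd expect for GPT-2, shown for orientation.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

# State: what the tokenizer saves.
print(tok.vocab_size)          # 50257 for GPT-2
print(tok.special_tokens_map)  # GPT-2 reuses <|endoftext|> for bos/eos/unk

# Methods: the string -> tokens -> ids -> string round trip.
text = "Tokenization is fun!"
tokens = tok.tokenize(text)               # something like ['Token', 'ization', 'Ġis', ...]
ids = tok.convert_tokens_to_ids(tokens)
print(tokens, ids)
print(tok.decode(ids))                    # back to the original string
```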

Requirements

To run the notebooks in the repo, you only need two libraries: transformers and tiktoken:

pip install transformers tiktoken

Code has been tested with transformers==4.35.0 and tiktoken==0.5.1.
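
As a quick sanity check after installing, you can tokenize the same string with both libraries' GPT-2 encodings; these generally agree on plain ASCII text. This is just a sketch and assumes network access to download the GPT-2 tokenizer files on first run.

```python
import transformers
import tiktoken

hf = transformers.AutoTokenizer.from_pretrained("gpt2")
tt = tiktoken.get_encoding("gpt2")

text = "A comprehensive deep dive into the world of tokens"
print(hf.encode(text))  # HuggingFace GPT-2 token ids
print(tt.encode(text))  # tiktoken GPT-2 token ids; should match the above
```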

Recommended Prerequisites

A basic understanding of language models and tokenization is a must.

Contributing

If you notice any mistake/bug, or feel you could make an improvement to any section of the repo, please open an issue or make a PR 🙏