BPE

Training exercise to learn C++.

This implements training the Byte Pair Encoding algorithm.

This algorithm is used to compress data by replacing the most frequent pair of bytes with a new byte. Often used in LLMs such as GPT-2 for tokenizing UTF-8 text.

Credits to Karpathy's explanation of the algorithm.