A training exercise for learning C++.
It implements training of the Byte Pair Encoding (BPE) algorithm.
BPE compresses data by repeatedly replacing the most frequent pair of bytes (or tokens) with a new token. It is widely used in LLMs such as GPT-2 for tokenizing UTF-8 text.
Credit to Karpathy's explanation of the algorithm.
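
Below is a minimal sketch of the core training loop, not this repo's actual code: tokens start as raw byte ids (0-255), and each step counts adjacent pairs, picks the most frequent one, and merges it into a fresh id. The function names (`count_pairs`, `merge`) and the starting id 256 are illustrative assumptions.

```cpp
#include <algorithm>
#include <iostream>
#include <map>
#include <utility>
#include <vector>

using Pair = std::pair<int, int>;

// Count how often each adjacent pair of token ids occurs.
std::map<Pair, int> count_pairs(const std::vector<int>& tokens) {
    std::map<Pair, int> counts;
    for (size_t i = 0; i + 1 < tokens.size(); ++i)
        ++counts[{tokens[i], tokens[i + 1]}];
    return counts;
}

// Replace every occurrence of `pair` with the new token id `new_id`.
std::vector<int> merge(const std::vector<int>& tokens, Pair pair, int new_id) {
    std::vector<int> out;
    for (size_t i = 0; i < tokens.size(); ++i) {
        if (i + 1 < tokens.size() && tokens[i] == pair.first && tokens[i + 1] == pair.second) {
            out.push_back(new_id);
            ++i;  // skip the second element of the merged pair
        } else {
            out.push_back(tokens[i]);
        }
    }
    return out;
}

int main() {
    // Example: run a couple of merges over a short byte sequence.
    std::vector<int> tokens = {'a', 'a', 'b', 'a', 'a', 'b', 'c'};
    int next_id = 256;  // first id beyond the raw byte range (assumption)
    for (int step = 0; step < 2; ++step) {
        auto counts = count_pairs(tokens);
        if (counts.empty()) break;
        // Pick the most frequent pair and replace it with a new token id.
        auto best = std::max_element(counts.begin(), counts.end(),
            [](const auto& a, const auto& b) { return a.second < b.second; });
        tokens = merge(tokens, best->first, next_id++);
    }
    for (int t : tokens) std::cout << t << ' ';
    std::cout << '\n';
}
```

Real BPE training repeats this loop until a target vocabulary size is reached and records the merge order, so the same merges can later be replayed to encode new text.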