/minbpe_rs

Rust implementation for Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.

Primary LanguageRust

Tokenizer

A Rust implementation of minbpe.

Introduction

Tokenizer is a tool that breaks down text into smaller units called tokens. It is commonly used in natural language processing tasks such as text classification, sentiment analysis, and machine translation. This tokenizer is inspired by the minbpe project.

To Test

    cargo run -- aaabdaaabac

Expected Result

    Encoded: [258, 100, 258, 97, 99]
    Decoded: "aaabdaaabac"