guillaume-be/rust-tokenizers
Rust-tokenizer offers high-performance tokenizers for modern language models, including WordPiece, Byte-Pair Encoding (BPE) and Unigram (SentencePiece) models.
Rust · Apache-2.0
Issues
- 0
  Evaluate Profile-Guided Optimization (PGO) performance benefits for the library
  #103 opened by zamazan4ik
- 0
  Port of 'rust-tokenizer' to C#/.NET
  #102 opened by vermorel
- 2
  Slight Error in Readme?
  #96 opened by ToluClassics
- 0
  sentencepiece is not the same
  #89 opened by igor-yusupov
- 5
  Reading SentencePieceVocab from text file
  #51 opened by MikaelCall
- 2
  Issues with clean_up_tokenization() function?
  #16 opened by proycon
- 2
  Tokenize non-breaking space
  #30 opened by sbeckeriv
- 10
  Character offset information
  #14 opened by proycon