huggingface/tokenizers
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
Rust · Apache-2.0
Issues
When compiling with the tch-rs library, static libcpmt.lib and dynamic msvcprt.lib conflict at link time
#1454 opened · 7 comments
Accessing Tokenizer's model_max_length config
#1451 opened · 5 comments
bugs
#1450 opened · 6 comments
Building a tokenizer for tokenizing Java code
#1446 opened · 2 comments
/
#1439 opened · 2 comments
tokenizer.train_new_from_iterator() takes a long time
#1434 opened · 1 comment
Python 3.12 build for Windows is not available
#1429 opened · 4 comments
Loading `added_tokens.json`
#1422 opened · 5 comments
Memory leak in the encode_batch function
#1421 opened · 2 comments
Unsupported platform for tokenizers
#1418 opened · 2 comments
Questions re: Tokenizer pipeline composability
#1417 opened · 5 comments
Support PyArrow arrays as tokenizer input
#1415 opened · 2 comments
Performance of the tokenizer for the CLIP text model
#1412 opened · 2 comments
How to create tokenizer.json?
#1410 opened · 8 comments
How to add byte_fallback tokens?
#1407 opened · 7 comments
Rust tokenizer fails!
#1398 opened · 1 comment
Unable to install on Python 3.12 via pip
#1393 opened · 8 comments
How to split a special token in encode?
#1391 opened · 2 comments
Is there a JavaScript version of tokenizers?
#1387 opened · 5 comments
Add tokens not impacted by training
#1380 opened · 4 comments
RobertaTokenizer: tokenizer.decode and tokenizer.tokenize do not generate the same output
#1376 opened · 4 comments
add_tokens has no effect in the Llama fast tokenizer
#1374 opened · 7 comments
end_of_word_suffix = "</w>" does not work?
#1372 opened
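Several of the entries above concern the Python bindings' training and batch-encoding APIs (#1434, #1421, #1410). A minimal sketch using the `tokenizers` package, training a tiny BPE model from an in-memory iterator; the corpus and vocabulary size are illustrative assumptions, and note that `train_new_from_iterator` (#1434) is the `transformers`-side wrapper around the library-level `train_from_iterator` shown here:

```python
import os
import tempfile

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Build a BPE tokenizer with a whitespace pre-tokenizer.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Train directly from any iterator of strings (illustrative corpus).
trainer = BpeTrainer(special_tokens=["[UNK]"], vocab_size=200)
corpus = ["hello world", "hello tokenizers", "fast tokenizers in rust"]
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Batch-encode; returns a list of Encoding objects.
encodings = tokenizer.encode_batch(["hello world", "rust"])
print([e.tokens for e in encodings])

# Saving produces the single-file tokenizer.json format asked about in #1410.
path = os.path.join(tempfile.mkdtemp(), "tokenizer.json")
tokenizer.save(path)
```

The saved `tokenizer.json` can later be reloaded with `Tokenizer.from_file(path)`.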