huggingface/tokenizers
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
Rust · Apache-2.0
Issues
When compiling with the tch-rs library, static libcpmt.lib and dynamic msvcprt.lib conflict at link time
#1454 opened · 7 comments
Accessing Tokenizer's model_max_length config
#1451 opened · 5 comments
bugs
#1450 opened · 6 comments
Building a tokenizer for tokenizing Java code
#1446 opened · 2 comments
/
#1439 opened · 2 comments
tokenizer.train_new_from_iterator() takes a long time
#1434 opened · 1 comment
Python 3.12 build for Windows is not available
#1429 opened · 4 comments
Loading `added_tokens.json`
#1422 opened · 5 comments
Memory leak in the encode_batch function
#1421 opened · 2 comments
Unsupported platform for tokenizers
#1418 opened · 2 comments
Questions re: Tokenizer pipeline composability
#1417 opened · 5 comments
Support PyArrow arrays as tokenizer input
#1415 opened · 2 comments
Performance of the tokenizer for the CLIP text model
#1412 opened · 2 comments
How to create tokenizer.json?
#1410 opened · 8 comments
How to add byte_fallback tokens?
#1407 opened · 7 comments
Rust tokenizer fails!
#1398 opened · 1 comment
Unable to install on Python 3.12 via pip
#1393 opened · 8 comments
How to split a special token in encode?
#1391 opened · 2 comments
Is there a JavaScript version of tokenizers?
#1387 opened · 5 comments
Add tokens not impacted by training
#1380 opened · 4 comments
RobertaTokenizer: tokenizer.decode and tokenizer.tokenize do not generate the same output
#1376 opened · 4 comments
add_tokens has no effect in the Llama fast tokenizer
#1374 opened · 7 comments
end_of_word_suffix = "</w>" does not work?
#1372 opened
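Several of the entries above concern the Python bindings' training and batch-encoding APIs (#1434, #1421, #1410). A minimal sketch using the `tokenizers` package, training a tiny BPE model from an in-memory iterator; the corpus and vocabulary size are illustrative assumptions, and note that `train_new_from_iterator` (#1434) is the `transformers`-side wrapper around the library-level `train_from_iterator` shown here:

```python
import os
import tempfile

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Build a BPE tokenizer with a whitespace pre-tokenizer.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Train directly from any iterator of strings (illustrative corpus).
trainer = BpeTrainer(special_tokens=["[UNK]"], vocab_size=200)
corpus = ["hello world", "hello tokenizers", "fast tokenizers in rust"]
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Batch-encode; returns a list of Encoding objects.
encodings = tokenizer.encode_batch(["hello world", "rust"])
print([e.tokens for e in encodings])

# Saving produces the single-file tokenizer.json format asked about in #1410.
path = os.path.join(tempfile.mkdtemp(), "tokenizer.json")
tokenizer.save(path)
```

The saved `tokenizer.json` can later be reloaded with `Tokenizer.from_file(path)`.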