huggingface/tokenizers
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
Rust · Apache-2.0
Issues
- Strange warnings with tokenizer for some models (#1528, opened by EricLBuehler, 2 comments)
- How to Batch-Encode Paired Input Sentences with Tokenizers: Seeking Clarification (#1531, opened by insookim43, 3 comments)
- Extended vocab tokenizer merging text into a single string without spaces while decoding (#1501, opened by savanth14, 4 comments)
- Train tokenizer on integer lists, not strings (#1471, opened by rteehas, 0 comments)
- Bug with `CodeQwen1.5`: `data did not match any variant of untagged enum PyPreTokenizerTypeWrapper` (#1529, opened by QwertyJack, 3 comments)
- error: casting `&T` to `&mut T` is undefined behavior (#1485, opened by Jipok, 1 comment)
- Deepseeker model completely loses performance after using tokenizer.add_tokens(special_tokens) (#1490, opened by bin123apple, 1 comment)
- Special token handling breaks idempotency of sentencepiece due to extra spaces (#1527, opened by cat-state, 0 comments)
- How to write custom Wordpiece class? (#1525, opened by xinyinan9527, 0 comments)
- Convert huggingface tokenizer into sentencepiece format (#1524, opened by RRaphaell, 4 comments)
- Loading `tokenizer.model` with Rust API (#1518, opened by EricLBuehler, 3 comments)
- Breaking changes in v0.19.1 for tiktoken/llama3 (#1512, opened by sanderland, 0 comments)
- ❓ Get stats (e.g. counts) about the merged pairs (#1523, opened by pietrolesci, 1 comment)
- BPE Trainer doesn't respect the `vocab_size` parameter when dataset size is increased (#1514, opened by Abhinay1997, 2 comments)
- Discrepancy Between GitHub Release and NPM Package Version & Missing Dependencies (#1489, opened by superBertBerg, 0 comments)
- tokenizers-linux-x64-musl is not found when running inside node alpine docker (#1480, opened by madhurjya-acko, 3 comments)
- BPE Decoder cleanup option (#1474, opened by w-zygmuntowicz, 2 comments)
- BpeTrainer seems to ignore max_token_length=1 (#1461, opened by geajack, 1 comment)
- Llama3 tokenizer with incorrect offset_mapping (#1517, opened by justin-shao, 5 comments)
- Why are 'unknown' tokens randomly added to my tokenized input? (#1520, opened by tshmak, 1 comment)
- Why is the tokenizer slower than tiktoken? (#1519, opened by BigBinnie, 3 comments)
- Issue merging across whitespaces (#1475, opened by henrycharlesworth, 0 comments)
- Tokens Removed from Trained Custom BPE Tokenizer (#1516, opened by rteehas, 0 comments)
- UnigramTrainer: byte_fallback is false (#1515, opened by Moddus, 0 comments)
- Cross-compilation fails for custom target (#1509, opened by semaraugusto, 3 comments)
- New update causes add_special_tokens not recognized (#1466, opened by sravell, 2 comments)
- `cargo build` fails for python bindings when `--locked` is passed for `v0.15.1` and `v0.15.2` (#1477, opened by CobaltCause, 0 comments)
- Treatment of hyphenated words (#1507, opened by rattle99, 1 comment)
- Failing to build bindings with 0.19.1 (#1505, opened by bryteise, 1 comment)
- Unsound use of unsafe in `src/utils/parallelism.rs` (#1491, opened by albertsgarde, 1 comment)
- Tokens display issues (#1470, opened by jordane95, 1 comment)
- Offline installation (#1502, opened by HankLiu10, 0 comments)
- Potential vulnerability: Control token injection through Jinja templates in apply_chat_template (#1458, opened by pluiez, 3 comments)
- Training a tokenizer with limited memory (#1460, opened by arxyzan, 2 comments)
- Tokenizer dataset is very slow (#1464, opened by ManuSinghYadav, 0 comments)
- StripAccents doesn't work (#1496, opened by NivinaNull, 1 comment)
- LLamaTokenizer with `use_fast=True` / `use_fast=False` causing memory leak when used with multiprocessing / `dataset.map(num_proc)` (#1495, opened by michaelfeil, 2 comments)
- Support operating computer system (#1457, opened by Southpika, 1 comment)