huggingface/tokenizers
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
Rust · Apache-2.0
Issues
How in the world can I use a Unigram tokenizer with hf?
#1702 opened by Zirunis - 0
Tokenizer Training Errors: pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: TryFromIntError(())
#1698 opened by Chimaco37 - 0
How to determine the splicing logic in post_processor based on the sentence to be tokenized?
#1696 opened by gongel - 2
NormalizedString.clear() broken?
#1636 opened by lkurlandski - 0
Rust Issue on Unix in 0.21.0 Version
#1694 opened by insculptor - 1
Prebuilding tokenizers for Windows arm
#1684 opened by hariji814 - 6
0.20.4 Partial release?
#1690 opened by tazarov - 1
Alignment of python versions in pypi
#1692 opened by PawelPeczek-Roboflow - 9
Runtime Error in Python 3.8
#1691 opened by msharp9 - 1
Llama-3.2 offset-mapping needs fixing
#1688 opened by kyrawilson - 0
Question: Shrinking Tokenizer Vocabulary for Reduced Memory Consumption with Pre-Trained Model (LLaMA) Fine-Tuning
#1686 opened by Amerehei - 0
Mismatch between slow and fast tokenizer
#1682 opened by KaiLv69 - 2
Option to disable cache for FromPretrained and FromFile
#1680 opened by daulet - 2
Incremental Detokenization
#1666 opened by robertgshaw2-neuralmagic - 3
Error installing the tokenizers library with Python 3.13
#1657 opened by KEYTRON - 3
Support `pip install` directly from GitHub
#1671 opened by jamesbraza - 2
Tokenizers v0.20.2 fails on batches as tuples
#1672 opened by OyvindTafjord - 12
Python 3.13 support
#1639 opened by iherasymenko - 8
Reduce vocab size for BPE tokenizer
#1668 opened by fzyzcjy - 1
Inconsistent behaviour of `PreTrainedTokenizerFast`s on diacritics marked texts
#1663 opened by sven-nm - 1
docs-check.yml `uses node12 which is deprecated`
#1658 opened by hamirmahal - 1
Disable pretty-print when saving tokenizer.json files
#1656 opened by xenova - 2
Allow users to select/write encoding strategies
#1655 opened by pietrolesci - 0
Serializing k-mer style pre-tokenizer
#1654 opened by millanp95 - 3
Leaving spaces at the beginning of next tokens?
#1650 opened by speedcell4 - 7
Gradients in Data Collator lead to Memory Leak
#1649 opened by AhmadHAW - 6
tokenizer is not adding `bos_token` or `eos_token` when tokenizing text
#1643 opened by MohamedAliRashad - 5
Precompiled: Error("invalid type: null, expected a borrowed string", line : 1, column: 28)
#1645 opened by vicantwin - 2
STATUS_ENTRYPOINT_NOT_FOUND
#1623 opened by impurity-dev - 1
Adding tokens to a tokenizer with subword support?
#1637 opened by noamgat - 4
Access utf-8 byte sequence for each token
#1628 opened by DanielHesslow - 6
Cannot inject custom PreTokenizer into Tokenizer
#1634 opened by Old-Shatterhand - 2
README.md contains non-functional code
#1633 opened by ahenkes1 - 0
Tokenizer Quickstart Tutorial: Broken Links
#1625 opened by SinaMostafanejad - 1
Space after unnormalized token is added when `use_fast=True` for Llama tokenizer
#1613 opened by Butanium - 6
BPE trainer ignoring special tokens.
#1616 opened by henrycharlesworth - 0
.NET bindings
#1615 opened by sappho192