huggingface/tokenizers
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
Rust · Apache-2.0
Issues
- Strange warnings with tokenizer for some models (#1528, opened by EricLBuehler, 2 comments)
- How to Batch-Encode Paired Input Sentences with Tokenizers: Seeking Clarification (#1531, opened by insookim43, 3 comments)
- Extended vocab tokenizer merging text into a single string without spaces while decoding (#1501, opened by savanth14, 4 comments)
- Train tokenizer on integer lists, not strings (#1471, opened by rteehas, 0 comments)
- Bug with `CodeQwen1.5`: `data did not match any variant of untagged enum PyPreTokenizerTypeWrapper` (#1529, opened by QwertyJack, 3 comments)
- error: casting `&T` to `&mut T` is undefined behavior (#1485, opened by Jipok, 1 comment)
- Deepseeker model completely loses performance after using tokenizer.add_tokens(special_tokens) (#1490, opened by bin123apple, 1 comment)
- Special token handling breaks idempotency of sentencepiece due to extra spaces (#1527, opened by cat-state, 0 comments)
- How to write custom Wordpiece class? (#1525, opened by xinyinan9527, 0 comments)
- Convert huggingface tokenizer into sentencepiece format (#1524, opened by RRaphaell, 4 comments)
- Loading `tokenizer.model` with Rust API (#1518, opened by EricLBuehler, 3 comments)
- Breaking changes in v0.19.1 for tiktoken/llama3 (#1512, opened by sanderland, 0 comments)
- ❓ Get stats (e.g. counts) about the merged pairs (#1523, opened by pietrolesci, 1 comment)
- BPE Trainer doesn't respect the `vocab_size` parameter when dataset size is increased (#1514, opened by Abhinay1997, 2 comments)
- Discrepancy Between GitHub Release and NPM Package Version & Missing Dependencies (#1489, opened by superBertBerg, 0 comments)
- tokenizers-linux-x64-musl is not found when running inside node alpine docker (#1480, opened by madhurjya-acko, 3 comments)
- BPE Decoder cleanup option (#1474, opened by w-zygmuntowicz, 2 comments)
- BpeTrainer seems to ignore max_token_length=1 (#1461, opened by geajack, 1 comment)
- Llama3 tokenizer with incorrect offset_mapping (#1517, opened by justin-shao, 5 comments)
- Why are 'unknown' tokens randomly added to my tokenized input? (#1520, opened by tshmak, 1 comment)
- Why is the tokenizer slower than tiktoken? (#1519, opened by BigBinnie, 3 comments)
- Issue merging across whitespaces (#1475, opened by henrycharlesworth, 0 comments)
- Tokens Removed from Trained Custom BPE Tokenizer (#1516, opened by rteehas, 0 comments)
- UnigramTrainer: byte_fallback is false (#1515, opened by Moddus, 0 comments)
- Cross-compilation fails for custom target (#1509, opened by semaraugusto, 3 comments)
- New update causes add_special_tokens not recognized (#1466, opened by sravell, 2 comments)
- `cargo build` fails for python bindings when `--locked` is passed for `v0.15.1` and `v0.15.2` (#1477, opened by CobaltCause, 0 comments)
- Treatment of hyphenated words (#1507, opened by rattle99, 1 comment)
- Failing to build bindings with 0.19.1 (#1505, opened by bryteise, 1 comment)
- Unsound use of unsafe in `src/utils/parallelism.rs` (#1491, opened by albertsgarde, 1 comment)
- Tokens display issues (#1470, opened by jordane95, 1 comment)
- Offline installation (#1502, opened by HankLiu10, 0 comments)
- Potential vulnerability: Control token injection through Jinja templates in apply_chat_template (#1458, opened by pluiez, 3 comments)
- Training a tokenizer with limited memory (#1460, opened by arxyzan, 2 comments)
- Tokenizer dataset is very slow (#1464, opened by ManuSinghYadav, 0 comments)
- StripAccents doesn't work (#1496, opened by NivinaNull, 1 comment)
- LLamaTokenizer with `use_fast=True` / `use_fast=False` causing memory leak when used with multiprocessing / `dataset.map(num_proc)` (#1495, opened by michaelfeil, 2 comments)
- Support operating computer system (#1457, opened by Southpika, 1 comment)