huggingface/tokenizers

How to load tokenizer trained by sentencepiece or tiktoken

jordane95 opened this issue · 5 comments

Hi, does this lib support loading pre-trained tokenizers trained by other libs, like sentencepiece and tiktoken? Many models on the HF Hub store their tokenizers in these formats.

For sentencepiece the conversion is mostly handled in transformers, and for tiktoken we don't have a converter directly 😢 It's planned for both!

@xenova if you can share some automations!

Here's my tiktoken-to-hf conversion script: https://gist.github.com/xenova/a452a6474428de0182b17605a98631ee
And then we already have a SPM converter :)
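
For readers who want a feel for what the tiktoken side involves, here is a minimal sketch of the vocabulary step only. It is not the linked script. It assumes the vocabulary is stored as one `<base64 token> <rank>` pair per line (the layout of tiktoken's published `.tiktoken` files) and that the target is a GPT-2 style byte-level BPE vocab, where every byte is mapped to a printable character. The helper names `parse_tiktoken_ranks` and `to_hf_vocab` and the three-token `SAMPLE` are made up for illustration. Recovering the merges list, which a complete BPE tokenizer also needs, is not covered here.

```python
import base64


def bytes_to_unicode():
    """GPT-2 style table mapping each byte value (0-255) to a printable character."""
    # Bytes that already render as visible characters map to themselves.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in keep}
    # Remaining bytes (controls, space, ...) are shifted to code points >= 256.
    offset = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping


def parse_tiktoken_ranks(text):
    """Parse lines of '<base64 token> <rank>' into {token_bytes: rank} (assumed layout)."""
    ranks = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        token_b64, rank = line.split()
        ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks


def to_hf_vocab(ranks):
    """Turn byte-string tokens into the printable strings used in a byte-level vocab."""
    byte_map = bytes_to_unicode()
    return {"".join(byte_map[b] for b in token): rank for token, rank in ranks.items()}


# Hypothetical three-token vocabulary: b"a", b" a", b"ab".
SAMPLE = "YQ== 0\nIGE= 1\nYWI= 2\n"

vocab = to_hf_vocab(parse_tiktoken_ranks(SAMPLE))
print(vocab)  # {'a': 0, 'Ġa': 1, 'ab': 2}
```

The leading space in `b" a"` comes out as `Ġ`, which is the same convention byte-level BPE vocab files on the Hub use, so the resulting dict has the shape a `vocab.json` would have.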
