How to load tokenizer trained by sentencepiece or tiktoken
jordane95 opened this issue · 5 comments
Hi, does this lib support loading pre-trained tokenizers trained by other libs, like sentencepiece and tiktoken? Many models on the HF Hub store their tokenizers in these formats.
For sentencepiece, this is mostly handled by transformers, and for tiktoken we don't have anything directly 😢 It's planned for both!
@xenova if you can share some automations!
Here's my tiktoken-to-hf conversion script: https://gist.github.com/xenova/a452a6474428de0182b17605a98631ee
And then we already have an SPM converter :)
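For anyone landing here, a minimal sketch of the sentencepiece route through transformers mentioned above. It assumes a transformers version with fast tokenizers, that `sentencepiece` (and possibly `protobuf`) is installed, and that the checkpoint directory contains a sentencepiece model that transformers knows how to convert. The function name `spm_to_tokenizers` and the paths are hypothetical.

```python
def spm_to_tokenizers(model_dir: str, out_path: str = "tokenizer.json"):
    """Load a sentencepiece-backed tokenizer via transformers and save it
    in the tokenizers JSON format.

    Assumption: `model_dir` (hypothetical) is a local directory or Hub repo id
    whose tokenizer transformers can convert to a fast tokenizer.
    """
    # Imported lazily so this module can be imported without transformers.
    from transformers import AutoTokenizer

    # use_fast=True asks transformers for the tokenizers-backed implementation,
    # converting the slow sentencepiece tokenizer if no tokenizer.json exists.
    fast = AutoTokenizer.from_pretrained(model_dir, use_fast=True)

    # backend_tokenizer is the underlying tokenizers.Tokenizer object.
    backend = fast.backend_tokenizer
    backend.save(out_path)
    return backend
```

The saved file should then load with `tokenizers.Tokenizer.from_file("tokenizer.json")`, without going through transformers again.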
Transformers now has an "official" tiktoken converter: https://github.com/huggingface/transformers/blob/main/src/transformers/convert_slow_tokenizer.py#L1478
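For context on what such a conversion starts from: a tiktoken vocabulary file is plain text with one `base64(token_bytes) rank` pair per line. The sketch below only reads that format and maps the token bytes to GPT-2 style byte-level strings, which is the representation byte-level BPE vocabularies in tokenizers typically use. It is not the code of the linked converter or of the gist above, it does not reconstruct merges, and the helper names `parse_tiktoken_ranks` and `to_byte_level_vocab` are made up for illustration.

```python
import base64


def bytes_to_unicode() -> dict[int, str]:
    """GPT-2 style map from each of the 256 byte values to a printable character."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Non-printable bytes are shifted to code points starting at 256.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


def parse_tiktoken_ranks(text: str) -> dict[bytes, int]:
    """Parse tiktoken vocabulary text: one 'base64(token) rank' pair per line."""
    ranks = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        token_b64, rank = line.split()
        ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks


def to_byte_level_vocab(ranks: dict[bytes, int]) -> dict[str, int]:
    """Map raw token bytes to byte-level strings (hypothetical helper)."""
    byte_map = bytes_to_unicode()
    return {"".join(byte_map[b] for b in token): rank for token, rank in ranks.items()}


# Tiny hand-made sample: the tokens b"h", b"hi", and b" hi".
sample = "aA== 0\naGk= 1\nIGhp 2\n"
vocab = to_byte_level_vocab(parse_tiktoken_ranks(sample))
print(vocab)  # {'h': 0, 'hi': 1, 'Ġhi': 2}
```

A full conversion also needs the merge list and the pre-tokenization regex, which is the part the converters linked in this thread take care of.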