Loading `tokenizer.model` with Rust API
EricLBuehler opened this issue · 5 comments
Hello all,
Thank you for your excellent work here. I am trying to load a `tokenizer.model` file in my Rust application. However, it seems that the `Tokenizer::from_file` function only supports loading from a `tokenizer.json` file. This causes problems, as using a small script to save the `tokenizer.json` is error-prone and hard for users to discover. Is there a way to load a `tokenizer.model` file?
You cannot load a `tokenizer.model` directly; you need to write a converter.
This is because the file does not come from the `tokenizers` library but from either `tiktoken` or `sentencepiece`, and there is no secret recipe. We need to adapt to the content of the file, which is not entirely straightforward.
https://github.com/huggingface/transformers/blob/main/src/transformers/convert_slow_tokenizer.py#L544 is the simplest way to understand the process!
Ok, I understand. Do you know of a way or a library to do this in Rust without reaching for the Python transformers converter?
There is no library, but we should be able to come up with a small piece of Rust code to do this 😉
@ArthurZucker are there any specifications or example loaders which I can look at to implement this?
I also have the same question, for LLaVA reasons 😉
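For anyone looking for a starting point, here is a minimal sketch of one step a converter might perform for a BPE-style `sentencepiece` model: rebuilding the vocabulary and the merge list from the ordered pieces. It assumes the pieces have already been decoded from the protobuf in `tokenizer.model` (not shown here), and that a piece's id can serve as its merge rank, which is roughly the idea the linked Python converter appears to follow. The function name `derive_merges` is hypothetical and not part of the `tokenizers` crate; a real converter would also need to handle the normalizer, pre-tokenizer, special tokens, and byte fallback.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not a `tokenizers` API): derive a vocab map and an
/// ordered BPE merge list from pieces, where `pieces[i]` has token id `i`.
/// Assumption: the id doubles as the merge rank (lower id = earlier merge).
fn derive_merges(pieces: &[&str]) -> (HashMap<String, u32>, Vec<(String, String)>) {
    // Map every piece to its id.
    let vocab: HashMap<String, u32> = pieces
        .iter()
        .enumerate()
        .map(|(id, piece)| (piece.to_string(), id as u32))
        .collect();

    // Each candidate is (rank of merged piece, left id, right id, left, right).
    let mut candidates: Vec<(u32, u32, u32, String, String)> = Vec::new();
    for (piece, &rank) in &vocab {
        // Try every split point that falls on a character boundary,
        // skipping index 0 so that both halves are non-empty.
        for (idx, _) in piece.char_indices().skip(1) {
            let (left, right) = piece.split_at(idx);
            // A split is a valid merge only if both halves are known pieces.
            if let (Some(&l), Some(&r)) = (vocab.get(left), vocab.get(right)) {
                candidates.push((rank, l, r, left.to_string(), right.to_string()));
            }
        }
    }

    // Sort by the rank of the merged piece, then by the ids of the halves,
    // so the result does not depend on hash map iteration order.
    candidates.sort();

    let merges = candidates
        .into_iter()
        .map(|(_, _, _, left, right)| (left, right))
        .collect();
    (vocab, merges)
}
```

Calling `derive_merges(&["a", "b", "c", "ab", "abc"])` would yield the merges `("a", "b")` followed by `("ab", "c")`. The resulting vocab and merges are the kind of data a BPE model definition in `tokenizer.json` is built from, though the exact JSON layout should be checked against a file saved by the Python library.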