huggingface/tokenizers

Loading `tokenizer.model` with the Rust API

EricLBuehler opened this issue · 5 comments

Hello all,

Thank you for your excellent work here. I am trying to load a tokenizer.model file in my Rust application, but it seems that the Tokenizer::from_file function only supports loading from a tokenizer.json file. This is a problem because asking users to run a small script just to re-save the tokenizer as tokenizer.json is error-prone and hard to discover. Is there a way to load a tokenizer.model file directly?
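
For context, loading an existing tokenizer.json works as expected (a minimal example; the path is just a placeholder):

```rust
use tokenizers::Tokenizer;

fn main() -> tokenizers::Result<()> {
    // This works today, but only for the tokenizer.json format.
    let tokenizer = Tokenizer::from_file("tokenizer.json")?;
    let encoding = tokenizer.encode("Hello world", false)?;
    println!("{:?}", encoding.get_tokens());
    Ok(())
}
```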

You cannot load a tokenizer.model directly; you need to write a converter.
The file does not come from the tokenizers library but from either tiktoken or sentencepiece, and there is no secret recipe: the converter has to adapt to the content of the file, which is not super straightforward.

https://github.com/huggingface/transformers/blob/main/src/transformers/convert_slow_tokenizer.py#L544 is the simplest way to understand the process!
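
To make the shape of the conversion concrete, here is a minimal Rust sketch of the Unigram case. It assumes the `(piece, score)` pairs have already been extracted from the sentencepiece `ModelProto` inside tokenizer.model (the part the Python converter handles with protobuf), so the inline vocab below is only a toy placeholder, and the three-argument `Unigram::from` is the signature from recent versions of the crate:

```rust
use tokenizers::models::unigram::Unigram;
use tokenizers::Tokenizer;

fn main() -> tokenizers::Result<()> {
    // Placeholder data: a real converter would read these (piece, score)
    // pairs out of the sentencepiece ModelProto in tokenizer.model.
    let pieces: Vec<(String, f64)> = vec![
        ("<unk>".into(), 0.0),
        ("▁hello".into(), -1.5),
        ("▁world".into(), -2.0),
    ];

    // unk_id is the index of "<unk>" in `pieces`; the byte-fallback flag
    // mirrors the corresponding sentencepiece setting.
    let model = Unigram::from(pieces, Some(0), false)?;

    let tokenizer = Tokenizer::new(model);
    tokenizer.save("tokenizer.json", true)?;
    Ok(())
}
```

The missing piece is reading the proto itself and wiring up the matching normalizer, pre-tokenizer and decoder, which is exactly what convert_slow_tokenizer.py does case by case.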

Ok, I understand. Do you know of a library, or some other way, to do this in Rust without reaching for the Python transformers converter?

A library, no, but we should be able to come up with a small piece of Rust code to do this 😉
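
For the BPE-flavored sentencepiece models, the equivalent small piece of Rust would build a `BPE` model from an extracted vocab and merges. Same caveat as above: the data here is a toy placeholder standing in for what a real converter would pull out of tokenizer.model:

```rust
use std::collections::HashMap;

use tokenizers::models::bpe::BPE;
use tokenizers::Tokenizer;

fn main() -> tokenizers::Result<()> {
    // Toy placeholders: a real converter derives the vocab from the proto's
    // pieces and reconstructs the merge rules from their ranks/scores.
    let vocab: HashMap<String, u32> = HashMap::from([
        ("<unk>".to_string(), 0),
        ("▁".to_string(), 1),
        ("he".to_string(), 2),
        ("llo".to_string(), 3),
        ("hello".to_string(), 4),
        ("▁hello".to_string(), 5),
    ]);
    let merges = vec![
        ("he".to_string(), "llo".to_string()),
        ("▁".to_string(), "hello".to_string()),
    ];

    let model = BPE::builder()
        .vocab_and_merges(vocab, merges)
        .unk_token("<unk>".to_string())
        .build()?;

    let tokenizer = Tokenizer::new(model);
    tokenizer.save("tokenizer.json", true)?;
    Ok(())
}
```

The hard part is not building the model but faithfully getting the vocab, merges, and normalizer/pre-tokenizer settings out of the original file.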

@ArthurZucker are there any specifications or example loaders that I can look at to implement this?

I also have the same question, for llava reasons 😉