huggingface/tokenizers

Rust: How to handle models with `precompiled_charsmap = null`

kallebysantos opened this issue · 5 comments

Hi guys,
I'm currently working on supabase/edge-runtime#368, which aims to add a Rust implementation of pipeline().

While coding the translation task, I found that I can't load a Tokenizer instance for the Xenova/opus-mt-en-fr ONNX model, nor for the other opus-mt-* variants.

I got the following:
use std::path::Path;
use tokenizers::Tokenizer;

let tokenizer_path = Path::new("opus-mt-en-fr/tokenizer.json");
let tokenizer = Tokenizer::from_file(tokenizer_path).unwrap();
thread 'main' panicked at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/normalizers/mod.rs:143:26:
Precompiled: Error("invalid type: null, expected a borrowed string", line: 1, column: 28)
stack backtrace:
   0: rust_begin_unwind
             at /rustc/80eb5a8e910e5185d47cdefe3732d839c78a5e7e/library/std/src/panicking.rs:662:5
   1: core::panicking::panic_fmt
             at /rustc/80eb5a8e910e5185d47cdefe3732d839c78a5e7e/library/core/src/panicking.rs:74:14
   2: core::result::unwrap_failed
             at /rustc/80eb5a8e910e5185d47cdefe3732d839c78a5e7e/library/core/src/result.rs:1679:5
   3: core::result::Result<T,E>::expect
             at /rustc/80eb5a8e910e5185d47cdefe3732d839c78a5e7e/library/core/src/result.rs:1059:23
   4: <tokenizers::normalizers::NormalizerWrapper as serde::de::Deserialize>::deserialize
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/normalizers/mod.rs:139:25
   5: <serde::de::impls::OptionVisitor<T> as serde::de::Visitor>::visit_some
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.207/src/de/impls.rs:916:9
   6: <&mut serde_json::de::Deserializer<R> as serde::de::Deserializer>::deserialize_option
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_json-1.0.124/src/de.rs:1672:18
   7: serde::de::impls::<impl serde::de::Deserialize for core::option::Option<T>>::deserialize
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.207/src/de/impls.rs:935:9
   8: <core::marker::PhantomData<T> as serde::de::DeserializeSeed>::deserialize
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.207/src/de/mod.rs:801:9
   9: <serde_json::de::MapAccess<R> as serde::de::MapAccess>::next_value_seed
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_json-1.0.124/src/de.rs:2008:9
  10: serde::de::MapAccess::next_value
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.207/src/de/mod.rs:1874:9
  11: <tokenizers::tokenizer::serialization::TokenizerVisitor<M,N,PT,PP,D> as serde::de::Visitor>::visit_map
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/tokenizer/serialization.rs:132:55
  12: <&mut serde_json::de::Deserializer<R> as serde::de::Deserializer>::deserialize_struct
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_json-1.0.124/src/de.rs:1840:31
  13: tokenizers::tokenizer::serialization::<impl serde::de::Deserialize for tokenizers::tokenizer::TokenizerImpl<M,N,PT,PP,D>>::deserialize
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/tokenizer/serialization.rs:62:9
  14: <tokenizers::tokenizer::_::<impl serde::de::Deserialize for tokenizers::tokenizer::Tokenizer>::deserialize::__Visitor as serde::de::Visitor>::visit_newtype_struct
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/tokenizer/mod.rs:408:21
  15: <&mut serde_json::de::Deserializer<R> as serde::de::Deserializer>::deserialize_newtype_struct
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_json-1.0.124/src/de.rs:1723:9
  16: tokenizers::tokenizer::_::<impl serde::de::Deserialize for tokenizers::tokenizer::Tokenizer>::deserialize
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/tokenizer/mod.rs:408:21
  17: serde_json::de::from_trait
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_json-1.0.124/src/de.rs:2478:22
  18: serde_json::de::from_str
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_json-1.0.124/src/de.rs:2679:5
  19: tokenizers::tokenizer::Tokenizer::from_file
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/tokenizer/mod.rs:439:25
  20: transformers_rs::pipeline::tasks::seq_to_seq::seq_to_seq
             at ./src/pipeline/tasks/seq_to_seq.rs:51:21
  21: app::main
             at ./examples/app/src/main.rs:78:5
  22: core::ops::function::FnOnce::call_once
             at /rustc/80eb5a8e910e5185d47cdefe3732d839c78a5e7e/library/core/src/ops/function.rs:250:5

I know that this occurs because their tokenizer.json contains the following:

opus-mt-en-fr:

"normalizer": {
    "type": "Precompiled",
    "precompiled_charsmap": null
}

By contrast, a tokenizer.json that loads correctly looks like this:

nllb-200-distilled-600M:

"normalizer": {
  "type": "Sequence",
  "normalizers": [
    {
      "type": "Precompiled",
      "precompiled_charsmap": "ALQCAACEAAA..."
    }
  ]
}

Looking at the original Helsinki-NLP/opus-mt-en-fr repository, I notice there is no tokenizer.json file for it.

I would like to know: is precompiled_charsmap required to be non-null?

Maybe it could be handled as an Option<_>?

Is there some workaround to run these models without changing the internal model files?
How can I handle an exported ONNX model that doesn't include a tokenizer.json file?
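As a possible workaround (a sketch, not an official fix), the tokenizer.json can be patched before loading: recursively replace any null precompiled_charsmap with an empty string, which is what the maintainers later suggest might work. The file paths and the assumption that "" behaves like "no charsmap" are mine, not confirmed by the library.

```python
import json

def patch_null_charsmap(node):
    """Recursively replace a null precompiled_charsmap with "" so the
    Precompiled normalizer can be deserialized."""
    if isinstance(node, dict):
        if node.get("type") == "Precompiled" and node.get("precompiled_charsmap") is None:
            node["precompiled_charsmap"] = ""
        for value in node.values():
            patch_null_charsmap(value)
    elif isinstance(node, list):
        for item in node:
            patch_null_charsmap(item)
    return node

# Minimal reproduction of the opus-mt-en-fr normalizer config:
raw = '{"normalizer": {"type": "Precompiled", "precompiled_charsmap": null}}'
patched = patch_null_charsmap(json.loads(raw))
print(json.dumps(patched))
```

The patched JSON can then be written back to disk and loaded with Tokenizer::from_file as usual. Whether an empty charsmap is actually behaviorally equivalent to null is exactly the open question in this thread.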

I'm seeing the same error with Python when trying to read the tokenizer from Xenova/speecht5_tts.

wget https://huggingface.co/Xenova/speecht5_tts/resolve/main/tokenizer.json
from tokenizers import Tokenizer

Tokenizer.from_file("tokenizer.json")
thread '<unnamed>' panicked at /Users/runner/work/tokenizers/tokenizers/tokenizers/src/normalizers/mod.rs:143:26:
Precompiled: Error("invalid type: null, expected a borrowed string", line: 1, column: 28)
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
...
pyo3_runtime.PanicException: Precompiled: Error("invalid type: null, expected a borrowed string", line: 1, column: 28)

With Tokenizers 0.19.0, this raised an error which could be handled rather than a panic. It looks like this may be related to #1604.

I'm also facing the same issue (#1645) with speecht5_tts.

I think passing a "" might work. cc @xenova, not sure why you end up with nulls there, but we can probably sync and I can add support for Option!

The Xenova (transformers.js) implementation doesn't read the value directly; it iterates over the configured normalizers and, I think, simply skips the null values.

I agree with you; adding support for Option<_> may solve it.

I've implemented null support for spm_precompiled at vicantwin/spm_precompiled, including a test for the null case, and all tests pass.

But I need some help changing this repository: I'm not entirely familiar with the codebase and I'm unsure how to implement the necessary changes. Any help would be greatly appreciated.