Rust: How to handle models with `precompiled_charsmap = null`
kallebysantos opened this issue · 5 comments
Hi guys,
I'm currently working on supabase/edge-runtime#368, which aims to add a Rust implementation of pipeline().
While implementing the translation task, I found that I can't load the Tokenizer instance for the Xenova/opus-mt-en-fr onnx model or the other opus-mt-* variants.
I got the following:
let tokenizer_path = Path::new("opus-mt-en-fr/tokenizer.json");
let tokenizer = Tokenizer::from_file(tokenizer_path).unwrap();
thread 'main' panicked at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/normalizers/mod.rs:143:26:
Precompiled: Error("invalid type: null, expected a borrowed string", line: 1, column: 28)
stack backtrace:
0: rust_begin_unwind
at /rustc/80eb5a8e910e5185d47cdefe3732d839c78a5e7e/library/std/src/panicking.rs:662:5
1: core::panicking::panic_fmt
at /rustc/80eb5a8e910e5185d47cdefe3732d839c78a5e7e/library/core/src/panicking.rs:74:14
2: core::result::unwrap_failed
at /rustc/80eb5a8e910e5185d47cdefe3732d839c78a5e7e/library/core/src/result.rs:1679:5
3: core::result::Result<T,E>::expect
at /rustc/80eb5a8e910e5185d47cdefe3732d839c78a5e7e/library/core/src/result.rs:1059:23
4: <tokenizers::normalizers::NormalizerWrapper as serde::de::Deserialize>::deserialize
at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/normalizers/mod.rs:139:25
5: <serde::de::impls::OptionVisitor<T> as serde::de::Visitor>::visit_some
at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.207/src/de/impls.rs:916:9
6: <&mut serde_json::de::Deserializer<R> as serde::de::Deserializer>::deserialize_option
at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_json-1.0.124/src/de.rs:1672:18
7: serde::de::impls::<impl serde::de::Deserialize for core::option::Option<T>>::deserialize
at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.207/src/de/impls.rs:935:9
8: <core::marker::PhantomData<T> as serde::de::DeserializeSeed>::deserialize
at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.207/src/de/mod.rs:801:9
9: <serde_json::de::MapAccess<R> as serde::de::MapAccess>::next_value_seed
at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_json-1.0.124/src/de.rs:2008:9
10: serde::de::MapAccess::next_value
at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.207/src/de/mod.rs:1874:9
11: <tokenizers::tokenizer::serialization::TokenizerVisitor<M,N,PT,PP,D> as serde::de::Visitor>::visit_map
at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/tokenizer/serialization.rs:132:55
12: <&mut serde_json::de::Deserializer<R> as serde::de::Deserializer>::deserialize_struct
at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_json-1.0.124/src/de.rs:1840:31
13: tokenizers::tokenizer::serialization::<impl serde::de::Deserialize for tokenizers::tokenizer::TokenizerImpl<M,N,PT,PP,D>>::deserialize
at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/tokenizer/serialization.rs:62:9
14: <tokenizers::tokenizer::_::<impl serde::de::Deserialize for tokenizers::tokenizer::Tokenizer>::deserialize::__Visitor as serde::de::Visitor>::visit_newtype_struct
at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/tokenizer/mod.rs:408:21
15: <&mut serde_json::de::Deserializer<R> as serde::de::Deserializer>::deserialize_newtype_struct
at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_json-1.0.124/src/de.rs:1723:9
16: tokenizers::tokenizer::_::<impl serde::de::Deserialize for tokenizers::tokenizer::Tokenizer>::deserialize
at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/tokenizer/mod.rs:408:21
17: serde_json::de::from_trait
at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_json-1.0.124/src/de.rs:2478:22
18: serde_json::de::from_str
at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_json-1.0.124/src/de.rs:2679:5
19: tokenizers::tokenizer::Tokenizer::from_file
at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/tokenizer/mod.rs:439:25
20: transformers_rs::pipeline::tasks::seq_to_seq::seq_to_seq
at ./src/pipeline/tasks/seq_to_seq.rs:51:21
21: app::main
at ./examples/app/src/main.rs:78:5
22: core::ops::function::FnOnce::call_once
at /rustc/80eb5a8e910e5185d47cdefe3732d839c78a5e7e/library/core/src/ops/function.rs:250:5
I know that it occurs because their tokenizer.json file contains the following:
opus-mt-en-fr:
"normalizer": {
"type": "Precompiled",
"precompiled_charsmap": null
}
Whereas the expected format should look something like this:
nllb-200-distilled-600M:
"normalizer": {
"type": "Sequence",
"normalizers": [
{
"type": "Precompiled",
"precompiled_charsmap": "ALQCAACEAAA..."
}
]
}
Looking at the original Helsinki-NLP/opus-mt-en-fr model, I noticed that there is no tokenizer.json file for it.
I would like to know: does precompiled_charsmap necessarily expect a non-null value? Maybe it could be handled as an Option<_>?
Is there a workaround to run these models without changing the internal model files?
How can I handle an exported onnx model that doesn't have a tokenizer.json file?
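For anyone hitting the same error, one possible in-memory workaround (just a sketch, assuming the tokenizers and serde_json crates; load_tokenizer_ignoring_null_charsmap is a hypothetical helper, not part of any crate) is to parse the file as generic JSON, drop the null Precompiled normalizer, and build the Tokenizer from the patched bytes, leaving the file on disk untouched:

```rust
use std::{error::Error, fs, path::Path};
use tokenizers::Tokenizer;

// Hypothetical helper: load a tokenizer.json whose `Precompiled` normalizer
// has `precompiled_charsmap: null`, without editing the file on disk.
fn load_tokenizer_ignoring_null_charsmap(
    path: &Path,
) -> Result<Tokenizer, Box<dyn Error + Send + Sync>> {
    // Parse the file as generic JSON first so the normalizer can be inspected.
    let raw = fs::read_to_string(path)?;
    let mut json: serde_json::Value = serde_json::from_str(&raw)?;

    // Treat a `Precompiled` normalizer with a null charsmap as "no normalizer",
    // so deserialization does not panic.
    let is_null_precompiled = json
        .get("normalizer")
        .map(|n| {
            n.get("type").and_then(|t| t.as_str()) == Some("Precompiled")
                && n.get("precompiled_charsmap").map_or(false, |v| v.is_null())
        })
        .unwrap_or(false);
    if is_null_precompiled {
        json["normalizer"] = serde_json::Value::Null;
    }

    // Rebuild the tokenizer from the patched bytes.
    Tokenizer::from_bytes(serde_json::to_vec(&json)?)
}
```

Note that this skips normalization entirely for these models, which may or may not match the original SentencePiece behavior.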
I'm seeing the same error with Python when trying to read the tokenizer from Xenova/speecht5_tts.
wget https://huggingface.co/Xenova/speecht5_tts/resolve/main/tokenizer.json
from tokenizers import Tokenizer
Tokenizer.from_file("tokenizer.json")
thread '<unnamed>' panicked at /Users/runner/work/tokenizers/tokenizers/tokenizers/src/normalizers/mod.rs:143:26:
Precompiled: Error("invalid type: null, expected a borrowed string", line: 1, column: 28)
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
...
pyo3_runtime.PanicException: Precompiled: Error("invalid type: null, expected a borrowed string", line: 1, column: 28)
With Tokenizers 0.19.0, this raised an error which could be handled rather than a panic. It looks like this may be related to #1604.
I think passing a "" might work. cc @xenova not sure why you end up with nulls there, but we can probably sync and I can add support for Option!
Xenova's implementation doesn't access the value directly but applies iterators over the config's normalizers, so I think it just ignores null values.
I agree with you, adding support for Option<> may solve it.
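For illustration only (this is not the tokenizers internals, just a minimal standalone serde sketch with a hypothetical PrecompiledConfig struct), modelling the field as Option<String> is what would let a null charsmap deserialize cleanly:

```rust
use serde::Deserialize;

// Illustrative config struct: making the charsmap an `Option` lets
// `"precompiled_charsmap": null` deserialize to `None` instead of erroring.
#[derive(Debug, Deserialize)]
struct PrecompiledConfig {
    #[serde(rename = "type")]
    kind: String,
    // Accepts either a base64 string or `null`.
    precompiled_charsmap: Option<String>,
}

fn main() -> Result<(), serde_json::Error> {
    let with_null = r#"{ "type": "Precompiled", "precompiled_charsmap": null }"#;
    let cfg: PrecompiledConfig = serde_json::from_str(with_null)?;
    assert!(cfg.precompiled_charsmap.is_none());
    println!("parsed: {cfg:?}");
    Ok(())
}
```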
I've implemented spm_precompiled with null support at vicantwin/spm_precompiled, including a test for the null case, and all tests pass.
However, I need some help with changing this repository, since I'm not entirely familiar with this codebase and I'm unsure how to implement the necessary changes. Any help would be greatly appreciated.