Where should I get the `decoder_model_merged` file from?
Hey,
I'm trying to use the whisper-web demo with my fine-tuned model.
After I managed to connect my model to the demo application, I started getting errors related to the following:
Basically, when transformers.js tries to load a Whisper model, it looks for files named decoder_model_merged.onnx / decoder_model_merged_quantized.onnx / decoder_model_merged_fp16.onnx. The thing is, the conversion script didn't create any of these files.
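For reference, here is a small sketch (the repo id is just an example of a known-good conversion, not my model) that lists the files of a converted repo on the Hub, to see which ONNX files transformers.js expects to find:

from huggingface_hub import list_repo_files

# List every file in a repo that is known to work with transformers.js
for filename in list_repo_files("Xenova/whisper-tiny.en"):
    print(filename)

# The output should include entries like onnx/decoder_model_merged.onnx
# and onnx/decoder_model_merged_quantized.onnx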
Here's what the conversion script output looks like:
Please help me figure out what I'm missing here.
P.S. Once I get this working, I'll be happy to open a PR on the whisper-web repository that enables using local models alongside remote (HF Hub) models.
Thanks!
I think it could be related to: xenova/whisper-web#24
Can you try with the Transformers.js v3 conversion script?
git clone -b v3 https://github.com/xenova/transformers.js.git
cd transformers.js
pip install -q -r scripts/requirements.txt
python -m scripts.convert --quantize --model_id MODEL_ID_GOES_HERE
Hey @xenova,
Sure, I'll give it a try.
Does that mean I should also update my transformers.js version? It's currently at ^2.7.0, according to the package.json of the whisper-web project.
Thanks!
Hey @xenova and everyone else who ends up here,
The problem was with how I ran the conversion script. Here's how it should be done:
python -m scripts.convert --quantize --model_id MODEL_ID_GOES_HERE --task automatic-speech-recognition-with-past
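A quick way to confirm the merged decoder files were actually produced (the path below assumes the converter's default models/&lt;model_id&gt;/onnx output layout, so adjust it if you used a different output directory):

from pathlib import Path

# Assumed default output location of scripts.convert; adjust to your own output dir
out_dir = Path("models/MODEL_ID_GOES_HERE/onnx")
for onnx_file in sorted(out_dir.glob("decoder_model_merged*.onnx")):
    print(onnx_file.name, f"{onnx_file.stat().st_size / 1e6:.1f} MB")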
Hey @xenova,
Now I'm getting: 'An error occurred during model execution: "Missing the following inputs: cache_position."'
What could be the issue?
@xenova, one last update (in the meantime):
By reverse engineering the onnx-community/whisper-* artifacts you uploaded to HF, we found that two things were causing this issue:
- Using the conversion script from the v3 branch with a large Whisper model led to the cache_position exception I attached above. It seems to be caused by cache_position becoming a required input in the newer transformers version pinned in the conversion scripts' requirements file, which the transformers.js or whisper-web (webgpu branch) code doesn't take into consideration yet.
- No matter what we tried (both the development and v3 branches), ONNX conversion of any of the large Whisper variants just doesn't work. It fails the ATOL validation no matter which atol we provided (even with an atol of 1). This happened only in the full-precision (fp32) conversion, which is required in order to run whisper-webgpu. When we tried lower precision (fp16) for the encoder, we got tons of exclamation marks in the model's output. A sketch of the comparison the validation performs is shown below.
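For anyone who wants to reproduce the ATOL check outside the conversion script, here is a rough sketch (the model id and ONNX path are placeholders, and this is not the exact optimum validation code): run the same dummy input through the PyTorch encoder and the exported fp32 ONNX encoder, then inspect the maximum absolute difference:

import numpy as np
import onnxruntime as ort
import torch
from transformers import AutoFeatureExtractor, WhisperModel

model_id = "openai/whisper-large-v3"  # placeholder: use the model you converted
onnx_encoder_path = "models/whisper-large-v3/onnx/encoder_model.onnx"  # placeholder path

extractor = AutoFeatureExtractor.from_pretrained(model_id)
pt_model = WhisperModel.from_pretrained(model_id).eval()

# 30 seconds of silence is enough to exercise the encoder
features = extractor(np.zeros(16000 * 30), sampling_rate=16000, return_tensors="pt").input_features

with torch.no_grad():
    pt_out = pt_model.encoder(features).last_hidden_state.numpy()

session = ort.InferenceSession(onnx_encoder_path)
onnx_out = session.run(None, {"input_features": features.numpy()})[0]

# The exporter's validation fails when this exceeds the chosen atol
print("max abs diff:", np.abs(pt_out - onnx_out).max())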
I would love to hear any further feedback from you, as we really want to integrate transformers.js into our codebase, but the issues above are currently blockers for us.
Thank you very much for your work!
I am also curious to learn more about the conversion flow; specifically, I'd like to know how timestamped models like this were trained.
I have also run into issues with lots of quantization variants simply not working.
Hey @xenova, Now I'm getting: 'An error occurred during model execution: "Missing the following inputs: cache_position."' What could be the issue?
Can you please help resolve this issue?
For me, if I run inference with the converted Whisper model through Python, it works. However, when doing the same through whisper-web / transformers.js, I receive this error message.
The Python inference code (I only changed the model path):
from transformers import AutoProcessor, pipeline
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from datasets import load_dataset

# Load the processor and the ONNX export (replace with your own model path)
processor = AutoProcessor.from_pretrained("optimum/whisper-tiny.en")
model = ORTModelForSpeechSeq2Seq.from_pretrained("optimum/whisper-tiny.en")

# Build an ASR pipeline on top of the ONNX model
speech_recognition = pipeline("automatic-speech-recognition", model=model, tokenizer=processor.tokenizer, feature_extractor=processor.feature_extractor)

# Transcribe a sample from a small test dataset
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
pred = speech_recognition(ds[0]["audio"]["array"])
print(pred["text"])