Where should I get the `decoder_model_merged` file from?
Hey,
I'm trying to use the whisper-web demo with my fine-tuned model.
After I managed to connect my model to the demo application, I started getting errors related to the following:
Basically, when transformers.js tries to load a Whisper model, it looks for files named decoder_model_merged.onnx / decoder_model_merged_quantized.onnx / decoder_model_merged_fp16.onnx. The thing is, the conversion script didn't create any of these files.
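For reference, here is a small sketch (the repo id is just an example of a known-good conversion, not my model) that lists the files of a converted repo on the Hub, to see which ONNX files transformers.js expects to find:

from huggingface_hub import list_repo_files

# List every file in a repo that is known to work with transformers.js
for filename in list_repo_files("Xenova/whisper-tiny.en"):
    print(filename)

# The output should include entries like onnx/decoder_model_merged.onnx
# and onnx/decoder_model_merged_quantized.onnx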
Here's what the conversion script output looks like:
Please help me figure out what I'm missing here.
P.S. Once I get this working, I'll be happy to open a PR on the whisper-web repository that enables using local models alongside remote (HF Hub) models.
Thanks!
I think it could be related to: xenova/whisper-web#24
Can you try with the Transformers.js v3 conversion script?
git clone -b v3 https://github.com/xenova/transformers.js.git
cd transformers.js
pip install -q -r scripts/requirements.txt
python -m scripts.convert --quantize --model_id MODEL_ID_GOES_HERE
Hey @xenova,
Sure, I'll give it a try.
Does that mean I should also update my transformers.js version? It's currently at ^2.7.0, according to the package.json of the whisper-web project.
Thanks!
Hey @xenova and everyone else who ends up here,
The problem was with how I ran the conversion script. Here's how it should be done:
python -m scripts.convert --quantize --model_id MODEL_ID_GOES_HERE --task automatic-speech-recognition-with-past
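A quick way to confirm the merged decoder files were actually produced (the path below assumes the converter's default models/&lt;model_id&gt;/onnx output layout, so adjust it if you used a different output directory):

from pathlib import Path

# Assumed default output location of scripts.convert; adjust to your own output dir
out_dir = Path("models/MODEL_ID_GOES_HERE/onnx")
for onnx_file in sorted(out_dir.glob("decoder_model_merged*.onnx")):
    print(onnx_file.name, f"{onnx_file.stat().st_size / 1e6:.1f} MB")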
Hey @xenova,
Now I'm getting: 'An error occurred during model execution: "Missing the following inputs: cache_position."'
What could be the issue?
@xenova, one last update (in the meantime):
By reverse engineering the onnx-community/whisper-* artifacts you uploaded to HF, we found that two things were causing this issue:
- Using the conversion script from the v3 branch with a large Whisper model led to the cache_position exception I attached above. It seems to be caused by cache_position becoming a required input in the newer transformers version pinned in the conversion scripts' requirements file, which the transformers.js or whisper-web (webgpu branch) code doesn't take into consideration yet.
- No matter what we tried (both the development and v3 branches), ONNX conversion of any of the large Whisper variants just doesn't work. It fails the ATOL validation no matter which atol we provided (even with an atol of 1). This happened only in the full-precision (fp32) conversion, which is required in order to run whisper-webgpu. When we tried lower precision (fp16) for the encoder, we got tons of exclamation marks in the model's output. A sketch of the comparison the validation performs is shown below.
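For anyone who wants to reproduce the ATOL check outside the conversion script, here is a rough sketch (the model id and ONNX path are placeholders, and this is not the exact optimum validation code): run the same dummy input through the PyTorch encoder and the exported fp32 ONNX encoder, then inspect the maximum absolute difference:

import numpy as np
import onnxruntime as ort
import torch
from transformers import AutoFeatureExtractor, WhisperModel

model_id = "openai/whisper-large-v3"  # placeholder: use the model you converted
onnx_encoder_path = "models/whisper-large-v3/onnx/encoder_model.onnx"  # placeholder path

extractor = AutoFeatureExtractor.from_pretrained(model_id)
pt_model = WhisperModel.from_pretrained(model_id).eval()

# 30 seconds of silence is enough to exercise the encoder
features = extractor(np.zeros(16000 * 30), sampling_rate=16000, return_tensors="pt").input_features

with torch.no_grad():
    pt_out = pt_model.encoder(features).last_hidden_state.numpy()

session = ort.InferenceSession(onnx_encoder_path)
onnx_out = session.run(None, {"input_features": features.numpy()})[0]

# The exporter's validation fails when this exceeds the chosen atol
print("max abs diff:", np.abs(pt_out - onnx_out).max())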
I would love to hear any further feedback from you, as we really want to integrate transformers.js into our codebase, but the issues above are currently blockers for us.
Thank you very much for your work!
I am also curious to learn more about the conversion flow; specifically, I'd like to know how timestamped models like this were trained.
I have also run into issues with lots of quantization variants simply not working.
Hey @xenova, Now I'm getting: 'An error occurred during model execution: "Missing the following inputs: cache_position."' What could be the issue?
Can you please help resolve this issue?
For me, if I run inference with the converted Whisper model through Python, it works. However, when doing the same through whisper-web / transformers.js, I receive this error message.
The Python inference code (I only changed the model path):
from transformers import AutoProcessor, pipeline
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from datasets import load_dataset

# Load the processor and the ONNX export (replace with your own model path)
processor = AutoProcessor.from_pretrained("optimum/whisper-tiny.en")
model = ORTModelForSpeechSeq2Seq.from_pretrained("optimum/whisper-tiny.en")

# Build an ASR pipeline on top of the ONNX model
speech_recognition = pipeline("automatic-speech-recognition", model=model, tokenizer=processor.tokenizer, feature_extractor=processor.feature_extractor)

# Transcribe a sample from a small test dataset
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
pred = speech_recognition(ds[0]["audio"]["array"])
print(pred["text"])