Gadersd/whisper-burn

Trouble importing finetuned HuggingFace model

AlienKevin opened this issue · 13 comments

Hi, thanks for developing this awesome Whisper implementation! I'm looking to deploy a small Whisper model I fine-tuned using HuggingFace transformers. The model is supposed to generate Cantonese romanizations, and the language was set to English during training because the romanizations use the same ASCII letters. The primary motivation is to take advantage of burn's wgpu backend for cross-platform deployment to both iOS and Android users. Before trying your library, I managed to get my fine-tuned model running on iOS using whisper.cpp, but I'd prefer a Rust backend for portability.

For my experiment with importing the model into whisper-burn, I first converted the HuggingFace model to Whisper's pt format using a script (see step 1 of this issue). I then followed the steps in the README and successfully converted the model to the burn format. However, when I ran inference with my model, it produced garbage transcripts on the provided audio16k.wav as well as on my own test audio. For example, audio16k.wav produced a transcript of "onbed", even though the model should still recognize English inputs in addition to Cantonese.

I'm wondering if it's possible for you to support importing HuggingFace models directly into whisper-burn? That would make it easier to eliminate intermediate bugs in the conversion pipeline. Maybe the convert-h5-to-ggml script from whisper.cpp could come in handy? Thanks.

The conversion script you mentioned seems to work. I ran some tests, and the issue is that multilingual models do not work in general, while the English-only models all seem to work. I have no idea what the reason is.

Update:
I realized that the issue is the tokenizer: the multilingual models use a different tokenizer than the English-only models. I should have it working tomorrow.
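For context, the fix amounts to picking the tokenizer that matches the checkpoint. Here's a minimal sketch of that selection logic, assuming the HuggingFace tokenizers crate and hypothetical file paths (not necessarily how whisper-burn wires it up):

```rust
use tokenizers::Tokenizer;

// Multilingual Whisper checkpoints ship a different vocabulary and
// special-token layout than the English-only ".en" checkpoints, so
// loading the wrong tokenizer yields garbage text. The file paths
// below are hypothetical; only the selection logic is the point.
fn load_tokenizer(multilingual: bool) -> tokenizers::Result<Tokenizer> {
    let path = if multilingual {
        "multilingual/tokenizer.json" // hypothetical path
    } else {
        "gpt2/tokenizer.json" // hypothetical path
    };
    Tokenizer::from_file(path)
}
```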

It should work now. I tested your model on the English sample audio and it now outputs what looks like pinyin without accents. Perhaps you fine-tuned it so strongly that it can no longer perform English transcription?

Oh great, I will give it a try and see.

I can verify that the decoder is working! Thanks a lot for your work! 🙏
There are two remaining issues:

  1. Is there a way to adjust the decoding strategy, e.g. beam search vs. greedy, beam size, etc.? For one test audio, I saw some repetition at the end that might have to do with the decoding strategy:
nei tai keoi gin dou leng zai zau wan sai long $$$ audio ends here, repetition starts here $$$ nei tai keoi gin dou leng zai zau wan sai long nei
  2. The inference is quite slow and uses only a single CPU core on macOS, even though I specified the wgpu-backend feature. I wonder if there's any setup I missed with regard to Metal support? This is the command I ran:
cargo run --release --features wgpu-backend --bin transcribe small test_yue2.wav en transcription.txt

PS: I'm running macOS Ventura on an M1 Max.
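For reference, this is roughly how I understand burn's wgpu backend gets selected. A sketch based on the burn-wgpu crate around burn 0.8 (an assumption on my part, not whisper-burn's exact code):

```rust
use burn_wgpu::{AutoGraphicsApi, WgpuBackend, WgpuDevice};

// BestAvailable should pick the Metal adapter on an M1 Mac, so
// single-core CPU behavior suggests a fallback adapter is being
// chosen instead. Type parameters follow the 0.x-era burn-wgpu API.
type Backend = WgpuBackend<AutoGraphicsApi, f32, i32>;

fn main() {
    let device = WgpuDevice::BestAvailable;
    // ... load the model onto `device` and run transcription ...
    let _ = device;
}
```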

I reactivated repeat detection. Let me know if you still encounter repetitions. I'll implement caching later, which should improve performance.
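For anyone curious, the core idea behind this kind of repeat detection is simple. A simplified sketch, not the exact code in the repo:

```rust
// Detect token-level repetition during decoding: report true when the
// last `n` generated tokens already occur earlier in the sequence, so
// the decoding loop can stop or penalize the hypothesis.
fn repeats_ngram(tokens: &[u32], n: usize) -> bool {
    if tokens.len() < 2 * n {
        return false;
    }
    let tail = &tokens[tokens.len() - n..];
    tokens[..tokens.len() - n].windows(n).any(|w| w == tail)
}
```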

Thanks for the explanation. I tried again using the latest commit, but the repetition issue unfortunately persisted. I think it's because the current repetition filter does not apply to whole sentences like "nei tai keoi gin dou leng zai zau wan sai long". However, a maximum generation length argument might prevent the model from repeating at the end and also save some decoding time.
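To illustrate what I mean, here's a sketch of the cap, with `decode_step` as a hypothetical stand-in for one decoder forward pass (224 is roughly half of Whisper's 448-token text context, which is the budget OpenAI's reference decoder uses per 30 s window):

```rust
const MAX_NEW_TOKENS: usize = 224;

// Hypothetical stand-in for one decoder forward pass + token selection.
fn decode_step(_tokens: &[u32]) -> u32 {
    0
}

// Bound the number of generated tokens so runaway repetition stops
// early and decoding time is saved, even if end-of-text never appears.
fn generate(mut tokens: Vec<u32>, eot: u32) -> Vec<u32> {
    for _ in 0..MAX_NEW_TOKENS {
        let next = decode_step(&tokens);
        tokens.push(next);
        if next == eot {
            break;
        }
    }
    tokens
}
```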

I tested more and discovered a "trick" that reliably causes repetition: lengthen the last word. Here's a link to three audio clips that caused repetition on my model: https://on.soundcloud.com/2Qw3J The number of repetitions seems to depend on the length of the last word, and maybe on the overall sentence length.

I added a beam(ish) search and the transcription quality seems to have improved significantly. If you find the time to test it, let me know how it goes.
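In essence it's the standard beam expansion. A heavily simplified sketch (the actual "beam(ish)" search in the repo may differ):

```rust
// Each hypothesis is a token prefix plus its cumulative log-probability.
// One step expands every hypothesis over the whole vocabulary and keeps
// the `beam_width` best continuations. `log_probs` is a hypothetical
// model call returning one log-probability per vocabulary token.
fn beam_step(
    beams: Vec<(Vec<usize>, f32)>,
    beam_width: usize,
    log_probs: impl Fn(&[usize]) -> Vec<f32>,
) -> Vec<(Vec<usize>, f32)> {
    let mut candidates = Vec::new();
    for (prefix, score) in beams {
        for (tok, lp) in log_probs(&prefix).into_iter().enumerate() {
            let mut next = prefix.clone();
            next.push(tok);
            candidates.push((next, score + lp));
        }
    }
    // Highest score first; assumes no NaNs in the log-probabilities.
    candidates.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    candidates.truncate(beam_width);
    candidates
}
```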

I tried the latest beam search, but unfortunately the repetition issues persisted on the three SoundCloud audio samples.

@Gadersd I think you would have to implement something like the LocalAgreement-n policy to get rid of the repetitions, as done here: https://github.com/ufal/whisper_streaming
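Roughly, the policy decodes a growing audio buffer repeatedly and only commits tokens on which consecutive passes agree, so unstable trailing tokens (where the repetitions appear) are never emitted. A minimal sketch of the n = 2 case; see the linked repo for the real streaming policy:

```rust
// Commit only the longest common prefix of the previous and current
// hypotheses; everything past the agreement point stays uncommitted
// until a later pass confirms it.
fn agreed_prefix<'a>(prev: &'a [u32], curr: &[u32]) -> &'a [u32] {
    let n = prev.iter().zip(curr).take_while(|(a, b)| a == b).count();
    &prev[..n]
}
```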