Gadersd/whisper-burn

Trouble importing finetuned HuggingFace model

AlienKevin opened this issue · 13 comments

Hi, thanks for developing this awesome Whisper implementation! I'm looking to deploy a small Whisper model I fine-tuned using HuggingFace transformers. The model is supposed to generate Cantonese romanizations, and the language was set to English during training because the romanizations use the same ASCII letters. The primary motivation is to take advantage of burn's wgpu backend for cross-platform deployment to both iOS and Android users. Before trying your library, I managed to get my fine-tuned model running on iOS using whisper.cpp, but I'd prefer a Rust backend for portability.

For my experiment with importing the model into whisper-burn, I first converted the HuggingFace model to Whisper's pt format using a script (see step 1 of this issue). I then followed the steps in the README and successfully converted the model to the burn format. However, when I ran inference with my model, it produced garbage transcripts on the provided audio16k.wav as well as on my own test audio. For example, audio16k.wav produced a transcript of "onbed", even though the model should still recognize English inputs in addition to Cantonese.

I'm wondering if it's possible for you to support importing HuggingFace models directly into whisper-burn? That would make it easier to eliminate intermediate bugs in the conversion pipeline. Maybe the convert-h5-to-ggml script from whisper.cpp could come in handy? Thanks.

The conversion script you mentioned seems to work. I ran some tests, and the issue is that multilingual models do not work in general, while the English-only models all seem to work. I have no idea what the reason is.

Update:
I realized that the issue is the tokenizer: the multilingual models use a different tokenizer than the English-only models. I should have it working tomorrow.
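For context, the fix amounts to picking the tokenizer that matches the checkpoint. Here's a minimal sketch of that selection logic, assuming the HuggingFace tokenizers crate and hypothetical file paths (not necessarily how whisper-burn wires it up):

```rust
use tokenizers::Tokenizer;

// Multilingual Whisper checkpoints ship a different vocabulary and
// special-token layout than the English-only ".en" checkpoints, so
// loading the wrong tokenizer yields garbage text. The file paths
// below are hypothetical; only the selection logic is the point.
fn load_tokenizer(multilingual: bool) -> tokenizers::Result<Tokenizer> {
    let path = if multilingual {
        "multilingual/tokenizer.json" // hypothetical path
    } else {
        "gpt2/tokenizer.json" // hypothetical path
    };
    Tokenizer::from_file(path)
}
```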

It should work now. I tested your model on the English sample audio and it now outputs what looks like pinyin without accents. Perhaps you fine-tuned it so strongly that it can no longer perform English transcription?

Oh great, I will give it a try and see.

I can verify that the decoder is working! Thanks a lot for your work! 🙏
There are two remaining issues:

  1. Is there a way to adjust the decoding strategy, e.g. beam search vs. greedy, beam size, etc.? For one test audio, I saw some repetition at the end that might have to do with the decoding strategy:
nei tai keoi gin dou leng zai zau wan sai long $$$ audio ends here, repetition starts here $$$ nei tai keoi gin dou leng zai zau wan sai long nei
  2. The inference is quite slow and uses only a single CPU core on macOS, even though I specified the wgpu-backend feature. I wonder if there's any setup I missed with regard to Metal support? This is the command I ran:
cargo run --release --features wgpu-backend --bin transcribe small test_yue2.wav en transcription.txt

PS: I'm running macOS Ventura on an M1 Max.
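For reference, this is roughly how I understand burn's wgpu backend gets selected. A sketch based on the burn-wgpu crate around burn 0.8 (an assumption on my part, not whisper-burn's exact code):

```rust
use burn_wgpu::{AutoGraphicsApi, WgpuBackend, WgpuDevice};

// BestAvailable should pick the Metal adapter on an M1 Mac, so
// single-core CPU behavior suggests a fallback adapter is being
// chosen instead. Type parameters follow the 0.x-era burn-wgpu API.
type Backend = WgpuBackend<AutoGraphicsApi, f32, i32>;

fn main() {
    let device = WgpuDevice::BestAvailable;
    // ... load the model onto `device` and run transcription ...
    let _ = device;
}
```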

I reactivated repeat detection. Let me know if you still encounter repetitions. I'll implement caching later, which should improve performance.
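For anyone curious, the core idea behind this kind of repeat detection is simple. A simplified sketch, not the exact code in the repo:

```rust
// Detect token-level repetition during decoding: report true when the
// last `n` generated tokens already occur earlier in the sequence, so
// the decoding loop can stop or penalize the hypothesis.
fn repeats_ngram(tokens: &[u32], n: usize) -> bool {
    if tokens.len() < 2 * n {
        return false;
    }
    let tail = &tokens[tokens.len() - n..];
    tokens[..tokens.len() - n].windows(n).any(|w| w == tail)
}
```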

Thanks for the explanation. I tried again using the latest commit, but the repetition issue unfortunately persisted. I think it's because the current repetition filter does not apply to whole sentences like "nei tai keoi gin dou leng zai zau wan sai long". However, a maximum generation length argument might prevent the model from repeating at the end and also save some decoding time.
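To illustrate what I mean, here's a sketch of the cap, with `decode_step` as a hypothetical stand-in for one decoder forward pass (224 is roughly half of Whisper's 448-token text context, which is the budget OpenAI's reference decoder uses per 30 s window):

```rust
const MAX_NEW_TOKENS: usize = 224;

// Hypothetical stand-in for one decoder forward pass + token selection.
fn decode_step(_tokens: &[u32]) -> u32 {
    0
}

// Bound the number of generated tokens so runaway repetition stops
// early and decoding time is saved, even if end-of-text never appears.
fn generate(mut tokens: Vec<u32>, eot: u32) -> Vec<u32> {
    for _ in 0..MAX_NEW_TOKENS {
        let next = decode_step(&tokens);
        tokens.push(next);
        if next == eot {
            break;
        }
    }
    tokens
}
```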

I tested more and discovered a "trick" that reliably causes repetition: lengthen the last word. Here's a link to three audio clips that caused repetition on my model: https://on.soundcloud.com/2Qw3J The number of repetitions seems to depend on the length of the last word, and maybe on the overall sentence length.

I added a beam(ish) search and the transcription quality seems to have improved significantly. If you find the time to test it, let me know how it goes.
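In essence it's the standard beam expansion. A heavily simplified sketch (the actual "beam(ish)" search in the repo may differ):

```rust
// Each hypothesis is a token prefix plus its cumulative log-probability.
// One step expands every hypothesis over the whole vocabulary and keeps
// the `beam_width` best continuations. `log_probs` is a hypothetical
// model call returning one log-probability per vocabulary token.
fn beam_step(
    beams: Vec<(Vec<usize>, f32)>,
    beam_width: usize,
    log_probs: impl Fn(&[usize]) -> Vec<f32>,
) -> Vec<(Vec<usize>, f32)> {
    let mut candidates = Vec::new();
    for (prefix, score) in beams {
        for (tok, lp) in log_probs(&prefix).into_iter().enumerate() {
            let mut next = prefix.clone();
            next.push(tok);
            candidates.push((next, score + lp));
        }
    }
    // Highest score first; assumes no NaNs in the log-probabilities.
    candidates.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    candidates.truncate(beam_width);
    candidates
}
```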

I tried the latest beam search, but unfortunately the repetition issues persisted on the three SoundCloud audio samples.

@Gadersd I think you would have to implement something like the LocalAgreement-n policy to get rid of the repetitions, as done here: https://github.com/ufal/whisper_streaming
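Roughly, the policy decodes a growing audio buffer repeatedly and only commits tokens on which consecutive passes agree, so unstable trailing tokens (where the repetitions appear) are never emitted. A minimal sketch of the n = 2 case; see the linked repo for the real streaming policy:

```rust
// Commit only the longest common prefix of the previous and current
// hypotheses; everything past the agreement point stays uncommitted
// until a later pass confirms it.
fn agreed_prefix<'a>(prev: &'a [u32], curr: &[u32]) -> &'a [u32] {
    let n = prev.iter().zip(curr).take_while(|(a, b)| a == b).count();
    &prev[..n]
}
```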