FL33TW00D/whisper-turbo

[BUG] medium model will not output complete results

GuuuWei opened this issue · 15 comments

Describe the bug

  • The medium model will not output complete results.
    • The larger the model, the higher the probability of the bug.
    • It occasionally happens with the small model as well.
    • It happens almost 99% of the time when I use the medium model on audio longer than one minute.
    • It occurs with audio in most languages.
    • I suspect there is some kind of timeout or leak.

To Reproduce
Steps to reproduce the behavior:

  1. Select the medium model
  2. Select test.mp3
  3. Setting the language to 'zh' may reduce the chance of problems.
  4. The console only shows the processing but never returns JSON.

Desktop (please complete the following information):

  • OS: Windows
  • Browser: Chrome 119
  • Also tried compiling locally

test-audio.zip

@from-gu-wei

Thanks for the good bug report; this is most likely a Windows issue.

Will report back.

@from-gu-wei
This is a symptom of the quantization; the network runs fine in FP32.

Will have to investigate deeper - apologies.

@from-gu-wei

The problem here is:

  • The quantized model gets caught in loops much more easily.
  • The EOT token is never generated, so the segment is never sent back to the user.

Not sure there is more I can do here; the quantized model performs the best it can on lower-resource languages.
large-v3 improves multilingual performance significantly and should be more resistant to quantization.

Please provide more sample audio if you think my reasoning is incorrect here.
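
For illustration, here is a minimal TypeScript sketch of the kind of guard that could force a segment to be emitted even when EOT never appears. The token id, limits, and names are assumptions for the example, not whisper-turbo's actual internals:

// Illustrative loop guard: if the last N tokens exactly repeat the N before
// them, or the segment exceeds a hard cap, stop decoding and emit the segment.
const EOT_TOKEN = 50257;             // assumed end-of-transcript id (multilingual Whisper)
const MAX_TOKENS_PER_SEGMENT = 224;  // assumed hard cap on text tokens per segment
const REPEAT_WINDOW = 8;             // n-gram length compared for repetition

function isStuckInLoop(tokens: number[]): boolean {
  if (tokens.length < REPEAT_WINDOW * 2) return false;
  const tail = tokens.slice(-REPEAT_WINDOW).join(",");
  const prev = tokens.slice(-REPEAT_WINDOW * 2, -REPEAT_WINDOW).join(",");
  return tail === prev; // the last window exactly repeats the previous one
}

function shouldTerminate(tokens: number[], next: number): boolean {
  return next === EOT_TOKEN
    || tokens.length >= MAX_TOKENS_PER_SEGMENT
    || isStuckInLoop(tokens);
}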

@FL33TW00D

I don't have enough knowledge to give effective advice.
In fact, I am a designer and only have a little basic development knowledge.
I can provide a lot of audio:
test-audio-2.zip

The ones below are too big, so you will need to monitor the network requests to download them:
#173 雑談回 旅行で感じた住みやすい環境を話したら対照的だった
Full Cycle Developers at Netflix, 미국 연봉 정보 levels.fyi


  • Setting the language tag may reduce the chance of problems.


@from-gu-wei
Thank you for providing more audio and the detailed reports!

Whisper is pretty bad at automatically detecting the language; I would recommend always setting it.

Additionally, I would recommend providing the audio in WAV format only. You can convert it with the following command:

ffmpeg -i yourmp3.mp3 -acodec pcm_s16le -ar 16000 -ac 1 yourwav.wav
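
If it helps to script the conversion, here is a small Node/TypeScript sketch that shells out to ffmpeg with the same flags; it assumes ffmpeg is installed and on PATH:

import { execFileSync } from "node:child_process";

// Convert any input audio to the 16 kHz, mono, 16-bit PCM WAV that Whisper expects.
function toWhisperWav(inputPath: string, outputPath: string): void {
  execFileSync("ffmpeg", [
    "-y",                   // overwrite the output if it already exists
    "-i", inputPath,        // source audio (mp3, m4a, ...)
    "-acodec", "pcm_s16le", // 16-bit PCM
    "-ar", "16000",         // 16 kHz sample rate
    "-ac", "1",             // mono
    outputPath,
  ]);
}

toWhisperWav("yourmp3.mp3", "yourwav.wav");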

Please let me know if converting to WAV helps.

@FL33TW00D

No, converting to WAV does not help.


@FL33TW00D
I tested it on a Mac M1, and it just keeps outputting "!".


@from-gu-wei,

Thanks for bearing with me.

I've compared my implementation against the OAI implementation using the audio you provided:
Full Cycle Developers at Netflix, 미국 연봉 정보 levels.fyi

It seems the smaller models are simply bad at lower-resource languages.
The transcript is terminated due to the model getting caught in a bad loop.

I can improve the handling of this, but unfortunately the quality of zh transcription will remain poor until I ship large-v3, which should handle things much better.

Apologies for the inconvenience, I will keep the issue open as I add the 1 or 2 mitigations I can.

PS:
Can you please provide the sample that repeatedly outputs "!"? This seems like a failure on my end.

@FL33TW00D

Thanks for all your work; once it's polished it will change the ecosystem of many products. I also want to try to develop a small product based on it.


For me personally:

  • As long as it gives stable results, it doesn't matter if the quality is not good.
    • By the way, is there any way to introduce VAD to solve the hallucination problem?
  • And large-v3 may improve quality, but it is too heavy for out-of-the-box web use.
  • In fact, I'm considering using the small model, if it can be made even smaller.

PS: About the "!" issue, what information do I need to provide? It can be reproduced reliably on my Mac.
The previous case used the audio file I provided earlier.

@from-gu-wei

Stable results should be possible; the model will need to reject samples using compression_ratio_threshold, which needs to be added to DecodingOptions: https://www.ratchet.sh/whisper-turbo#decoding-options
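
Roughly, the check works the way OpenAI's reference Whisper does it: compress the decoded text and compare raw and compressed byte lengths, since looping output compresses extremely well. A minimal sketch of that idea (not whisper-turbo's actual DecodingOptions API; gzip stands in for zlib here):

import { gzipSync } from "node:zlib";

// Highly repetitive (looping) text compresses very well, so a high
// raw-bytes / compressed-bytes ratio flags a likely bad decode.
function compressionRatio(text: string): number {
  const bytes = Buffer.from(text, "utf-8");
  return bytes.length / gzipSync(bytes).length;
}

// OpenAI's reference implementation rejects segments above a threshold of 2.4.
function isLikelyLoop(text: string, threshold = 2.4): boolean {
  return compressionRatio(text) > threshold;
}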

For the "!" case, please provide the audio file (it can be trimmed to a shorter length if it produces "!" from the start).

Thanks again,
Chris

@FL33TW00D

Well, this is some issue with the Arc browser.
It works fine after I switch to Chrome on my Mac.


What I mean is that any audio, including the Japanese and Korean podcasts above, would output "!".

@from-gu-wei OK, interesting! I haven't tested with Arc!

@from-gu-wei OK to close? I will put the mitigations on the roadmap!

@FL33TW00D

Of course, no problem; thanks for your work.

If I want to ask something about this project, should I still raise an issue? This is my first time participating in an open-source project, and there are many things I don't understand.

@from-gu-wei

Yes please raise anything and everything! The project is still young.