absadiki/pywhispercpp

Tool is super slow / runs forever

Opened this issue · 10 comments

I'm trying to transcribe a 45s mp3 containing the audio of a YouTube Short.
I'm doing it like this:

from pywhispercpp.model import Model
model = Model('base.en', print_realtime=False, print_progress=True, n_threads=6)
segments = model.transcribe(short_audio_file, speed_up=True, new_segment_callback=print)

It never finishes, and the output below is all I get. After that it just keeps running with no visible progress, with the CPU at 100%:

[2024-01-09 23:28:50,941] {utils.py:38} INFO - No download directory was provided, models will be downloaded to /home/marius/.local/share/pywhispercpp/models
[2024-01-09 23:28:50,943] {utils.py:46} INFO - Model base.en already exists in /home/marius/.local/share/pywhispercpp/models
[2024-01-09 23:28:50,944] {model.py:221} INFO - Initializing the model ...
whisper_init_from_file_no_state: loading model from '/home/marius/.local/share/pywhispercpp/models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: type          = 2
whisper_model_load: mem required  =  310.00 MB (+    6.00 MB per decoder)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =  140.60 MB
whisper_model_load: model size    =  140.54 MB
whisper_init_state: kv self size  =    5.25 MB
whisper_init_state: kv cross size =   17.58 MB
[2024-01-09 23:28:52,186] {model.py:130} INFO - Transcribing ...

Any ideas what could be wrong or how to improve the speed? Thanks for any help, I appreciate it. This is the most promising of the Python bindings for whisper.cpp, as the others don't even build anymore...

It seems the model loaded successfully, so it's odd that it runs forever!
Does the short_audio_file variable hold the path to your mp3 file as a str? Have you tried other files, and do you always run into the same issue?
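
The check asked about above could be sketched as follows. This is only a minimal illustration, not part of pywhispercpp: the helper name check_audio_path is hypothetical, and it only verifies that the value is a str pointing to a non-empty file, not that the audio itself is valid.

```python
from pathlib import Path


def check_audio_path(path):
    """Return a list of problems found with the audio path (empty if none).

    Hypothetical helper for illustration; it is not part of pywhispercpp.
    """
    problems = []
    # The value passed to transcribe() in the snippet above should be a str.
    if not isinstance(path, str):
        problems.append(f"expected str, got {type(path).__name__}")
        return problems
    p = Path(path)
    if not p.is_file():
        problems.append(f"file not found: {path}")
    elif p.stat().st_size == 0:
        problems.append(f"file is empty: {path}")
    return problems
```

Calling check_audio_path(short_audio_file) before transcribing and printing the returned list would rule out a wrong type or a missing file as the cause.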

Hi, thanks for your reply. Yes, it's a str that holds the path. I haven't tried another file, but honestly it's a pretty basic mp3 of just spoken text with no additional sounds.

You can try it yourself, I attached the file (zipped so that GitHub accepts the upload). It's not from my own video, it's a random Short from YouTube, so enjoy some Dragonball content.
input_short.zip

Hi @03l54rd1n3,
Thanks for providing the file. It took less than 4s on my machine to generate the results:

{model.py:133} INFO - Inference time: 3.481 s
[t0=0, t1=242, text=Why does Vegeta always hold his left arm?, t0=242, t1=528, text=Vegeta has multiple poses that are very distinctive of him,, t0=528, t1=778, text=for instance the infamous self-pointing thumb., t0=778, t1=1194, text=A very different one however, is that in which he holds his left arm in pain., t0=1194, t1=1604, text=Vegeta has gone through a lot of different battles and has sustained a crazy amount of injuries., t0=1604, t1=2158, text=But for some reason most of the time he always ends up holding his left arm as if he had some sort of chronic pain., t0=2158, t1=2504, text=A lot of people thought back then that this was because of Andrew at 18,, t0=2504, t1=2690, text=who really did a number on his left arm., t0=2690, t1=2784, text=Nevertheless,, t0=2784, t1=3168, text=it is possible to see Vegeta holding his left arm already in the namics saga., t0=3168, t1=3578, text=This implies that if Vegeta really does have some sort of chronic injury in his left arm,, t0=3578, t1=3800, text=then it must be previous to the android saga., t0=3800, t1=4268, text=Also, this is an injury that no sends a beans or dragon ball resurrection has been able to heal,, t0=4268, t1=4564, text=so whatever it is, it must be deeply rooted within his body.]

So there is something wrong with your installation.

Do you have ffmpeg installed?

Hi, thank you for your reply. Yes, it's installed through apt, and I installed your tool through pip. I wonder what it is then... I'll check the whisper.cpp requirements as well...

Yes, try to compile and run whisper.cpp first and let me know if that works.

OK, your tool works fine from the CLI (pwcpp). The original whisper.cpp also works. It seems the unexpected behavior only occurs in Python (script file or notebook). Any idea why it only happens there?

Correction: it happens in Python when using the n_threads argument. Without it, it works. The tool seems to deadlock. I'm on Linux, if that is relevant for you.

I only use Linux as well, and this has never happened to me.
But how many threads does your CPU support?

Good question. I have 4 cores. It's a 7th-gen Intel i7, not the best, but with 16GB of RAM the laptop still manages most tasks pretty well.

I just tried a couple more times. In the Python script it now works with n_threads set to 2 or 4. In the notebook, with it set to 1 or 2, I sometimes get to the transcribing step but get no results. Sometimes it locks up before that, at the kv cross size line.

Yeah, that's good, but obviously you cannot go beyond your resources, so n_threads should not exceed 4 (which is the default, by the way).
So as long as it runs in a script, everything is fine. You will have to check your Jupyter notebook environment. I have also just re-checked in Colab notebooks, and it works without any problem.
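
The advice above (do not request more threads than the machine has) could be sketched like this. It is a minimal illustration under stated assumptions: clamp_threads is a hypothetical helper, not part of pywhispercpp, and os.cpu_count() reports logical CPUs, which may be higher than the number of physical cores.

```python
import os


def clamp_threads(requested, available=None):
    """Limit a requested thread count to what the machine reports.

    Hypothetical helper for illustration; it is not part of pywhispercpp.
    """
    if available is None:
        # os.cpu_count() returns the number of logical CPUs, or None if unknown.
        available = os.cpu_count() or 1
    # Never go below 1 thread or above the available count.
    return max(1, min(requested, available))


n_threads = clamp_threads(6)
print(f"using n_threads={n_threads}")
```

The result could then be passed on as Model('base.en', n_threads=n_threads), as in the snippet from the original report.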