shirayu/whispering

I am unable to get it running on my machine (CPU)

ninjalu opened this issue · 10 comments

Description

I installed whispering and followed the instructions; however, I am not able to get any output. All I get is "No speech", which is clearly not right.

Logs (Optional)

[2022-11-04 15:23:27,443] vad.__call__:56 DEBUG -> VAD: 0.010574953630566597 (threshold=0.5)
[2022-11-04 15:23:27,443] transcriber.transcribe:248 DEBUG -> No speech
[2022-11-04 15:23:27,443] transcriber.transcribe:258 DEBUG -> nosoeech_skip_count: None (<= 16)
[2022-11-04 15:23:27,443] cli.transcribe_from_mic:67 DEBUG -> Audio #: 7, The rest of queue: 0
[2022-11-04 15:23:31,274] cli.transcribe_from_mic:82 DEBUG -> Got. The rest of queue: 0
Analyzing[2022-11-04 15:23:31,275] transcriber.transcribe:235 DEBUG -> 60000
[2022-11-04 15:23:31,310] vad.__call__:56 DEBUG -> VAD: 0.010565487667918205 (threshold=0.5)
[2022-11-04 15:23:31,310] transcriber.transcribe:248 DEBUG -> No speech
[2022-11-04 15:23:31,310] transcriber.transcribe:258 DEBUG -> nosoeech_skip_count: None (<= 16)
[2022-11-04 15:23:31,310] cli.transcribe_from_mic:67 DEBUG -> Audio #: 8, The rest of queue: 0
[2022-11-04 15:23:34,948] cli.transcribe_from_mic:82 DEBUG -> Got. The rest of queue: 0
Analyzing[2022-11-04 15:23:34,948] transcriber.transcribe:235 DEBUG -> 60000
[2022-11-04 15:23:34,979] vad.__call__:56 DEBUG -> VAD: 0.010574160143733025 (threshold=0.5)
[2022-11-04 15:23:34,979] transcriber.transcribe:248 DEBUG -> No speech
[2022-11-04 15:23:34,979] transcriber.transcribe:258 DEBUG -> nosoeech_skip_count: None (<= 16)
[2022-11-04 15:23:34,979] cli.transcribe_from_mic:67 DEBUG -> Audio #: 9, The rest of queue: 0

Environment

  • OS: macOS (Mac M1)
  • Python Version: 3.9
  • Whispering version: 0.6.3

"No speech" is the output of the VAD.
How about disabling the VAD by setting --vad 0?
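For what it's worth, the decision shown in the logs is just a threshold check on the VAD score. A minimal sketch (the function name is illustrative, not whispering's actual code, which lives in vad.__call__):

```python
def has_speech(vad_probability: float, threshold: float = 0.5) -> bool:
    """Return True when the voice-activity score clears the threshold."""
    return vad_probability >= threshold

# The logs show scores around 0.0105 against threshold=0.5,
# so every chunk is classified as "No speech".
print(has_speech(0.010574953630566597))  # False
```

A score that low on every chunk usually means the microphone signal is not reaching the process at all, which is why disabling the VAD is a useful first test.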

This is what I get with
whispering --language en --model small --debug --vad 0

I'm not sure exactly what to expect, but I assume it should be some transcription of what I say into the mic. Instead, I only get repeated logs like the ones below.

Analyzing[2022-11-07 10:55:09,996] transcriber.transcribe:235 DEBUG -> 60000
[2022-11-07 10:55:09,998] transcriber.transcribe:266 DEBUG -> Incoming new_mel.shape: torch.Size([80, 375])
[2022-11-07 10:55:09,998] transcriber.transcribe:270 DEBUG -> buffer_mel.shape: torch.Size([80, 2250])
[2022-11-07 10:55:09,998] transcriber.transcribe:273 DEBUG -> mel.shape: torch.Size([80, 2625])
[2022-11-07 10:55:09,998] transcriber.transcribe:277 DEBUG -> seek: 0
[2022-11-07 10:55:09,998] transcriber.transcribe:282 DEBUG -> mel.shape (2625) - seek (0) < N_FRAMES (3000)
[2022-11-07 10:55:09,999] transcriber.transcribe:288 DEBUG -> No padding
[2022-11-07 10:55:09,999] transcriber.transcribe:345 DEBUG -> ctx.buffer_mel.shape: torch.Size([80, 2625])
[2022-11-07 10:55:09,999] cli.transcribe_from_mic:67 DEBUG -> Audio #: 7, The rest of queue: 0
[2022-11-07 10:55:13,824] cli.transcribe_from_mic:82 DEBUG -> Got. The rest of queue: 0

How long have you waited?
By default, it needs to wait at least 30 seconds.

https://github.com/shirayu/whispering#parse-interval

By default, Whisper does not perform analysis until the total length of the segments determined by VAD to have speech exceeds 30 seconds. However, if silence segments appear 16 times (the default value of --max_nospeech_skip) after speech is detected, the analysis is performed.
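Roughly, the buffering rule described above can be sketched as follows (class and field names are illustrative, not whispering's actual code): chunks accumulate until the speech segments total 30 seconds, or until more than 16 consecutive no-speech chunks arrive after speech has been detected.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SegmentBuffer:
    min_speech_seconds: float = 30.0   # total speech needed before analysis
    max_nospeech_skip: int = 16        # default of --max_nospeech_skip
    speech: List[float] = field(default_factory=list)  # segment lengths in s
    nospeech_count: int = 0

    def push(self, seconds: float, is_speech: bool) -> bool:
        """Add one chunk; return True when analysis should run."""
        if is_speech:
            self.speech.append(seconds)
            self.nospeech_count = 0
        elif self.speech:
            # Silence only counts once speech has been detected.
            self.nospeech_count += 1
        if sum(self.speech) >= self.min_speech_seconds:
            return True
        return self.nospeech_count > self.max_nospeech_skip
```

This matches the "nosoeech_skip_count: None (<= 16)" lines in the logs: with no speech ever detected, the counter never starts and the analysis never fires.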

Thanks! I got it running now. However, I noticed the transcription gets repeated (corrected?) for 4-5 timestamp intervals before moving on to the next chunk. Is that expected? Is there a way to allow only one output instead of all the different versions?

136.98->139.06	 long you will discover in fact that it's
139.06->141.60	 not possible because before long you
141.60->143.28	 will discover in fact that it there's not
143.28->145.52	 possible. Because before long you will
145.52->146.98	 discover in fact that that there's not
146.98->149.00	 possible. Because before long you
149.00->150.64	 will discover in fact that that there's
150.64->152.82	 not possible. Because before long you
152.82->154.38	 will discover in fact that it there's
154.38->156.36	 not possible. Because before long you
156.36->158.14	 will discover in fact that it there's
158.14->160.20	 not possible. Because before long you
160.20->166.20	 you will discover it is very well possible that
166.20->171.20	 then it is very much stuff and we're not just going to know we are going to release stuff
171.20->175.20	 And we are not just going to know we are going to release stuff
175.20->179.20	 and we are not just going to know we are going to release stuff
179.20->182.20	 And we are not going to know we are going to release stuff
182.20->186.20	 and we are not just going to know we are going to release stuff
186.20->190.20	 and we are not just going to know we are going to release stuff
190.20->193.70	 We are not just going to know we are going to release stuff
193.70->197.30	 Now you can say Good Fear, openly I didn't know at the moment.

Thanks!

Does the original Whisper work with this audio?
If not, it might be related to the repetition problem that is reported here.

Thanks! That does explain some of the repetition I observed!
Another question regarding your earlier comment about 30 seconds.

By default, Whisper does not perform analysis until the total length of the segments determined by VAD to have speech exceeds 30 seconds.

Is there any way I could reduce the 30-second rule (maybe to just 10 seconds) so it performs more like streaming?

Many thanks!

Yes, you can.
I added the --frame option in whispering v0.6.4.

The default value is 3000 (i.e., 30 seconds), and you can make it smaller.
However, it will sacrifice accuracy, because shorter input is not what Whisper expects.
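For reference, the frame-to-seconds conversion follows from Whisper's log-mel front end: 16 kHz audio with a hop length of 160 samples gives 100 mel frames per second, so a frame count divided by 100 is the duration in seconds. A quick sketch:

```python
SAMPLE_RATE = 16_000  # Hz; Whisper's fixed input sample rate
HOP_LENGTH = 160      # samples per log-mel frame

def frames_to_seconds(n_frames: int) -> float:
    """Convert a --frame value to seconds of audio."""
    return n_frames * HOP_LENGTH / SAMPLE_RATE

print(frames_to_seconds(3000))  # 30.0 (the default)
print(frames_to_seconds(1000))  # 10.0
```

So `--frame 1000` would correspond to the 10-second window asked about above.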

This issue is stale because it has been open for 21 days with no activity.

What was the command that you got running on the m1 mac?
