shirayu/whispering

VAD (Voice Activity Detection) to reduce repeated output during silent periods and calls to `transcribe`.

shirayu opened this issue · 10 comments

Voice Activity Detection can reduce the number of calls to `transcribe`.
The VAD should be lightweight.

See also openai/whisper#29 (comment)

@shirayu In which part of Whisper could a third-party VAD be included? In this block, right? https://github.com/openai/whisper/blob/main/whisper/transcribe.py#L166

@fcakyon I'm not very sure, but I think it is better to run the VAD before the calculation of `log_mel_spectrogram`.
https://github.com/openai/whisper/blob/main/whisper/transcribe.py#L82
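
For reference, a minimal sketch of that idea — `is_speech` is a placeholder for any lightweight VAD callable, not an existing whispering function, and this is not the actual whispering code:

```python
# Sketch: gate the raw waveform with a VAD before Whisper ever computes the
# log-Mel spectrogram or runs transcribe.
import numpy as np
import whisper
from whisper.audio import SAMPLE_RATE


def transcribe_if_speech(model, audio: np.ndarray, is_speech) -> str:
    """Run a lightweight VAD gate first; skip Whisper entirely for silent blocks."""
    if not is_speech(audio, SAMPLE_RATE):   # `is_speech` is any VAD callable (assumed)
        return ""                           # silent block: no log_mel_spectrogram, no transcribe
    return model.transcribe(audio)["text"]  # voiced block: normal Whisper path
```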

Perhaps inspiration could be taken from the mic_vad_streaming.py DeepSpeech example?

https://github.com/mozilla/DeepSpeech-examples/blob/r0.9/mic_vad_streaming/mic_vad_streaming.py

Thank you for the information.
The example uses the webrtcvad package, but webrtcvad-wheels seems to have been maintained more recently.

https://github.com/daanzu/py-webrtcvad-wheels
https://pypi.org/project/webrtcvad-wheels/
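
For anyone trying it out, a minimal sketch of how webrtcvad is typically used (install `webrtcvad-wheels`; the import name stays `webrtcvad`) — the `voiced_ratio` helper here is just illustrative:

```python
# webrtcvad classifies 10/20/30 ms frames of 16-bit mono PCM at 8/16/32/48 kHz.
import webrtcvad

vad = webrtcvad.Vad(2)                            # aggressiveness 0 (least) .. 3 (most)
SAMPLE_RATE = 16000
FRAME_MS = 30
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit samples -> 2 bytes each


def voiced_ratio(pcm: bytes) -> float:
    """Fraction of 30 ms frames the VAD labels as speech; 0.0 means pure silence."""
    frames = [pcm[i:i + FRAME_BYTES]
              for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES)]
    if not frames:
        return 0.0
    return sum(vad.is_speech(f, SAMPLE_RATE) for f in frames) / len(frames)
```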

I found another issue with the current Whispering/Whisper: sometimes the AI gets stuck repeating the last sentence after a silence, even when proper speech audio follows. I have seen this happen after 10-15 minutes of normal speech, then 30 seconds of silence/whispering/mumbling, and then a return to easy-to-decode audio. I hadn't seen this before, but I am experimenting with some of the options. I hope there will eventually be a way to print a blank line, or to restart the AI if the same line is repeated more than 2-3 times, to avoid this typical language-model looping lock.

I had to restart the AI, but it went back to transcribing fairly well afterward with '--n 80, --add-padding' as my options (RTX 3080 Ti). Otherwise there were only a few errors in a 40-minute live transcription session with multiple people, accents, volumes, and the speedup parameter.

It would definitely be nice to have continuous results appended as they come, and maybe a "final" during periods of silence. This is what Coqui STT does.

This is probably one of the fastest I have used personally, but it is Node.js. They have Python examples as well.
https://github.com/coqui-ai/STT-examples/blob/r1.0/nodejs_mic_vad_streaming/start.js

Basically, just allow the VAD to decide based on a configurable silence duration.
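
Something like this, where the option names and the 0.8 s default are purely hypothetical, not existing whispering options:

```python
# Sketch of the suggested knob: an utterance is considered finished once the VAD
# has reported silence for longer than a user-configurable duration.
from dataclasses import dataclass


@dataclass
class VadConfig:
    silence_duration: float = 0.8   # seconds of continuous silence that ends an utterance
    frame_duration: float = 0.03    # seconds per VAD frame


def utterance_finished(recent_is_speech: list[bool], cfg: VadConfig) -> bool:
    """True when the trailing run of non-speech frames exceeds cfg.silence_duration."""
    silent_frames = 0
    for speech in reversed(recent_is_speech):
        if speech:
            break
        silent_frames += 1
    return silent_frames * cfg.frame_duration >= cfg.silence_duration
```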

I will use https://github.com/snakers4/silero-vad as mentioned here.
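
Roughly following the silero-vad README (the exact API may differ between versions):

```python
# Load the model via torch.hub and get speech timestamps for a 16 kHz clip.
import torch

model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

wav = read_audio('example.wav', sampling_rate=16000)
# List of {'start': sample, 'end': sample} dicts; an empty list means the clip is all silence.
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)
```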

Currently (9fb7896), if an entire audio section is silent, the section will not be passed to Whisper.
To disable this behavior, the --no-vad option has been added.

A further improvement would be to remove the silent parts from all audio sections and pass only the voiced parts to Whisper.
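
A hypothetical sketch of that further improvement, building on the silero-vad helpers above (the function itself is illustrative, not part of whispering):

```python
# Keep only the voiced spans reported by the VAD and concatenate them before
# handing the audio to Whisper.
from typing import Optional

import torch


def keep_only_speech(wav: torch.Tensor, model, get_speech_timestamps) -> Optional[torch.Tensor]:
    spans = get_speech_timestamps(wav, model, sampling_rate=16000)
    if not spans:
        return None                           # entire section silent: skip Whisper (current behavior)
    parts = [wav[s['start']:s['end']] for s in spans]
    return torch.cat(parts)                   # proposed behavior: silence stripped, only speech remains
```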

I made the default value of -n 20 (c435d50).
The default value of --allow-padding is False (Whisper analyzes every 30 seconds).

This means whispering just performs VAD on each period (-n 20 = 3.75 sec), and if a period is predicted to be silence, it is removed from the audio segments passed to Whisper.
If you want to change the VAD interval, simply change -n.
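
A quick back-of-the-envelope check of that interval; the 0.1875 s per-block length is inferred from "-n 20 = 3.75 sec" above, not read from the whispering source:

```python
BLOCK_SECONDS = 3.75 / 20          # 0.1875 s per block (assumption inferred from this thread)


def vad_interval_seconds(n: int) -> float:
    """Seconds of audio covered by one VAD decision for a given -n."""
    return n * BLOCK_SECONDS


assert vad_interval_seconds(20) == 3.75   # matches the default mentioned above
print(vad_interval_seconds(80))           # 15.0 s with the -n 80 used earlier in the thread
```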

I think this is a decent start for a real ASR server, but since it isn't buffering and building on the audio as the person is speaking, it could never be used for real-time decoding. Am I understanding you correctly?

Otherwise it will only have a fixed -n, whereas this value should simply be driven by the VAD; with a continuous buffer you could decode "partial" results and then build a "final" result based on the time between speech and silence, if that makes sense.

So for example:
user starts speaking
-> VAD activates a buffer to save complete audio data frames
-> whispering produces continuous results; these would be "partial" results. They would need to be asynchronous, in their own thread, so as not to cause slowdowns.
-> user stops speaking, which yields a "final" result with the full transcript. The final transcription would also need to be asynchronous so it doesn't slow things down, because the user could start talking again.

The goal here would be to have it be as real-time as possible (a rough sketch of this flow is below).
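
Everything in this sketch (the queue contents, the `transcribe_chunk` coroutine, the 0.8 s silence threshold) is hypothetical and not part of whispering's API; it only illustrates the partial/final flow described above:

```python
import asyncio

import numpy as np


async def stream_results(frames: asyncio.Queue, transcribe_chunk, silence_sec: float = 0.8):
    """Consume VAD-tagged audio frames and emit partial/final transcriptions asynchronously."""
    buffer, silent = [], 0.0
    while True:
        frame, is_speech, frame_sec = await frames.get()   # produced by a VAD-tagged capture loop
        if is_speech:
            buffer.append(frame)
            silent = 0.0
            # "partial": decode the growing buffer without blocking capture
            asyncio.create_task(transcribe_chunk(np.concatenate(buffer), final=False))
        elif buffer:
            silent += frame_sec
            if silent >= silence_sec:
                # "final": the user stopped talking; flush the whole utterance
                asyncio.create_task(transcribe_chunk(np.concatenate(buffer), final=True))
                buffer, silent = [], 0.0
```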

I did some tests here and it looks like it requires an int value. Since it requires an int, if someone spoke for 0.5 seconds it would still take 1 second to return a result. That seems like a broken way to do it.

The original Whisper is designed to analyze every 30 seconds of audio (including silent periods).
This VAD just drops silent periods to reduce calls to transcribe (and fix #4).
It is not for real-time decoding.

Of course, frequent return of partial analyses is useful.
That will be resolved by #8.