tazz4843/whisper-rs

No VAD (Voice Activity Detection)

SeanEClarke opened this issue · 9 comments

To improve performance/accuracy it is useful to perform some VAD on the incoming audio and only pass speech segments to be processed, i.e. remove periods of silence, which are known to cause hallucinations etc.

Not sure what the best approach would be: use whisper-cpp's basic implementation, or pull in WebRTC and use the VAD from there (webrtc-vad).

I have done something similar in my own project using https://crates.io/crates/nnnoiseless.

I did some digging into this for myself and came across nnnoiseless as well. I can see the utility of having it in whisper-rs to avoid dealing with it yourself, if that's something you'd be interested in I can look into it.
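For what it's worth, here is a minimal sketch of what that could look like with nnnoiseless as a frame-level VAD (going from memory of the nnnoiseless API: process_frame returns the model's speech probability for each DenoiseState::FRAME_SIZE frame, and the input is expected as 48 kHz mono f32 in the i16 value range; the threshold is just illustrative):

use nnnoiseless::DenoiseState;

/// Rough frame-level VAD: returns true if any frame's estimated speech
/// probability exceeds `threshold`. `samples` is 48 kHz mono f32 audio,
/// scaled to the i16 range as nnnoiseless expects.
fn contains_speech(samples: &[f32], threshold: f32) -> bool {
    let mut denoise = DenoiseState::new();
    let mut out = [0.0f32; DenoiseState::FRAME_SIZE];
    for frame in samples.chunks_exact(DenoiseState::FRAME_SIZE) {
        // process_frame denoises the frame into `out` and returns the
        // voice-activity probability for that frame.
        let vad_prob = denoise.process_frame(&mut out, frame);
        if vad_prob > threshold {
            return true;
        }
    }
    false
}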

I think the integration will be key. Just putting a VAD on the front will achieve a certain amount, but a well-integrated VAD would take the VAD's workings into account, e.g. the timestamps would remain correct relative to the original audio even though portions were removed by the VAD.

Does that make sense?
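To make the idea concrete, here is a rough sketch (names and structure are illustrative, not anything in whisper-rs) of how timestamps reported against the trimmed audio could be mapped back onto the original timeline:

/// A kept (non-silent) region: where it starts in the original audio and
/// where it ends up in the concatenated, VAD-trimmed audio (all in ms).
struct KeptRegion {
    original_start_ms: i64,
    trimmed_start_ms: i64,
    len_ms: i64,
}

/// Map a timestamp reported against the trimmed audio back onto the
/// original audio's timeline. `regions` must be sorted by trimmed_start_ms.
fn to_original_time(regions: &[KeptRegion], trimmed_ms: i64) -> Option<i64> {
    regions
        .iter()
        .rev()
        .find(|r| trimmed_ms >= r.trimmed_start_ms)
        .map(|r| r.original_start_ms + (trimmed_ms - r.trimmed_start_ms).min(r.len_ms))
}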

I see what you mean. That seems a bit difficult to implement but I'll see what's possible.

Just my 2c, and maybe this would be better as a GitHub discussion than a ticket: whisper_rs would be best kept as a pure Rust wrapper for whisper.cpp. If there are scenarios like using VAD to strip silence from a file to make the transcription faster, they could be covered by examples, or by a second crate that has the extra bells and whistles.

NOTE: I've tried stripping silence, and depending on where and how you strip it, it can make the transcript quality worse. If you are trying to transcribe long content faster, a chunking + parallel transcription approach like the one used by the main binary in the whisper.cpp repo gets great results.
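To illustrate the chunking idea on the whisper-rs side, a rough sketch is below (method names reflect a recent whisper-rs state-based API and exact signatures may differ between versions; it also assumes WhisperContext can be shared across threads with one WhisperState per thread, and uses fixed 30-second chunks purely for simplicity):

use whisper_rs::{FullParams, SamplingStrategy, WhisperContext};

/// Split 16 kHz mono audio into fixed-size chunks and transcribe them in
/// parallel, one WhisperState per thread over a shared WhisperContext.
/// Naive fixed boundaries can cut through words, which is exactly where a
/// VAD-guided split would help.
fn transcribe_chunked(ctx: &WhisperContext, audio: &[f32]) -> Vec<String> {
    const CHUNK_SAMPLES: usize = 16_000 * 30; // 30 s of 16 kHz audio

    std::thread::scope(|scope| {
        let handles: Vec<_> = audio
            .chunks(CHUNK_SAMPLES)
            .map(|chunk| {
                scope.spawn(move || {
                    let mut state = ctx.create_state().expect("create state");
                    let params = FullParams::new(SamplingStrategy::Greedy { best_of: 1 });
                    state.full(params, chunk).expect("transcription failed");

                    let mut text = String::new();
                    let n_segments = state.full_n_segments().expect("segment count");
                    for i in 0..n_segments {
                        text.push_str(&state.full_get_segment_text(i).expect("segment text"));
                    }
                    text
                })
            })
            .collect();

        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}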

Without hijacking the thread, I'd be more interested in having a branch that users of whisper_rs can use in their Cargo.toml, which more closely tracks the main branch of whisper.cpp. There are some significant performance updates, especially the Metal implementation for M1 Macs.

> Just my 2c, and maybe this would be better as a GitHub discussion than a ticket: whisper_rs would be best kept as a pure Rust wrapper for whisper.cpp. If there are scenarios like using VAD to strip silence from a file to make the transcription faster, they could be covered by examples, or by a second crate that has the extra bells and whistles.

Any such feature would be behind a feature flag, but an external crate may be a better idea. However, there could be issues with that if it needs to access some internal state; not sure how to go about that.

> Without hijacking the thread, I'd be more interested in having a branch that users of whisper_rs can use in their Cargo.toml, which more closely tracks the main branch of whisper.cpp. There are some significant performance updates, especially the Metal implementation for M1 Macs.

I do a big batch update every few weeks/months: current one is at #85, branch whisper-8e46ba8

So, doing some digging - it looks like whisper has a very basic VAD. The original whisper project lists:

options['no_speech_threshold'] = 0.275
options['logprob_threshold'] = None

as an example of how to set them. Looking back through whisper-cpp, it looks like this is hardcoded to a set value:
/*.no_speech_thold =*/ 0.6f,

so there could be some hooks in there. However, following on from some of the discussions, this looks very basic and can often need tweaking depending on the audio. WebRTC is old and seems geared more towards noise detection for IP/VoIP telephony, silence suppression etc. A more popular/modern approach is to actually detect voice rather than rely on some threshold or FFT analysis; recently AI has been incorporated, e.g. Silero, which seems to be growing in popularity and is mentioned in the Whisper discussions.

Either way, the common approach suggested by the Whisper folks seems to be to "add something on" - I think it would be best to follow suit, so I'm happy to close this; if Whisper changes direction and ships a built-in solution then we could look at it again.

It would be good, just for completeness, to wire the (not yet implemented) speech threshold in FullParams through to the corresponding hook in whisper-cpp (which is hardcoded at the moment).
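For illustration only, a call site for such a hook might look like the snippet below; set_no_speech_thold is a hypothetical setter name mirroring whisper.cpp's no_speech_thold field and, as noted above, is not currently implemented in FullParams.

use whisper_rs::{FullParams, SamplingStrategy};

let mut params = FullParams::new(SamplingStrategy::Greedy { best_of: 1 });
// Hypothetical setter (not in whisper-rs at the time of writing) that would
// forward to whisper.cpp's `no_speech_thold`, currently hardcoded to 0.6f.
params.set_no_speech_thold(0.275);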

I would recommend you use silero-vad, which boasts good performance and accuracy.

I use it to detect the start and stop positions of voice segments in long audio to ensure (mostly) that a segment I pass for recognition doesn't start or stop in the middle of a word (I adjust the window if it does). In my case I run it under the MS ONNX Runtime using pykeio's ort crate. Using ort it's something like this:

// Assumed context: a method on a struct that holds the Silero VAD ort (v1.x)
// Session and the configured SampleRate; SystemSyncResult is a crate-local
// Result alias. Imports are roughly:
//   use ndarray::{s, Array, Array2, CowArray};
//   use ort::Value;
//   use log::{log, Level};
pub fn is_voice_segment_f32(&self, slice: &[f32], threshold: f32) -> SystemSyncResult<bool> {
    // Silero VAD consumes fixed-size windows (512 samples at 16 kHz).
    let window_size_samples = 512;
    let num_windows = slice.len() / window_size_samples;

    let tensor: Array2<f32> = Array::from_shape_vec((1, slice.len()), slice.to_vec())?;

    // Recurrent state carried from window to window.
    let mut h = Array::zeros((2, 1, 64));
    let mut c = Array::zeros((2, 1, 64));

    let rate_num: i64 = match self.sample_rate {
        SampleRate::Rate8khz => 8000,
        SampleRate::Rate16khz => 16000,
    };
    let sr = Array::from_elem((), rate_num);

    for ix in 0..num_windows {
        let start = ix * window_size_samples;
        let end = start + window_size_samples;

        let window = tensor.slice(s![0, start..end]).to_owned();
        let window_len = window.len();
        let window = window.into_shape((1, window_len))?;

        // ort v1.x expects dynamic-dimension CowArray inputs.
        let window_cow = CowArray::from(window.into_dyn());
        let sr_cow = CowArray::from(sr.clone().into_dyn());
        let h_cow = CowArray::from(h.clone().into_dyn());
        let c_cow = CowArray::from(c.clone().into_dyn());

        let inputs = vec![
            Value::from_array(self.session.allocator(), &window_cow)?,
            Value::from_array(self.session.allocator(), &sr_cow)?,
            Value::from_array(self.session.allocator(), &h_cow)?,
            Value::from_array(self.session.allocator(), &c_cow)?,
        ];

        let outputs = self.session.run(inputs)?;

        // Output 0 is the speech probability for this window.
        let output_tensor = outputs[0].try_extract::<f32>()?;
        match output_tensor.view().first() {
            None => {
                log!(
                    Level::Error,
                    "Unable to fetch output value from output tensor!"
                );
            }
            Some(v) => {
                log!(Level::Debug, "Voice={}, Threshold={}", v, threshold);
                if *v > threshold {
                    // Any window above the threshold counts as voice.
                    return Ok(true);
                }
            }
        }

        // Outputs 1 and 2 are the updated recurrent state; feed them back in.
        let h_tensor = outputs[1].try_extract::<f32>()?;
        let h_raw_data = h_tensor.view().iter().cloned().collect::<Vec<f32>>();
        h = Array::from_shape_vec((2, 1, 64), h_raw_data)?;

        let c_tensor = outputs[2].try_extract::<f32>()?;
        let c_raw_data = c_tensor.view().iter().cloned().collect::<Vec<f32>>();
        c = Array::from_shape_vec((2, 1, 64), c_raw_data)?;
    }

    Ok(false)
}

Running it under tract would be better for this kind of crate, as ORT/onnxruntime pulls in a lot of dependencies for something like this. It's not entirely clear whether it functions under tract, though.