alesaccoia/VoiceStreamAI

Enhancing "Silence at the End of Chunk (s)" Strategy in Continuous Speech Recognition


The current implementation of the "Silence at the End of Chunk (s)" strategy in our continuous speech recognition system performs poorly when there are few breaks in the speech. Testing showed that the silence must fall exactly at the end of the chunk for a transcription to be triggered, which becomes increasingly unlikely as the chunk size grows. Here are some proposed ideas to enhance this strategy:

  1. Waiting for any moment of silence, regardless of its position within the chunk.
  2. Enforcing a maximum length for the buffered audio, so a flush happens even when no silence is found (a sketch of ideas 1 and 2 follows below).
  3. Using a sliding-window strategy, where the audio is transcribed with some overlap and the overlapping segments are merged using tools like Mistral 7B.

These improvements aim to make the speech recognition system more reliable and effective in scenarios where continuous speech is prevalent.
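For reference, here is a minimal sketch of how ideas 1 and 2 could combine, assuming a VAD step (e.g. Silero or Pyannote) has already produced `(start, end)` sample offsets for the speech in the buffer. The names and thresholds are illustrative, not the project's actual API:

```python
import numpy as np

SAMPLE_RATE = 16_000
MIN_SILENCE_S = 0.3   # gap that counts as a pause (illustrative threshold)
MAX_BUFFER_S = 15.0   # idea 2: hard cap on buffered audio

def find_cut_point(buffer: np.ndarray, speech_segments: list[tuple[int, int]]) -> int | None:
    """Idea 1: look for a long-enough silence anywhere in the chunk,
    not only at its very end. speech_segments holds (start, end) sample
    offsets of detected speech; returns a cut index, or None."""
    min_gap = int(MIN_SILENCE_S * SAMPLE_RATE)
    prev_start = len(buffer)
    # Walk segment boundaries from the end, so we cut as late as possible.
    for start, end in reversed(speech_segments):
        if prev_start - end >= min_gap:
            return end + (prev_start - end) // 2   # cut mid-silence
        prev_start = start
    return None

def maybe_flush(buffer: np.ndarray, speech_segments):
    """Return (audio_to_transcribe, remaining_buffer)."""
    cut = find_cut_point(buffer, speech_segments)
    if cut is None and len(buffer) >= MAX_BUFFER_S * SAMPLE_RATE:
        cut = len(buffer)                          # idea 2: force a flush
    if cut is None:
        return None, buffer                        # keep accumulating
    return buffer[:cut], buffer[cut:]
```

Cutting in the middle of the silence gap keeps both sides of the cut free of clipped phonemes.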

Hey Kirill, totally with you on that. I was thinking the same: a mix of these ideas, plus adding timestamps to piece the sliding windows back together. As I mention in #2, setting up a unit test first, to get a solid ground truth and automate testing, seems like the first task.
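To illustrate the timestamp idea, here's a rough sketch of stitching two overlapping windows by word-level timestamps, assuming the ASR returns `(word, start_s, end_s)` tuples in absolute stream time (the function name and inputs are made up for illustration):

```python
def merge_windows(words_a, words_b, overlap_start, overlap_end):
    """Stitch two overlapping windows: assign each word to one side by
    comparing its start time to the overlap midpoint, so words near the
    seam are emitted exactly once. Inputs are (word, start_s, end_s)
    lists with timestamps in absolute (stream) time."""
    midpoint = (overlap_start + overlap_end) / 2
    kept_a = [w for w in words_a if w[1] < midpoint]
    kept_b = [w for w in words_b if w[1] >= midpoint]
    return kept_a + kept_b

# Windows covering 0-15 s and 10-25 s overlap on 10-15 s:
# merged = merge_windows(words_a, words_b, overlap_start=10.0, overlap_end=15.0)
```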

Also, I'm on the lookout for a self-hosted, multi-language LLM that's not Mistral. I personally need something that'll run smoothly on two 16 GB Teslas: I'm planning to dedicate one to the LLM and the other to VAD/ASR. But Mistral is just too big for even one card, and in any case inference can be very slow. Any ideas? Plus, there's room to play around with the idea of using probability-based reasoning on overlapping, timestamped tokens. This way, the LLM would chip in only when the overlapped words are very different.
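Something like this is what I have in mind for the probability-gated merge. It assumes the two overlapped regions have already been aligned word-for-word, and `ask_llm` is a placeholder callback, not a real API; the 0.8 threshold is an illustrative guess:

```python
def reconcile_overlap(words_a, words_b, ask_llm=None, threshold=0.8):
    """words_a / words_b: the overlapped region from two windows, already
    aligned word-for-word, as (word, probability) pairs. The LLM is only
    consulted when the windows disagree and neither side is confident."""
    merged = []
    for (wa, pa), (wb, pb) in zip(words_a, words_b):
        if wa == wb:
            merged.append(wa)                      # windows agree
        elif max(pa, pb) >= threshold:
            merged.append(wa if pa >= pb else wb)  # trust the confident side
        elif ask_llm is not None:
            merged.append(ask_llm(wa, wb))         # genuine conflict
        else:
            merged.append(wa if pa >= pb else wb)
    return merged
```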

I was able to run the project in parallel with mistral-7b-v0.1.Q4_K_M, with a lot of room to spare, on an M3 Max with 64 GB. I haven't measured the transcription, but I believe it is slightly slower than realtime, which makes the VAD more useful. If I recall correctly, the 4090 is the only card that can provide realtime transcription, but if you throw in the sliding window to improve transcription, it might not be enough for truly non-stop speech.

I wonder if, by resampling the audio at 1.5x, you could get the best of both worlds, as long as the original speech is not too fast.
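If anyone wants to try that, a quick sketch of the speed-up, assuming 16 kHz mono input. `librosa.effects.time_stretch` preserves pitch, whereas a plain resample (relabeling the sample rate) would also shift the pitch up:

```python
import librosa

def speed_up(audio, rate=1.5):
    """Time-stretch so the ASR sees 1.5x audio per wall-clock second,
    without changing the pitch of the speech."""
    return librosa.effects.time_stretch(audio, rate=rate)

# y, sr = librosa.load("speech.wav", sr=16_000)
# fast = speed_up(y)   # same sample rate, ~2/3 of the original duration
```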

The new default is the faster-whisper model, which works in realtime with chunks of 3 or 5 seconds. It works pretty well; check out the video
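For anyone trying it standalone, basic faster-whisper usage looks roughly like this; the model size and options are illustrative choices, not necessarily the project's defaults:

```python
from faster_whisper import WhisperModel

# "small" with float16 on GPU is usually fast enough for short chunks.
model = WhisperModel("small", device="cuda", compute_type="float16")

segments, info = model.transcribe("chunk.wav", vad_filter=True, word_timestamps=True)
for seg in segments:
    print(f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}")
```

The word-level timestamps it returns are also exactly what the sliding-window merge above would need.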