alesaccoia/VoiceStreamAI

hallucinated words in the output

Opened this issue · 10 comments

While running the program, irrelevant output like "okay" and "thank you" appears during silence periods.
Is there a way to fix this, or is it a known behaviour of faster-whisper?

I have also noticed that behaviour.

I think the particular VAD model we're using, while conceptually a good fit for the project, is almost useless at the moment; it could be worth experimenting with other models.

For the time being, what I do is read the language_probability. After a bit of experimenting, I've found that a 0.9 threshold basically prevents all the false positives.

websocketRecognition = new WebSocket(recognitionWebSocketAddress);
websocketRecognition.onmessage = function(event) {
    const response = JSON.parse(event.data);

    console.log(response.language_probability);

    // Only treat the transcription as real speech when the server
    // is sufficiently confident about the detected language
    if (response.language_probability > 0.9) {
        doSomethingWith(response.text);
    } else {
        console.log("Speech not recognized. Could be just noise or hallucinations");
    }
};

@alesaccoia

But when I use this for English only, language_probability is always 1.

What is the solution in this case?

I see, I didn't try that. Then maybe you could take a look at the word-level probabilities to isolate the problem?
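
Something along these lines, as a minimal sketch with faster-whisper (the model size, file name, and the 0.6 threshold are placeholders to tune):

from faster_whisper import WhisperModel

model = WhisperModel("small")

# word_timestamps=True makes each segment carry per-word probabilities
segments, info = model.transcribe("chunk.wav", word_timestamps=True)

WORD_PROB_THRESHOLD = 0.6  # placeholder; experiment with this

for segment in segments:
    for word in segment.words:
        if word.probability < WORD_PROB_THRESHOLD:
            # likely noise or a hallucinated filler word
            print(f"dropped {word.word!r} (p={word.probability:.2f})")
        else:
            print(word.word, end="")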

I understand. I looked at the word-level probabilities but couldn't find a workable threshold.

So, do you suggest I use the multilanguage option?

@alesaccoia
Also, when I use multilanguage mode, will I incur additional latency?

Yes, the multilingual model does have additional latency: it first detects the audio language.
That said, this is a very small extra latency compared to the transcription itself.
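
For reference, a minimal sketch of what this looks like with faster-whisper (model and file names are placeholders): when no language is passed, transcribe() runs a detection pass first and reports the result in the returned info; pinning the language skips that pass.

from faster_whisper import WhisperModel

model = WhisperModel("small")

# No language given: a detection pass runs before transcription
segments, info = model.transcribe("chunk.wav")
print(info.language, info.language_probability)

# Language pinned: detection is skipped
# (as noted above, language_probability is then reported as 1)
segments, info = model.transcribe("chunk.wav", language="en")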

Please add "vad_filter=True" to fix the problem.
Example:
model.transcribe(fileName, beam_size=10, language="vi", vad_filter=True)

@KZyred I didn't know that argument existed. Did you test it?

Indeed it helps, but it's not a miracle option.
It does decrease latency for long audio files.
Other types of hallucination may still appear; play with the VAD filter options.
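
For anyone tuning this: vad_parameters forwards options to the built-in Silero VAD that vad_filter enables. A minimal sketch, with placeholder values to experiment with:

from faster_whisper import WhisperModel

model = WhisperModel("small")

segments, info = model.transcribe(
    "audio.wav",
    vad_filter=True,
    vad_parameters=dict(
        threshold=0.6,                # placeholder; raise to be stricter about what counts as speech
        min_silence_duration_ms=500,  # placeholder; pauses shorter than this stay inside a segment
    ),
)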