nyrahealth/CrisperWhisper

Does introducing noise-only samples in training reduce hallucinations?


I would like to know if this approach is truly effective in mitigating hallucinations in the Whisper model.

Yes, it helps quite a bit, but it does not completely eliminate all hallucinations. I am currently exploring different approaches to make the trained cross-attention heads more effective at detecting hallucinations in a robust way.

Since these attention heads were actually trained, I would expect them to exhibit some "unusual" behavior, such as having increased entropy in their cross-attention distribution when hallucinated content is predicted.
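To make "increased entropy" concrete, here is a minimal sketch that computes the Shannon entropy of one token's cross-attention weights over audio frames. The function name and the example weights are invented for illustration, and whether hallucinated tokens really show higher entropy is only the expectation stated above, not a verified result.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution over audio frames."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have a positive sum")
    entropy = 0.0
    for w in weights:
        p = w / total  # normalize so the weights sum to 1
        if p > 0:      # treat 0 * log(0) as 0
            entropy -= p * math.log(p)
    return entropy

# Made-up attention rows: one sharply focused on a single frame, one spread evenly.
focused = [0.97, 0.01, 0.01, 0.01]
diffuse = [0.25, 0.25, 0.25, 0.25]

focused_entropy = attention_entropy(focused)
diffuse_entropy = attention_entropy(diffuse)  # uniform over 4 frames, so log(4)
```

A detector built on this would compare per-token entropy against a threshold tuned on real data; no such threshold is implied here.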

One simple heuristic that could be implemented on top of the current model is this: if a sequence of words has a very short duration (as indicated by the timestamps), these words are likely hallucinated.
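A rough sketch of that heuristic, assuming word-level output in the form of `(word, start, end)` tuples with times in seconds. The function name, the 0.05 s threshold, and the minimum run length of 3 are illustrative assumptions, not values taken from CrisperWhisper.

```python
def flag_short_runs(words, max_word_duration=0.05, min_run_length=3):
    """Return (start_index, end_index) pairs, end exclusive, for runs of
    consecutive words that each last at most max_word_duration seconds."""
    runs = []
    run_start = None
    for i, (_, start, end) in enumerate(words):
        is_short = (end - start) <= max_word_duration
        if is_short and run_start is None:
            run_start = i  # a new run of short words begins here
        elif not is_short and run_start is not None:
            if i - run_start >= min_run_length:
                runs.append((run_start, i))
            run_start = None
    # Close a run that extends to the end of the transcript.
    if run_start is not None and len(words) - run_start >= min_run_length:
        runs.append((run_start, len(words)))
    return runs

# Hypothetical timestamps: two normal words, then four implausibly short ones.
words = [
    ("hello", 0.00, 0.40),
    ("world", 0.45, 0.90),
    ("thank", 1.00, 1.02),
    ("you", 1.02, 1.03),
    ("for", 1.03, 1.05),
    ("watching", 1.05, 1.06),
]
suspect_runs = flag_short_runs(words)  # [(2, 6)]
```

Requiring a minimum run length keeps a single short function word from being flagged on its own.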

If you come across audio where the model starts hallucinating, I would be very interested in seeing those clips! :)

I want to fine-tune the original Whisper model on my own dataset, with noise-only samples added to reduce hallucinations. Is this possible?

Yes, this is certainly possible :)

You will have to be careful, though, and add a meaningful amount of additional data in the language(s) you are interested in so that you do not degrade the performance of the base model. Happy tuning!

First, I used the AISHELL corpus to fine-tune Whisper, and for noise data, I used the FSDnoisy18k dataset and random Gaussian noise.
I randomly selected noise from noise data, added it to the original speech data, and used it to generate a noise-only sample. Is that OK?
Second, do I need to use the same audio files from AphasiaBank to validate hallucination mitigation? Are there any other methods?

I randomly selected noise from noise data, added it to the original speech data, and used it to generate a noise-only sample. Is that OK?

Not quite. Noise-only samples contain no speech, so adding noise to speech will not result in a noise-only sample.

Please carefully study Section 3 of the paper, especially 3.2. The details are given there.
https://arxiv.org/pdf/2408.16589
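The distinction can be sketched as below. This is only an illustration: the helper names, the dictionary layout, the 16 kHz rate, and the use of an empty string as the training target are assumptions on my part, and Section 3.2 of the paper remains the reference for the actual procedure. Plain Python lists stand in for audio arrays.

```python
import random

def make_noise_only_sample(noise, num_samples, rng):
    """Cut a random segment from a noise recording and pair it with an
    empty transcript (assumed target for audio that contains no speech)."""
    if len(noise) < num_samples:
        raise ValueError("noise recording is shorter than the requested segment")
    offset = rng.randrange(len(noise) - num_samples + 1)
    return {"audio": noise[offset:offset + num_samples], "text": ""}

def make_noisy_speech_sample(speech, transcript, noise, rng, noise_gain=0.1):
    """Mix noise into speech. Speech is still present and the transcript is
    kept, so this is augmentation, NOT a noise-only sample."""
    if len(noise) < len(speech):
        raise ValueError("noise recording is shorter than the speech clip")
    offset = rng.randrange(len(noise) - len(speech) + 1)
    mixed = [s + noise_gain * noise[offset + i] for i, s in enumerate(speech)]
    return {"audio": mixed, "text": transcript}

rng = random.Random(0)
# One second of synthetic Gaussian noise at an assumed 16 kHz sample rate.
noise = [rng.gauss(0.0, 0.1) for _ in range(16000)]
# Placeholder "speech" signal; a real pipeline would load a recorded utterance.
speech = [0.0] * 8000

noise_only = make_noise_only_sample(noise, 8000, rng)
noisy_speech = make_noisy_speech_sample(speech, "placeholder transcript", noise, rng)
```

Both kinds of sample can be useful, but only the first one teaches the model to output nothing when nobody is speaking.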

Sorry, what I meant is that I used the noise data to generate the noise-only samples.