Does introducing noise-only samples in training reduce hallucinations?
I would like to know if this approach is truly effective in mitigating hallucinations in the Whisper model.
Yes, it helps quite a bit, but it does not completely eliminate all hallucinations. I am currently exploring different approaches to make the trained cross-attention heads more effective at detecting hallucinations in a robust way.
Since these attention heads were actually trained, I would expect them to exhibit some "unusual" behavior, such as having increased entropy in their cross-attention distribution when hallucinated content is predicted.
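To make the entropy idea concrete, here is a rough sketch of how one might score decoded tokens by the entropy of their cross-attention distribution over audio frames. The function names, the plain-list input format, and the `threshold_ratio` cutoff are all assumptions for illustration; this is not how the model currently detects hallucinations, and the threshold would need tuning on real data.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution over audio frames."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)

def flag_high_entropy_tokens(attn_rows, threshold_ratio=0.8):
    """Return indices of tokens whose normalised entropy exceeds threshold_ratio.

    attn_rows: one list of attention weights per decoded token (hypothetical format).
    threshold_ratio: hypothetical cutoff, as a fraction of the maximum
    possible entropy log(n_frames).
    """
    flagged = []
    for i, row in enumerate(attn_rows):
        max_entropy = math.log(len(row))
        if max_entropy > 0 and attention_entropy(row) / max_entropy > threshold_ratio:
            flagged.append(i)
    return flagged

# A token that attends sharply to one frame vs. one that attends everywhere.
focused = [0.01, 0.01, 0.95, 0.02, 0.01]
diffuse = [0.2, 0.2, 0.2, 0.2, 0.2]
print(flag_high_entropy_tokens([focused, diffuse]))  # [1]
```

Normalising by `log(n_frames)` keeps the cutoff comparable across clips of different lengths.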
One simple heuristic that could be implemented on top of the current model is this: if a sequence of words has a very short duration (as indicated by the timestamps), these words are likely hallucinated.
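A minimal sketch of that duration heuristic, assuming word-level timestamps are available as `(text, start_sec, end_sec)` tuples. The `min_duration` and `min_run` values are made-up defaults, not tuned numbers:

```python
def flag_short_word_runs(words, min_duration=0.05, min_run=2):
    """Return runs of consecutive words that are each shorter than min_duration.

    words: list of (text, start_sec, end_sec) tuples (hypothetical format).
    min_duration: hypothetical cutoff in seconds below which a word is suspicious.
    min_run: number of consecutive short words required before a run is reported.
    """
    runs, current = [], []
    for text, start, end in words:
        if end - start < min_duration:
            current.append(text)
        else:
            if len(current) >= min_run:
                runs.append(current)
            current = []
    # Do not forget a run that ends at the last word.
    if len(current) >= min_run:
        runs.append(current)
    return runs

words = [
    ("hello", 0.00, 0.40),
    ("thank", 0.40, 0.42),
    ("you", 0.42, 0.43),
    ("for", 0.43, 0.44),
    ("world", 0.50, 0.90),
]
print(flag_short_word_runs(words))  # [['thank', 'you', 'for']]
```

Requiring a run of several short words rather than a single one should reduce false positives on genuinely short words.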
If you come across audio where the model starts hallucinating, I would be very interested in seeing those clips! :)
I want to fine-tune the original Whisper model on my own dataset, with noise-only samples added, to reduce hallucinations. Is this possible?
Yes this is certainly possible :)
You will have to be careful, though, and add a meaningful amount of additional data in the language(s) you are interested in so that you do not degrade the performance of the base model. Happy tuning!
First, I used the AISHELL corpus to fine-tune Whisper, and for noise data, I used the FSDnoisy18k dataset and random Gaussian noise.
I randomly selected noise from noise data, added it to the original speech data, and used it to generate a noise-only sample. Is that OK?
Second, do I need to use the same audio files from AphasiaBank to validate hallucination mitigation? Are there any other methods?
I randomly selected noise from noise data, added it to the original speech data, and used it to generate a noise-only sample. Is that OK?
Not quite. Noise-only samples contain no speech, so adding noise to speech will not result in a noise-only sample.
Please carefully study Section 3 of the paper, especially Section 3.2. The details are given there.
https://arxiv.org/pdf/2408.16589
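To illustrate the difference, here is a small sketch contrasting a noise-only sample with a noisy-speech sample. It assumes the training target for a noise-only sample is an empty transcript; the dict layout, function names, and `noise_std` value are hypothetical, and the exact recipe should be taken from the paper rather than from this snippet.

```python
import random

SAMPLE_RATE = 16_000  # Whisper operates on 16 kHz audio

def make_noise_only_sample(duration_sec, noise_std=0.05, seed=None):
    """Gaussian noise with no speech, paired with an empty transcript (assumed target)."""
    rng = random.Random(seed)
    n_samples = int(duration_sec * SAMPLE_RATE)
    audio = [max(-1.0, min(1.0, rng.gauss(0.0, noise_std))) for _ in range(n_samples)]
    return {"audio": audio, "text": ""}

def make_noisy_speech_sample(speech, transcript, noise_std=0.05, seed=None):
    """Speech plus noise: still a speech sample, so it keeps its transcript."""
    rng = random.Random(seed)
    audio = [max(-1.0, min(1.0, s + rng.gauss(0.0, noise_std))) for s in speech]
    return {"audio": audio, "text": transcript}

noise_only = make_noise_only_sample(0.5, seed=0)
noisy_speech = make_noisy_speech_sample([0.0] * 160, "placeholder transcript", seed=0)
print(len(noise_only["audio"]), repr(noise_only["text"]))  # 8000 ''
```

The same pattern applies to recorded noise clips (for example from a noise dataset): the clip is used on its own, without mixing it into speech.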
Sorry, I meant that I used only the noise data to generate the noise-only samples.