ggerganov/whisper.cpp

Spam Attack

DariusAlexander opened this issue · 2 comments

Noticed there are prediction outputs that include spam:

start,end,text
0,8640," 6 greens of fresh snow peas, 5 thick slabs of blue cheese and maybe a snack for her brothered"
8640,9000," Bob."
9000,16000," For more information visit www.beadaholique.com to purchase beading supplies and to get design ideas!"
16000,23000," www.beadaholique.com to purchase beading supplies and to get design ideas!"
23000,30000," www.beadaholique.com to purchase beading supplies and to get design ideas!"

The source audio file is 30s long, zero-padded at the end with about 20s of (absolute) silence.
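As a workaround, stripping the silent tail before transcribing should sidestep the repeated segments, since the hallucinated text only shows up over the padding. A rough sketch using ffmpeg's silenceremove filter (the -50dB threshold and the file names are placeholders; whisper.cpp expects 16 kHz mono 16-bit WAV input):

# Trim trailing silence: reverse, drop the now-leading silence, reverse back,
# then resample to the 16 kHz mono format that ./main expects.
ffmpeg -i myfile.wav \
    -af "areverse,silenceremove=start_periods=1:start_threshold=-50dB,areverse" \
    -ar 16000 -ac 1 myfile-trimmed.wav
./main -ocsv -f myfile-trimmed.wav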

I followed the Quick Start guide:
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
bash ./models/download-ggml-model.sh base.en
make
./main -ocsv -f myfile.wav

I've just started looking at this project, so I don't know the problem deeply, but it seems the model downloaded by ./models/download-ggml-model.sh (https://huggingface.co/ggerganov/whisper.cpp) might be the issue.
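If it is the model itself hallucinating on silence, one thing that might help (a sketch only; flag names and defaults depend on the whisper.cpp version you build) is tightening the decoder fallback thresholds so that repetitive or low-confidence segments get re-decoded instead of kept:

# Sketch, assuming the -et/--entropy-thold and -lpt/--logprob-thold options of
# the main example: they set the thresholds at which a decoded segment is
# treated as a failure and re-tried at a higher temperature. The values below
# are just a starting point, not tuned.
./main -ocsv -et 2.8 -lpt -0.5 -f myfile.wav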

I have been contemplating for the past two months how to use the limited resources I have (4 * V100, 4 * P100, 2 * 2080 Ti, and 200 A100 card-hours gifted to me by someone else) to partially solve these issues. Whisper sometimes hallucinates severely; see this paper: https://arxiv.org/pdf/2402.08021. The reason for these severe hallucinations is that Whisper is trained on a weakly labeled dataset with considerable noise, which makes it prone to learning irrelevant information. My current idea is to distill Whisper large-v2, use it to label datasets, clean those datasets with an LLM and other neural networks, and finally train a new Whisper based on a Mixture of Experts (MoE) architecture. However, I'm not entirely sure whether this approach will be successful.

Whisper's current vocabulary is also still too small (only about 60K tokens), which limits model performance, and the context window of only 448 tokens is too short and needs to be expanded.