futo-org/whisper-acft

Quality suffers on earnings22 dataset

soupslurpr opened this issue · 2 comments

Using evaluation.ipynb on https://huggingface.co/datasets/distil-whisper/earnings22 (chunked, test), whisper-tiny.en gets 18 WER without dynamic audio context, while acft-whisper-tiny.en with dynamic audio context gets 318 WER. This suggests that the ACFT fine-tuned model with dynamic audio context may not work well in real-world conditions, which include diverse accents and varying speech conditions.
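For anyone wondering how a WER above 100 is possible: WER divides word-level edits (substitutions, deletions, insertions) by the reference length, so a hypothesis with many extra inserted words can exceed 100. Below is a minimal sketch of that calculation. The function name word_error_rate is hypothetical, and this is not necessarily how evaluation.ipynb computes the metric (it may normalize text first or use a library).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length, in percent.

    Hypothetical helper for illustration; no text normalization is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr

    return 100.0 * prev[-1] / len(ref)
```

With a two-word reference and a hypothesis that repeats those two words three times, the four extra words count as insertions and the result is 200.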

Not sure why, but changing ADD_AUDIO_CTX to 64 makes acft-whisper-tiny.en achieve 19 WER on earnings22.
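To make the role of the margin concrete, here is a rough sketch of how a dynamic audio context could be derived from clip length. It relies on the fact that Whisper's encoder uses 1500 positions for a 30 second window, i.e. 50 positions per second. The function name dynamic_audio_ctx, the rounding, and the clamping are assumptions for illustration; the exact formula in evaluation.ipynb may differ, and the default of 64 is simply the value mentioned above.

```python
import math

MAX_AUDIO_CTX = 1500      # encoder positions for a full 30 s window
POSITIONS_PER_SECOND = 50  # 1500 positions / 30 s


def dynamic_audio_ctx(duration_s: float, add_audio_ctx: int = 64) -> int:
    """Estimate an audio context for a clip of the given duration.

    Hypothetical helper: converts duration to encoder positions, adds a
    safety margin (add_audio_ctx), and clamps to the range [1, 1500].
    """
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    positions = math.ceil(duration_s * POSITIONS_PER_SECOND) + add_audio_ctx
    return max(1, min(MAX_AUDIO_CTX, positions))
```

Under these assumptions, a 5 second clip with a margin of 64 gives an audio context of 314, and anything close to 30 seconds is capped at 1500, which matches the non-dynamic setting.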

Can you share which parameter needs to be set in whisper to use this model? For example, wparams.audio_ctx = 1500;