huggingface/distil-whisper

Short form evaluation WER % for Librispeech clean test

guynich opened this issue · 3 comments

Hi, I'm enjoying working with this fascinating repo.

Looking at the Stage 4 short-form evaluation, I modified the short-form evaluation bash script for the Librispeech clean dataset (test split) for the OpenAI Large-v2 model here and the Small model here.
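
For reference, this is roughly the kind of evaluation I am running. It is only a minimal sketch, not the repo's evaluation script: the openai/whisper-small model id, greedy decoding, and the crude lower-casing normaliser (in place of Whisper's English text normaliser) are my own assumptions for illustration.

```python
# Minimal sketch (not the repo's evaluation script) of how an eval/wer
# number is produced: transcribe Librispeech test-clean and score with WER.
# Assumptions: openai/whisper-small, greedy decoding, and a crude
# lower-casing normaliser instead of Whisper's English text normaliser.
import re

import evaluate
import torch
from datasets import load_dataset
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    device=device,
)

dataset = load_dataset("librispeech_asr", "clean", split="test")
wer_metric = evaluate.load("wer")


def normalise(text: str) -> str:
    # Stand-in for Whisper's English text normaliser.
    return re.sub(r"[^\w\s']", "", text.lower()).strip()


predictions, references = [], []
for sample in dataset:
    # Librispeech audio is already 16 kHz, matching Whisper's expected rate.
    out = asr(
        sample["audio"]["array"],
        generate_kwargs={"language": "en", "task": "transcribe"},
    )
    predictions.append(normalise(out["text"]))
    references.append(normalise(sample["text"]))

wer = 100 * wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.4f}%")
```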

The generated WER % results are higher than the HuggingFace model card evaluation WER results, which is unexpected.

E.g.:

| model | script eval/wer | HF model card WER |
| --- | --- | --- |
| OpenAI Large-v2 | 3.1683 | 3.0004 |
| OpenAI Small | 4.0682 | 3.4322 |

Any suggestions as to what might be causing these WER differences (perhaps my short-form eval bash scripts)?

The above table is with --language "en" in the short-form bash scripts. By removing this flag and rerunning the evaluation, the eval/wer values are lower (see the sketch after the table below).

E.g.:

| model | eval/wer with --language "en" | eval/wer without --language | HF model card WER |
| --- | --- | --- | --- |
| OpenAI Large-v2 | 3.1683 | 2.5685 | 3.0004 |
| OpenAI Small | 4.0682 | 3.44541 | 3.4322 |
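
In case it helps, this sketch shows how I understand the difference the flag makes at generation time; the model id and the single test sample are placeholders, and the repo's script may wire the flag differently.

```python
# Sketch of what --language "en" changes at generation time.
# With the flag, the <|en|> language token is forced in the decoder prompt;
# without it, Whisper predicts the language token itself before transcribing,
# which changes the decoding path and therefore the measured WER.
from datasets import load_dataset
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "openai/whisper-large-v2"  # placeholder; same applies to Small
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

audio = load_dataset("librispeech_asr", "clean", split="test")[0]["audio"]
inputs = processor(
    audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt"
)

# Equivalent of running with --language "en": the language token is forced.
forced = model.generate(inputs.input_features, language="en", task="transcribe")

# Equivalent of running without --language: the language is auto-detected.
detected = model.generate(inputs.input_features, task="transcribe")

print(processor.batch_decode(forced, skip_special_tokens=True)[0])
print(processor.batch_decode(detected, skip_special_tokens=True)[0])
```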

Without the --language flag:

  • Large-v2 model eval/wer is lower than the HuggingFace model card WER value, and lower than the original OpenAI paper result of 2.7% in Table 2.
  • Small model eval/wer is similar to the HuggingFace model card WER value.

I'm closing this issue: the Small and Tiny model results for the HF model card and eval/wer without the --language option are sufficiently aligned for me.

(I don't understand the discrepancy in values for Large-v2, but I can leave that for now.)