Short form evaluation WER % for LibriSpeech clean test
guynich opened this issue · 3 comments
Hi, I'm enjoying working with this fascinating repo.
Looking at the Stage 4 short form evaluation, I modified the short form evaluation bash script for the LibriSpeech clean dataset (test split) for the OpenAI Large-v2 model here and the Small model here.
The generated WER % results are higher than the HuggingFace model card evaluation WER results, which is unexpected.
E.g.:
| model | script `eval/wer` | HF model card WER |
| --- | --- | --- |
| OpenAI Large-v2 | 3.1683 | 3.0004 |
| OpenAI Small | 4.0682 | 3.4322 |
Any suggestions as to what might be causing these WER differences (perhaps my short form eval bash scripts)?
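For context, my bash scripts wrap the repo's evaluation code, so the loop below is only an approximation of what actually runs. It's a minimal Python sketch of the evaluation I'm trying to reproduce, assuming the Hugging Face `transformers`, `datasets`, and `evaluate` packages and the Whisper English text normalizer; the real script's arguments and batching may differ.

```python
# Minimal sketch of the evaluation loop I'm approximating (assumes the
# Hugging Face `transformers`, `datasets`, and `evaluate` packages; the
# repo's bash scripts wrap the actual eval code, so arguments may differ).
import evaluate
import torch
from datasets import load_dataset
from transformers import pipeline
from transformers.models.whisper.english_normalizer import EnglishTextNormalizer

model_id = "openai/whisper-small"  # or "openai/whisper-large-v2"
device = "cuda:0" if torch.cuda.is_available() else "cpu"

asr = pipeline("automatic-speech-recognition", model=model_id, device=device)
normalizer = EnglishTextNormalizer(asr.tokenizer.english_spelling_normalizer)

dataset = load_dataset("librispeech_asr", "clean", split="test")
wer_metric = evaluate.load("wer")

predictions, references = [], []
for sample in dataset:
    # Mirrors --language "en": force English transcription at decode time.
    result = asr(
        sample["audio"],
        generate_kwargs={"language": "en", "task": "transcribe"},
    )
    predictions.append(normalizer(result["text"]))
    references.append(normalizer(sample["text"]))

wer = 100 * wer_metric.compute(predictions=predictions, references=references)
print(f"{model_id} WER on LibriSpeech test-clean: {wer:.4f}%")
```

As far as I know the model card numbers are also computed after Whisper-style text normalization, so I don't think the normalizer itself explains the gap, but I may be wrong about that.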
The above table is with `--language "en"` in the short form bash scripts. By removing this flag and rerunning the evaluation, the `eval/wer` values are lower.
E.g.:
| model | `eval/wer` with `--language "en"` | `eval/wer` without `--language` | HF model card WER |
| --- | --- | --- | --- |
| OpenAI Large-v2 | 3.1683 | 2.5685 | 3.0004 |
| OpenAI Small | 4.0682 | 3.44541 | 3.4322 |
Without the `--language` flag:

- the Large-v2 model `eval/wer` is lower than the HuggingFace model card WER value, and lower than the original OpenAI paper result of 2.7% in Table 2.
- the Small model `eval/wer` is similar to the HuggingFace model card WER value.
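For what it's worth, my (unverified) understanding is that the only thing the flag should change is whether the English language token is forced or auto-detected per utterance at decode time, roughly like this:

```python
# Sketch of the generate-time difference I believe the flag maps to
# (assumption: the bash scripts pass --language through to model.generate).
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "openai/whisper-large-v2"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

def transcribe(audio_array, sampling_rate, force_english):
    inputs = processor(audio_array, sampling_rate=sampling_rate, return_tensors="pt")
    if force_english:
        # With --language "en": the <|en|> token is forced in the decoder prompt.
        ids = model.generate(inputs.input_features, language="en", task="transcribe")
    else:
        # Without the flag: Whisper first detects the language from the audio.
        ids = model.generate(inputs.input_features, task="transcribe")
    return processor.batch_decode(ids, skip_special_tokens=True)[0]
```

Since test-clean is entirely English, I'd naively expect language detection to land on English every time, which is why the size of the gap for Large-v2 surprises me.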
Added Tiny model script and result here: https://github.com/guynich/distil-whisper/tree/main/training/scripts#summary.
I'm closing this issue: the Small and Tiny model results for the HF model card WER and `eval/wer` without the `--language` option are sufficiently aligned for me.
(I don't understand the discrepancy in the Large-v2 values, but I can leave it at that.)