Short form evaluation WER % for LibriSpeech clean test
guynich opened this issue · 3 comments
Hi, I'm enjoying working with this fascinating repo.
Looking at the Stage 4 short form evaluation, I modified the short form evaluation bash script for the LibriSpeech clean dataset (test split) for the OpenAI Large-v2 model here and the Small model here.
The generated WER % results are higher than the HuggingFace model card evaluation WER results, which is unexpected.
E.g.:
| model | script `eval/wer` | HF model card WER |
| --- | --- | --- |
| OpenAI Large-v2 | 3.1683 | 3.0004 |
| OpenAI Small | 4.0682 | 3.4322 |
Any suggestions as to what might be causing these WER differences (perhaps my short form eval bash scripts)?
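For context, my bash scripts wrap the repo's evaluation code, so the loop below is only an approximation of what actually runs. It's a minimal Python sketch of the evaluation I'm trying to reproduce, assuming the Hugging Face `transformers`, `datasets`, and `evaluate` packages and the Whisper English text normalizer; the real script's arguments and batching may differ.

```python
# Minimal sketch of the evaluation loop I'm approximating (assumes the
# Hugging Face `transformers`, `datasets`, and `evaluate` packages; the
# repo's bash scripts wrap the actual eval code, so arguments may differ).
import evaluate
import torch
from datasets import load_dataset
from transformers import pipeline
from transformers.models.whisper.english_normalizer import EnglishTextNormalizer

model_id = "openai/whisper-small"  # or "openai/whisper-large-v2"
device = "cuda:0" if torch.cuda.is_available() else "cpu"

asr = pipeline("automatic-speech-recognition", model=model_id, device=device)
normalizer = EnglishTextNormalizer(asr.tokenizer.english_spelling_normalizer)

dataset = load_dataset("librispeech_asr", "clean", split="test")
wer_metric = evaluate.load("wer")

predictions, references = [], []
for sample in dataset:
    # Mirrors --language "en": force English transcription at decode time.
    result = asr(
        sample["audio"],
        generate_kwargs={"language": "en", "task": "transcribe"},
    )
    predictions.append(normalizer(result["text"]))
    references.append(normalizer(sample["text"]))

wer = 100 * wer_metric.compute(predictions=predictions, references=references)
print(f"{model_id} WER on LibriSpeech test-clean: {wer:.4f}%")
```

As far as I know the model card numbers are also computed after Whisper-style text normalization, so I don't think the normalizer itself explains the gap, but I may be wrong about that.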
The above table is with `--language "en"` in the short form bash scripts. By removing this flag and rerunning the evaluation, the `eval/wer` values are lower.
E.g.:
| model | `eval/wer` with `--language "en"` | `eval/wer` without `--language` | HF model card WER |
| --- | --- | --- | --- |
| OpenAI Large-v2 | 3.1683 | 2.5685 | 3.0004 |
| OpenAI Small | 4.0682 | 3.44541 | 3.4322 |
Without the `--language` flag:

- the Large-v2 model `eval/wer` is lower than the HuggingFace model card WER value, and lower than the original OpenAI paper result of 2.7% in Table 2.
- the Small model `eval/wer` is similar to the HuggingFace model card WER value.
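For what it's worth, my (unverified) understanding is that the only thing the flag should change is whether the English language token is forced or auto-detected per utterance at decode time, roughly like this:

```python
# Sketch of the generate-time difference I believe the flag maps to
# (assumption: the bash scripts pass --language through to model.generate).
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "openai/whisper-large-v2"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

def transcribe(audio_array, sampling_rate, force_english):
    inputs = processor(audio_array, sampling_rate=sampling_rate, return_tensors="pt")
    if force_english:
        # With --language "en": the <|en|> token is forced in the decoder prompt.
        ids = model.generate(inputs.input_features, language="en", task="transcribe")
    else:
        # Without the flag: Whisper first detects the language from the audio.
        ids = model.generate(inputs.input_features, task="transcribe")
    return processor.batch_decode(ids, skip_special_tokens=True)[0]
```

Since test-clean is entirely English, I'd naively expect language detection to land on English every time, which is why the size of the gap for Large-v2 surprises me.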
Added Tiny model script and result here: https://github.com/guynich/distil-whisper/tree/main/training/scripts#summary.
I'm closing this issue: the Small and Tiny model results for the HF model card WER and `eval/wer` without the `--language` option are sufficiently aligned for me.
(I don't understand the discrepancy in the Large-v2 values, but I can leave it at that.)