futo-org/whisper-acft

Can't Reproduce WER reported

berkind1 opened this issue · 3 comments

Hi,

Thank you for sharing such an impressive project! We had trouble loading the provided fine-tuned models in .bin format. Could you provide code to load these models, or share them in safetensors format?

Since we were unable to load the provided fine-tuned model, we proceeded to fine-tune and evaluate the model using your provided code. However, we encountered challenges in reproducing the results documented in the readme file.

Initially, we fine-tuned whisper-tiny on the Florence dataset using fine-tune.ipynb without modifying hyperparameters. The resulting Word Error Rate (WER) for the Florence test set was 27.8.

Subsequently, using evaluation.ipynb, we evaluated the fine-tuned model on LibriSpeech Clean, LibriSpeech Other, and VoxPopuli datasets. Here are our findings:

Screenshot 2024-06-19 at 11 21 37 AM

These results do not align with the reported results in the readme file, which are:

librispeech clean tiny.en:

  • 4.96 - finetuned model, audio_ctx=1500
  • 5.50 - finetuned model, dynamic audio_ctx

librispeech other tiny.en:

  • 17.57 - finetuned model, audio_ctx=1500
  • 16.51 - finetuned model, dynamic audio_ctx
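For reference, the WER figures quoted in this thread are percentages: the word-level edit distance between the model's transcript and the reference, divided by the number of reference words. Below is a minimal sketch of that metric, not the code used in evaluation.ipynb. The function name `word_error_rate` is hypothetical, and the only text normalization assumed here is lowercasing and whitespace splitting, so absolute numbers may differ from the notebook's.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference length.

    Assumption: normalization is only lowercasing + whitespace splitting;
    a real evaluation pipeline may normalize text differently.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed prefix of ref
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```

With this definition, one wrong word out of four gives `word_error_rate("a b c d", "a b x d") == 25.0`, and a transcript with no words in common with the reference can reach or exceed 100.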

Could you please clarify a few points? Are the hyperparameters used in the provided code different from those used to generate the WER reported in the readme file? Could this discrepancy be the reason behind the differences in results?

Thank you so much!

I can't say exactly why your WER differs. I spun up a standard RunPod instance with runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04, ran the unmodified finetune.ipynb for the full 8 epochs, and then evaluated with the following parameters (to match whisper-tiny):

# If true, will use dynamic audio_ctx based on input length
# If false, will run the model normally (padding to 30sec)
USE_AUDIO_CTX = True

# Type of model, must match LOAD_FROM if set
MODEL_TYPE = "openai/whisper-tiny"

# Load from path (for finetuned model), or set to None if you'd like to use the normal one
LOAD_FROM = "/workspace/model_train-tiny3"

# Original results are with ADD_AUDIO_CTX = 8
ADD_AUDIO_CTX = 8

This gives the following result on my end, which is closer to the expected result:
Screenshot_20240620_175427
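To illustrate what "dynamic audio_ctx based on input length" could mean: Whisper's encoder uses 1500 positions for its 30-second window, i.e. 50 positions per second of 16 kHz audio, so a shorter clip needs proportionally fewer positions plus some margin. The sketch below is an assumption about the general idea, not the formula from evaluation.ipynb. The function name `dynamic_audio_ctx`, the use of `ceil`, and the clamping are all hypothetical; only the ADD_AUDIO_CTX = 8 margin comes from the settings above.

```python
import math

SAMPLE_RATE = 16_000   # Whisper operates on 16 kHz audio
MAX_AUDIO_CTX = 1500   # encoder positions for the full 30 s window
POSITIONS_PER_SECOND = MAX_AUDIO_CTX / 30  # 50 positions per second


def dynamic_audio_ctx(num_samples: int, add_audio_ctx: int = 8) -> int:
    """Hypothetical helper: pick an encoder context length for a clip.

    Assumption: context = positions covering the clip, rounded up,
    plus a fixed margin, clamped to the model's maximum of 1500.
    """
    duration_s = num_samples / SAMPLE_RATE
    ctx = math.ceil(duration_s * POSITIONS_PER_SECOND) + add_audio_ctx
    return max(1, min(ctx, MAX_AUDIO_CTX))
```

Under these assumptions a 10-second clip would use 508 positions instead of 1500, while a full 30-second clip is clamped to 1500, which corresponds to the USE_AUDIO_CTX = False behaviour of padding to 30 seconds.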

Note that the whisper-tiny model is distinct from whisper-tiny.en. The tiny.en model is trained only on English, whereas tiny is multilingual, and the two use different tokenizers. The multilingual training task also means a higher English WER for tiny.

I suspect your problem may be that you left the model at its default of tiny during finetuning, but set MODEL_TYPE to openai/whisper-tiny.en during evaluation. In this configuration, it will run the tiny model but decode its output with the wrong (tiny.en) tokenizer, so the results are all gibberish. If I make this mistake myself, I get a similarly high WER to the one you reported:
Screenshot_20240620_175959

You can see that the results are guaranteed to be nonsense in this configuration by looking at what happens when you detokenize whisper-tiny tokens with the whisper-tiny.en tokenizer:

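from transformers import pipeline

# Load both pipelines; device=0 assumes a CUDA GPU is available
whisper_tiny_en = pipeline("automatic-speech-recognition", model="openai/whisper-tiny.en", device=0)
whisper_tiny = pipeline("automatic-speech-recognition", model="openai/whisper-tiny", device=0)

# Encode with the multilingual tokenizer, then decode with the English-only one
tokenized = whisper_tiny.tokenizer.encode("This is an example sentence")
print(whisper_tiny_en.tokenizer.decode(tokenized))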
# prints "<|en|>eper beers better jur<|startoftranscript|>"

Ensure you use the same model type during finetuning and evaluation. I will look into putting the original safetensors weights on HuggingFace.
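One way to catch this kind of mismatch early is a small guard before evaluation. The sketch below is not part of the repository's notebooks; the function names are hypothetical, and it relies only on the naming convention that OpenAI's English-only Whisper checkpoints end in ".en". It assumes you record which base model you finetuned from.

```python
def is_english_only(model_type: str) -> bool:
    """English-only Whisper checkpoints carry a '.en' suffix, e.g. openai/whisper-tiny.en."""
    return model_type.endswith(".en")


def check_model_type(trained_from: str, model_type: str) -> None:
    """Hypothetical guard: raise if finetuning and evaluation disagree
    on multilingual vs. English-only, since the tokenizers differ."""
    if is_english_only(trained_from) != is_english_only(model_type):
        raise ValueError(
            f"Model was finetuned from {trained_from!r} but MODEL_TYPE is "
            f"{model_type!r}; their tokenizers are not interchangeable."
        )
```

For example, `check_model_type("openai/whisper-tiny", "openai/whisper-tiny.en")` raises a ValueError, while passing the same model type for both arguments returns silently.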

PS: You mention you finetuned on the "Florence" dataset. If this is not a misspelling of the default fleurs dataset in the finetuning notebook, I would suggest sticking to the default dataset if you're trying to reproduce the result.

Thank you for the clarification. We discovered that instead of using whisper-tiny.en, we had inadvertently fine-tuned whisper-tiny, which was indeed the source of the issue, as you suspected. After fine-tuning the correct model, we have successfully replicated the reported results.

Screenshot 2024-06-20 at 12 02 10 PM

Thanks again for sharing your code and the fine-tuned models for this exciting project!

Regarding the safetensors format: here is the whisper-tiny.en file used to generate the results above. It was fine-tuned on the fleurs dataset using the default parameters and your code. Feel free to integrate it into your repository for easier future use.

We've uploaded weights to our HF: https://huggingface.co/futo-org