speechbrain/speechbrain

LibriSpeech Whisper Finetuning - WER 98% after 3 epochs

FrancescoBonzi opened this issue · 9 comments

Describe the bug

I reproduced the Whisper fine-tuning recipe on LibriSpeech as-is:

  1. download of LibriSpeech from http://www.openslr.org/12
  2. run librispeech_prepare.py (train-clean-100, dev-clean, test-clean)
  3. python train_with_whisper.py hparams/train_hf_whisper.yaml (with whisper-tiny)

After 3 epochs the computed WER on test-clean is about 98%.
Please help me figure out whether I made a mistake or whether there's a bug in the script.

Expected behaviour

I expect the WER to be around 6%, as with the HuggingFace recipe (using LibriSpeech).

To Reproduce

No response

Environment Details

I'm using a SageMaker notebook with an ml.g4dn instance (16 GB NVIDIA T4).

Relevant Log Output

After 3 epochs the wer_test-clean.txt file contains:

Format:
<utterance-id>, WER DETAILS
<eps> ; reference  ; on ; the ; first ;  line
  I   ;     S      ; =  ;  =  ;   S   ;   D  
 and  ; hypothesis ; on ; the ; third ; <eps>
================================================================================
672-122797-0033, %WER 50.00 [ 1 / 2, 0 ins, 1 del, 0 sub ]
a ; story
= ;   D  
a ; <eps>
================================================================================
2094-142345-0041, %WER 100.00 [ 1 / 1, 0 ins, 0 del, 1 sub ]
direction
    S    
    di   
================================================================================
2830-3980-0026, %WER 100.00 [ 2 / 2, 0 ins, 1 del, 1 sub ]
verse ;   2  
  S   ;   D  
  f   ; <eps>
================================================================================
237-134500-0025, %WER 50.00 [ 1 / 2, 0 ins, 1 del, 0 sub ]
0 ;  emil
= ;   D  
0 ; <eps>
================================================================================
...
================================================================================
5105-28241-0007, %WER 100.00 [ 7 / 7, 0 ins, 6 del, 1 sub ]
there ;   is  ;   no  ;  fear ;   of  ;  that ;  sir 
  S   ;   D   ;   D   ;   D   ;   D   ;   D   ;   D  
  th  ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps>
================================================================================
6930-76324-0022, %WER 100.00 [ 4 / 4, 0 ins, 3 del, 1 sub ]
then ;  she  ; suddenly ; remarked
 S   ;   D   ;    D     ;    D    
 th  ; <eps> ;  <eps>   ;  <eps>  
================================================================================
4446-2275-0039, %WER 80.00 [ 4 / 5, 0 ins, 4 del, 0 sub ]
i ;  must ;  know ; about ;  you 
= ;   D   ;   D   ;   D   ;   D  
i ; <eps> ; <eps> ; <eps> ; <eps>
================================================================================
5683-32879-0025, %WER 100.00 [ 4 / 4, 0 ins, 3 del, 1 sub ]
thank ;  you  ; dorcas ;  dear
  S   ;   D   ;   D    ;   D  
  th  ; <eps> ; <eps>  ; <eps>
================================================================================
...
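For reference, the S/=/D/I markers and %WER lines above come from a standard Levenshtein (edit-distance) alignment between reference and hypothesis words. A minimal illustrative sketch (not SpeechBrain's actual implementation) of how such counts are derived:

```python
def wer_details(ref: str, hyp: str):
    """Return (%WER, errors, n_ref_words, ins, dele, sub) for two transcripts."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = minimal edit cost between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(1, len(r) + 1):
        dp[i][0] = i
    for j in range(1, len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub_cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + sub_cost)
    # backtrack to count operation types (ins, del, sub)
    i, j, ins, dele, sub = len(r), len(h), 0, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (0 if r[i - 1] == h[j - 1] else 1):
            sub += 0 if r[i - 1] == h[j - 1] else 1
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            dele += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    errors = ins + dele + sub
    return 100.0 * errors / max(len(r), 1), errors, len(r), ins, dele, sub
```

Applied to the log entries above, e.g. `wer_details("there is no fear of that sir", "th")` yields %WER 100.00 with 0 insertions, 6 deletions, and 1 substitution, matching the counts reported for utterance 5105-28241-0007.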

Additional Context

The loss decreases really fast within the first epoch (from around 1.9 to 0.3) and then remains stable.

Hello @FrancescoBonzi,

Sorry for the issue! There's an ongoing refactoring of Whisper fine-tuning here: #2450

I fixed some issues, and you can now train a Whisper model and obtain competitive results (in this case, I went from 2.07% WER to 1.72%).

I am still working on this and there will be some slight changes, but if you want you can use this pull request as the basis for your work. Sorry again.

Amazing super fast answer!! Thank you very much, I'll go on with this PR. Does it support also fine-tuning with timestamps?

Unfortunately, not yet, but I plan to add it soon. Basically, I'm improving our Whisper interface a lot so that we support everything (flash attention, KV cache, prompting, etc.). I might also add this feature if there's strong demand from the community.

NOTE: as I said, this PR is subject to change. I am still working heavily on it, but I got some good numbers. I haven't cleaned everything up, so you might have to change some paths in the YAML, etc. Sorry for the mess; as it is a draft PR, there are still some ongoing changes.

I see that there is a lack of material on fine-tuning Whisper with timestamps; maybe this repo, but it seems no longer maintained. In general, I think timestamps are an essential feature for any strong and reliable new version of Whisper, and SpeechBrain could be the right place to find it. I'm really interested in it: we're trying to fine-tune Whisper on song lyrics! If you need a hand, I can try to help you.

I would say it would be a lovely feature and a great help if you could contribute it! I was also looking at that repo, but I don't know whether its implementation gives good results; maybe you could explore this and let me know? I haven't spent enough time understanding how it works. To be honest, I still have some trouble understanding how the alignment is performed through tokens (e.g., how can you say that the word "hello" is at frames 2 to 5 using only a textual representation? If that is the case, then maybe Whisper is not that good at alignment, but I need to explore a bit more.)

Here I see that the authors trained the model using a precision of 0.02 seconds (1501 special tokens from 0.0 s to 30.0 s) and treated these tokens like all the others, using one-hot labels. I think that at inference time Whisper predicts sentence-level timestamps and uses DTW to predict word-level timestamps.
While here, the authors explain how they prepared the dataset for training with timestamps.
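To make the scheme concrete, here is a small sketch of the time-to-token mapping described above (helper names are hypothetical; in Whisper's vocabulary the 1501 timestamp tokens <|0.00|> through <|30.00|> sit after the regular tokens and are trained with one-hot labels like any other token):

```python
PRECISION = 0.02   # seconds per timestamp step
MAX_TIME = 30.0    # Whisper decodes 30-second windows

# 1501 timestamp tokens: 0.00 s, 0.02 s, ..., 30.00 s
N_TIMESTAMP_TOKENS = round(MAX_TIME / PRECISION) + 1

def time_to_offset(t: float) -> int:
    """Map a time in seconds to the offset of its timestamp token."""
    if not 0.0 <= t <= MAX_TIME:
        raise ValueError("timestamps must lie in [0, 30] seconds")
    return round(t / PRECISION)

def offset_to_time(offset: int) -> float:
    """Inverse mapping: timestamp-token offset back to seconds."""
    return offset * PRECISION
```

So a timestamp token's id is just a fixed base id plus this offset, which is why the model can treat timestamps with ordinary cross-entropy training.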

I'll check this repo in the next few days.

Okay! I'm currently training some CommonVoice Whisper models (large and small, on French and Italian; I'll maybe try English as well). I will keep you posted on the results, but so far I've got some good numbers. I don't know yet whether I will add timestamp support in this PR; I think I will focus on having strong baselines plus adding support for long-form ASR / prompting. To be honest, I don't know whether timestamps would require a crazy amount of time to add. If you want, you could open a PR on that? I will review it, of course, and it could be a nice thing to add to SpeechBrain :)

Okay, I think the code here is a good starting point for training with timestamps, but it needs some improvements to work with multi-GPU setups and to be more flexible. I may build upon your code once it is completed, to also support timestamps.

Hello, we've merged the new Whisper PR, which fixes a bunch of issues :)

Feel free to git clone the latest SpeechBrain version in order to use it :)

Thanks again for reporting this issue.