speechbrain/speechbrain

Cannot reproduce DPRNN results on WSJ0-2Mix (Speech Separation)

SevKod opened this issue · 6 comments

Describe the bug

Currently trying to reproduce the state-of-the-art results for the DPRNN model on WSJ0-2Mix using the provided recipe (https://github.com/speechbrain/speechbrain/blob/341e35c3bc0c2ff3b9c6257c7bb231a23891df30/recipes/WSJ0Mix/separation/hparams/dprnn.yaml). According to the training logs provided for this experiment (here: https://www.dropbox.com/sh/o8fohu5s07h4bnw/AADQO9_Nbd8O1VqG2DNVNp-fa/1234?dl=0&preview=train_log.txt&subfolder_nav_tracking=1), I should be obtaining a Si-SNRi of -18.5 on the validation set after only 20 epochs!

However, after training it myself, I am stagnating at -15 after almost 30 epochs, as shown in my logs below:

epoch: 1, lr: 1.50e-04 - train si-snr: -5.72e+00 - valid si-snr: -7.85e+00
epoch: 2, lr: 1.50e-04 - train si-snr: -9.33e+00 - valid si-snr: -9.91e+00
epoch: 3, lr: 1.50e-04 - train si-snr: -1.08e+01 - valid si-snr: -1.12e+01
epoch: 4, lr: 1.50e-04 - train si-snr: -1.16e+01 - valid si-snr: -1.20e+01
epoch: 5, lr: 1.50e-04 - train si-snr: -1.22e+01 - valid si-snr: -1.21e+01
epoch: 6, lr: 1.50e-04 - train si-snr: -1.27e+01 - valid si-snr: -1.28e+01
epoch: 7, lr: 1.50e-04 - train si-snr: -1.30e+01 - valid si-snr: -1.31e+01
epoch: 8, lr: 1.50e-04 - train si-snr: -1.33e+01 - valid si-snr: -1.34e+01
epoch: 9, lr: 1.50e-04 - train si-snr: -1.36e+01 - valid si-snr: -1.36e+01
epoch: 10, lr: 1.50e-04 - train si-snr: -1.38e+01 - valid si-snr: -1.37e+01
epoch: 11, lr: 1.50e-04 - train si-snr: -1.39e+01 - valid si-snr: -1.39e+01
epoch: 12, lr: 1.50e-04 - train si-snr: -1.41e+01 - valid si-snr: -1.40e+01
epoch: 13, lr: 1.50e-04 - train si-snr: -1.42e+01 - valid si-snr: -1.39e+01
epoch: 14, lr: 1.50e-04 - train si-snr: -1.44e+01 - valid si-snr: -1.43e+01
epoch: 15, lr: 1.50e-04 - train si-snr: -1.45e+01 - valid si-snr: -1.44e+01
epoch: 16, lr: 1.50e-04 - train si-snr: -1.46e+01 - valid si-snr: -1.44e+01
epoch: 17, lr: 1.50e-04 - train si-snr: -1.47e+01 - valid si-snr: -1.45e+01
epoch: 18, lr: 1.50e-04 - train si-snr: -1.48e+01 - valid si-snr: -1.46e+01
epoch: 19, lr: 1.50e-04 - train si-snr: -1.49e+01 - valid si-snr: -1.47e+01
epoch: 20, lr: 1.50e-04 - train si-snr: -1.50e+01 - valid si-snr: -1.47e+01
epoch: 21, lr: 1.50e-04 - train si-snr: -1.50e+01 - valid si-snr: -1.48e+01
epoch: 22, lr: 1.50e-04 - train si-snr: -1.51e+01 - valid si-snr: -1.49e+01
epoch: 23, lr: 1.50e-04 - train si-snr: -1.52e+01 - valid si-snr: -1.49e+01
epoch: 24, lr: 1.50e-04 - train si-snr: -1.52e+01 - valid si-snr: -1.49e+01
epoch: 25, lr: 1.50e-04 - train si-snr: -1.53e+01 - valid si-snr: -1.50e+01
epoch: 26, lr: 1.50e-04 - train si-snr: -1.54e+01 - valid si-snr: -1.51e+01
epoch: 27, lr: 1.50e-04 - train si-snr: -1.54e+01 - valid si-snr: -1.51e+01

This was done using the dprnn.yaml config file. I do not know why this happens. I also got a similar result (-14) using the Asteroid library. Could someone try training this model and tell me whether they get similar results?

[pip3] numpy==1.26.4
[pip3] torch==2.2.0
[pip3] torchaudio==2.2.0
[pip3] torchvision==0.17.0
[pip3] triton==2.2.0

Expected behaviour

The DPRNN recipe is supposed to reach a Si-SNR of -18.5 after training for 60 epochs, and -18 on validation after 20 epochs. Instead, I plateau at -15 after 27 epochs.
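
For context, the numbers in these logs are the training objective, i.e. the negative of the scale-invariant SNR, so lower is better. A minimal pure-Python sketch of SI-SNR (an illustration of the metric's definition, not the SpeechBrain implementation) looks like:

```python
import math

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between an estimated and a reference signal.

    The target is rescaled by the optimal gain <est, tgt> / ||tgt||^2,
    so the metric is invariant to the overall amplitude of the estimate.
    """
    dot = sum(e * t for e, t in zip(estimate, target))
    tgt_energy = sum(t * t for t in target) + eps
    scale = dot / tgt_energy
    s_target = [scale * t for t in target]                 # projection onto the target
    e_noise = [e - s for e, s in zip(estimate, s_target)]  # residual error
    num = sum(s * s for s in s_target)
    den = sum(n * n for n in e_noise) + eps
    return 10.0 * math.log10(num / den + eps)

clean = [0.1, -0.2, 0.3, -0.1]
# A scaled copy of the target scores very high (scale invariance) ...
print(si_snr([0.2, -0.4, 0.6, -0.2], clean))
# ... while a corrupted estimate scores much lower.
print(si_snr([0.1, 0.2, 0.3, 0.1], clean))
```

Training minimizes -si_snr averaged over permutation-resolved source pairs, which is why a log entry of -15 corresponds to a SI-SNR of 15 dB.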

To Reproduce

No response

Environment Details

No response

Relevant Log Output

No response

Additional Context

No response

It turns out I was using the sphere files from both ".wv1" and ".wv2" to generate the ".wav" files. The WSJ0-2mix results reported in papers all come from the ".wv1" version. I used the Asteroid script "convert_sphere2wav.sh" to convert them to wav. So the results are reproducible!
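
For anyone hitting the same data-preparation issue: a small helper (hypothetical, not part of SpeechBrain or Asteroid) that keeps only the ".wv1" sphere files and builds the corresponding sph2pipe conversion commands could look like this:

```python
import os
import subprocess

def wv1_conversion_commands(sphere_paths, out_dir):
    """Build one `sph2pipe -f wav` command per '.wv1' sphere file.

    '.wv2' files (the second WSJ0 microphone channel) are skipped, since
    published WSJ0-2mix results are built from the '.wv1' recordings only.
    """
    commands = []
    for path in sphere_paths:
        base, ext = os.path.splitext(os.path.basename(path))
        if ext != ".wv1":
            continue  # drop '.wv2' (and anything else)
        out_path = os.path.join(out_dir, base + ".wav")
        commands.append(["sph2pipe", "-f", "wav", path, out_path])
    return commands

# Usage sketch: actually run the conversions (requires sph2pipe on PATH).
# for cmd in wv1_conversion_commands(all_sphere_files, "wsj0_wav"):
#     subprocess.run(cmd, check=True)
```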

Can you please share the training time and which GPU was used? Thank you!

So far, I have reached a Si-SNR of -17.731 dB after approximately 30 hours of training on a single A40 (48 GB), which corresponds to 43 epochs. I decided to use the Asteroid library for the training, though (I got the same training curves in early epochs with both libraries). I used the same training configuration as in SpeechBrain (lr=1.5e-4, batch size=1, N_filters=256), and I did not use Speed Augment or Dynamic Mixing either. There is still some training left, but this is already much better than the previous run (and close to the -18.8 dB of the paper).

Awesome, I'll try with the A100 of Colab Pro, thank you!

Please let me know if you can reproduce the results too 😁

Just a quick update! Training is finished! I used the basic Asteroid recipe (strangely, the SpeechBrain recipe only reached 18.68 dB on the validation set, which was not good enough). I reached -19.9 dB on the validation set, and -19.7 dB on the test set (to compare with the -18.8 dB obtained in the DPRNN article). This was reached after 200 epochs.