sony/ai-research-code

【NVC-Net】How many epochs will the model converge?

Charlottecuc opened this issue · 11 comments

E.g., for the VCTK dataset.

Besides, have you tested whether the model is robust to noisy source files at inference time (e.g. audio recorded on a mobile phone, with background noise from air conditioning, or with heavy breathing, all of which are quite common in real-life applications)?

Thank you very much

Yes, that would be interesting to see. However, we haven't tested the model on noisy audio.

On the VCTK dataset, we trained for 400 epochs.

@TE-BacNguyenCong Hi, would it be possible to share the losses of your model (at about 400 epochs)? Thank you very much

@TE-BacNguyenCong Besides, I'm using 8 V100 cards to train the default model, but the GPU utilization is quite low (8% on average, about 4 or 5 hours per epoch). Have you also encountered this problem?

@TE-BacNguyenCong Is setting with_memory_cache and with_file_cache to True a good idea for speeding up the training process?

Besides, I'm using 8 V100 cards to train the default model, but the GPU utilization is quite low (8% on average, about 4 or 5 hours per epoch). Have you also encountered this problem?

This is strange. We used 4 V100 GPUs and training took around 15 minutes per epoch. I guess the overhead could come from I/O operations (reading files, etc.).
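A back-of-the-envelope check on the figures quoted in this thread (400 epochs, 15 minutes per epoch versus 4 to 5 hours per epoch) shows how large the gap is. This is only a sketch: the 4.5-hour value is the midpoint of the reported range, chosen here for illustration, and `total_hours` is a hypothetical helper.

```python
EPOCHS = 400  # number of VCTK training epochs stated earlier in the thread


def total_hours(minutes_per_epoch, epochs=EPOCHS):
    """Total wall-clock training time in hours."""
    return minutes_per_epoch * epochs / 60.0


# About 15 minutes per epoch, as reported for 4 V100 GPUs.
fast_hours = total_hours(15)

# 4.5 hours per epoch: assumed midpoint of the reported "4 or 5 hours".
slow_hours = total_hours(4.5 * 60)

print(f"15 min/epoch: {fast_hours:.0f} h")
print(f"4.5 h/epoch:  {slow_hours:.0f} h ({slow_hours / 24:.0f} days)")
```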

Is setting with_memory_cache and with_file_cache to True a good idea for speeding up the training process?

No, because inputs are segments randomly sampled per iteration and we don't want to have the same segments all the time.
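The point in the reply above can be illustrated with a minimal sketch of per-iteration random cropping. This is not NVC-Net's actual data pipeline: `sample_segment`, the segment length of 100, and the list standing in for a waveform are all hypothetical.

```python
import random


def sample_segment(waveform, segment_length, rng):
    """Return (start, segment) for a random fixed-length crop of the waveform."""
    if len(waveform) < segment_length:
        raise ValueError("waveform is shorter than the requested segment")
    # random.Random.randint is inclusive on both ends.
    start = rng.randint(0, len(waveform) - segment_length)
    return start, waveform[start:start + segment_length]


rng = random.Random(0)
waveform = list(range(1000))  # stand-in for audio samples

# Each call draws a new crop, so every iteration sees a different segment.
starts = [sample_segment(waveform, 100, rng)[0] for _ in range(20)]
print(sorted(set(starts)))
```

If the cropped segments themselves were cached, later epochs would replay the same crops, which is the situation the reply wants to avoid.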

I checked and found that the average data loading time is only about 0.001s, but the forward and backward passes are time-consuming:

Average time per batch in seconds (batch size 4, V100 16G):
dataloading_time: 0.00104
train_discriminator_forward: 2.99111
train_discriminator_backward: 1.34156
train_generator_forward: 5.534698
train_generator_backward: 10.752684
total_average_time_per_batch: 20.924966
Could you suggest any ways to increase the training speed? Thank you very much @TE-BacNguyenCong
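For anyone who wants to collect a similar per-phase breakdown, here is a minimal sketch of a wall-clock timer. It is not the code used to produce the numbers above; `PhaseTimer` is a hypothetical helper and the `time.sleep` calls are stand-ins for the real training steps. Note that GPU work is often launched asynchronously, so without a device synchronization before each timestamp the time may be attributed to the wrong phase.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates wall-clock time per named phase (hypothetical helper)."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def averages(self):
        """Average seconds per call for every phase seen so far."""
        return {n: self.totals[n] / self.counts[n] for n in self.totals}


timer = PhaseTimer()
for _ in range(3):
    # The sleeps below are placeholders for the actual training steps.
    with timer.phase("dataloading_time"):
        time.sleep(0.001)
    with timer.phase("train_generator_forward"):
        time.sleep(0.002)
    with timer.phase("train_generator_backward"):
        time.sleep(0.003)

for name, seconds in timer.averages().items():
    print(f"{name}: {seconds:.5f}")
```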

I also tested the speed on eight new 32G V100 cards (batch size 10, default NVC-Net code, default VCTK dataset, default nnabla CUDA Docker image). The average times per batch were:
dataloading_time: 0.0016777515411376953
train_discriminator_forward: 1.6020491123199463
train_discriminator_backward: 0.7688419818878174
train_generator_forward: 2.8338851928710938
train_generator_backward: 6.0216124057769775
total_average_time_per_batch: 12.376032829284668

It seems that the speed is also very slow.
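To see where the time goes, the two sets of timings posted above can be broken down into per-phase shares. The sketch below only re-uses the numbers from this thread; the function name `breakdown` and the run labels are made up for illustration.

```python
def breakdown(phases, total):
    """Return (share of total per phase, time not covered by any phase)."""
    shares = {name: seconds / total for name, seconds in phases.items()}
    unaccounted = total - sum(phases.values())
    return shares, unaccounted


# Timings in seconds per batch, copied from the two posts above.
run_16g = {
    "dataloading_time": 0.00104,
    "train_discriminator_forward": 2.99111,
    "train_discriminator_backward": 1.34156,
    "train_generator_forward": 5.534698,
    "train_generator_backward": 10.752684,
}
run_32g = {
    "dataloading_time": 0.0016777515411376953,
    "train_discriminator_forward": 1.6020491123199463,
    "train_discriminator_backward": 0.7688419818878174,
    "train_generator_forward": 2.8338851928710938,
    "train_generator_backward": 6.0216124057769775,
}

shares_16g, rest_16g = breakdown(run_16g, total=20.924966)
shares_32g, rest_32g = breakdown(run_32g, total=12.376032829284668)

for label, shares in (("16G, batch 4", shares_16g), ("32G, batch 10", shares_32g)):
    print(label)
    for name, share in shares.items():
        print(f"  {name}: {share:.1%}")
```

In both runs the generator backward pass accounts for roughly half of each step and data loading is negligible, which suggests the slowdown is in GPU computation rather than I/O.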

Solved after upgrading the driver.
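Since the fix reported here was a driver upgrade, it may help to record the driver version and GPU utilization when comparing setups. The sketch below assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` CSV output; `parse_gpu_report` and `query_gpus` are hypothetical helpers, and the sample text is made up purely to show the expected shape of the output.

```python
import shutil
import subprocess

QUERY_FIELDS = ["index", "name", "driver_version", "utilization.gpu"]


def parse_gpu_report(csv_text):
    """Parse 'csv,noheader,nounits' output into one dict per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) == len(QUERY_FIELDS):
            rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def query_gpus():
    """Query nvidia-smi; returns an empty list if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    except (subprocess.CalledProcessError, OSError):
        return []
    return parse_gpu_report(result.stdout)


# Made-up sample output, for illustration only (not taken from this thread).
SAMPLE = (
    "0, Tesla V100-SXM2-32GB, 450.80.02, 8\n"
    "1, Tesla V100-SXM2-32GB, 450.80.02, 7\n"
)

for gpu in query_gpus():
    print(f"GPU {gpu['index']} ({gpu['name']}): "
          f"driver {gpu['driver_version']}, {gpu['utilization.gpu']}% util")
```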

@Charlottecuc I have the same problem, about 4 or 5 hours per epoch. Could you tell me how you solved it? My CUDA version is 11.0.

@Charlottecuc Have you trained this model? Does it reproduce the results on the demo page?