kan-bayashi/ParallelWaveGAN

Training time for HiFiGAN on LJSpeech

nellorebhanuteja opened this issue · 17 comments

Hi,

I am training the HiFiGAN vocoder on LJSpeech using the provided recipe.
It has been running for more than a week.

I am using 4 Tesla GPUs with 32 GB of memory.

May I know how much time it took for you?

@kan-bayashi

> May I know how much time it took for you?

You can check the log; it shows the remaining time.

Also, please carefully check this part of the README:

> In the case of distributed training, the batch size will be automatically multiplied by the number of gpus.
> Please be careful.

https://github.com/kan-bayashi/ParallelWaveGAN/blob/master/egs/README.md#:~:text=In%20the%20case%20of%20distributed%20training%2C%20the%20batch%20size%20will%20be%20automatically%20multiplied%20by%20the%20number%20of%20gpus.%0APlease%20be%20careful.

So if you run the config without modification on multiple GPUs, it does not reduce the training time, since we use iteration-based training.
If you want to speed things up, you need to decrease the batch size to the original batch size / #gpus.
Then, 4 V100s should finish within 1 week.
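To make the arithmetic behind this concrete, here is a minimal sketch (the helper function is hypothetical; only the batch-size rule itself comes from the README note quoted above):

```python
# Minimal sketch of the batch-size rule quoted above; the helper function is
# hypothetical and not part of ParallelWaveGAN. In distributed training the
# config's batch_size is per GPU, so the effective batch size is multiplied
# by the number of GPUs.

def effective_batch_size(per_gpu_batch_size: int, n_gpus: int) -> int:
    return per_gpu_batch_size * n_gpus

# Original recipe on 1 GPU: effective batch size 16.
print(effective_batch_size(16, 1))  # 16

# Same config on 4 GPUs without modification: effective batch size 64.
# Each iteration now does 4x the work, and because training runs for a fixed
# number of iterations (train_max_steps), the wall-clock time barely changes.
print(effective_batch_size(16, 4))  # 64

# Dividing the batch size by the number of GPUs (16 / 4 = 4) keeps the
# effective batch size at 16 while splitting each iteration across 4 GPUs,
# which is what actually shortens training.
print(effective_batch_size(4, 4))   # 16
```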

Thanks for the reply, @kan-bayashi.

I have followed your advice: I use 4 GPUs and reduced batch_size from 16 to 4.

However, the training time doesn't seem to have decreased.

The following is my config file, for reference:

```yaml
allow_cache: true
batch_max_steps: 8192
batch_size: 4
config: conf/hifigan.v1.yaml
dev_dumpdir: dump/dev/norm
dev_feats_scp: null
dev_segments: null
dev_wav_scp: null
discriminator_adv_loss_params:
  average_by_discriminators: false
discriminator_grad_norm: -1
discriminator_optimizer_params:
  betas:
  - 0.5
  - 0.9
  lr: 0.0002
  weight_decay: 0.0
discriminator_optimizer_type: Adam
discriminator_params:
  follow_official_norm: true
  period_discriminator_params:
    bias: true
    channels: 32
    downsample_scales:
    - 3
    - 3
    - 3
    - 3
    - 1
    in_channels: 1
    kernel_sizes:
    - 5
    - 3
    max_downsample_channels: 1024
    nonlinear_activation: LeakyReLU
    nonlinear_activation_params:
      negative_slope: 0.1
    out_channels: 1
    use_spectral_norm: false
    use_weight_norm: true
  periods:
  - 2
  - 3
  - 5
  - 7
  - 11
  scale_discriminator_params:
    bias: true
    channels: 128
    downsample_scales:
    - 4
    - 4
    - 4
    - 4
    - 1
    in_channels: 1
    kernel_sizes:
    - 15
    - 41
    - 5
    - 3
    max_downsample_channels: 1024
    max_groups: 16
    nonlinear_activation: LeakyReLU
    nonlinear_activation_params:
      negative_slope: 0.1
    out_channels: 1
  scale_downsample_pooling: AvgPool1d
  scale_downsample_pooling_params:
    kernel_size: 4
    padding: 2
    stride: 2
  scales: 3
discriminator_scheduler_params:
  gamma: 0.5
  milestones:
  - 200000
  - 400000
  - 600000
  - 800000
discriminator_scheduler_type: MultiStepLR
discriminator_train_start_steps: 0
discriminator_type: HiFiGANMultiScaleMultiPeriodDiscriminator
distributed: true
eval_interval_steps: 1000
feat_match_loss_params:
  average_by_discriminators: false
  average_by_layers: false
  include_final_outputs: false
fft_size: 1024
fmax: 7600
fmin: 80
format: hdf5
generator_adv_loss_params:
  average_by_discriminators: false
generator_grad_norm: -1
generator_optimizer_params:
  betas:
  - 0.5
  - 0.9
  lr: 0.0002
  weight_decay: 0.0
generator_optimizer_type: Adam
generator_params:
  bias: true
  channels: 512
  in_channels: 80
  kernel_size: 7
  nonlinear_activation: LeakyReLU
  nonlinear_activation_params:
    negative_slope: 0.1
  out_channels: 1
  resblock_dilations:
  - - 1
    - 3
    - 5
  - - 1
    - 3
    - 5
  - - 1
    - 3
    - 5
  resblock_kernel_sizes:
  - 3
  - 7
  - 11
  upsample_kernel_sizes:
  - 16
  - 16
  - 4
  - 4
  upsample_scales:
  - 8
  - 8
  - 2
  - 2
  use_additional_convs: true
  use_weight_norm: true
generator_scheduler_params:
  gamma: 0.5
  milestones:
  - 200000
  - 400000
  - 600000
  - 800000
generator_scheduler_type: MultiStepLR
generator_train_start_steps: 1
generator_type: HiFiGANGenerator
global_gain_scale: 1.0
hop_size: 256
lambda_adv: 1.0
lambda_aux: 45.0
lambda_feat_match: 2.0
log_interval_steps: 100
mel_loss_params:
  fft_size: 1024
  fmax: 11025
  fmin: 0
  fs: 22050
  hop_size: 256
  log_base: null
  num_mels: 80
  win_length: null
  window: hann
num_mels: 80
num_save_intermediate_results: 4
num_workers: 2
outdir: exp/train_nodev_ljspeech_hifigan.v1
pin_memory: true
pretrain: ''
rank: 3
remove_short_samples: false
resume: ''
sampling_rate: 22050
save_interval_steps: 10000
train_dumpdir: dump/train_nodev/norm
train_feats_scp: null
train_max_steps: 2500000
train_segments: null
train_wav_scp: null
trim_frame_size: 1024
trim_hop_size: 256
trim_silence: false
trim_threshold_in_db: 20
use_feat_match_loss: true
use_mel_loss: true
use_stft_loss: false
verbose: 1
version: 0.5.5
win_length: null
window: hann
world_size: 4
```
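As a quick sanity check that the edit took effect, one could load the config and print the two settings that dominate wall-clock time; a minimal sketch, assuming PyYAML is installed and the config is at the path shown in the config field above:

```python
# Minimal sketch: confirm the two settings that dominate wall-clock time.
# Assumes PyYAML is installed and that the edited config is at the path
# shown in the "config" field of the dump above.
import yaml

with open("conf/hifigan.v1.yaml") as f:
    cfg = yaml.safe_load(f)

print("per-GPU batch_size:", cfg["batch_size"])       # 4 after the edit
print("train_max_steps   :", cfg["train_max_steps"])  # 2500000
```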

@kan-bayashi, can you please answer this question?

Could you give me the logs of both cases you compared?

Logs, as in, should I attach the train.log of both cases?

Yes, please.

log_file.zip

I have attached the log file that was generated after reducing the batch size.
Unfortunately, I have not preserved the train.log from the earlier run with the original setting.

We need to compare with the original setting to check the speed.

[train]:   0%|          | 1390/2500000 [13:20<393:16:14,  1.76it/s]

You can launch the config with batch size 16 and 1 GPU and check the 1.76it/s part.
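For reference, here is one way to pull that figure out of a train.log automatically; a minimal sketch, assuming the tqdm progress format shown in the logs above (the helper is made up, not part of the repo):

```python
# Minimal sketch: pull the most recent "X.XXit/s" figure out of a train.log.
# The helper is made up (not part of the repo) and the regex assumes the
# tqdm progress format shown in the logs above.
import re

def last_iters_per_sec(log_path="train.log"):
    """Return the last reported iterations/second in the log, or None."""
    rate = None
    with open(log_path, errors="ignore") as f:
        for line in f:
            for match in re.findall(r"([\d.]+)it/s", line):
                rate = float(match)
    return rate

print(last_iters_per_sec("train.log"))  # e.g. 1.76
```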

train.log

OK, I ran with the original config on 1 GPU. This is the log:

[train]:   0%|          | 26/2500000 [00:28<650:33:48,  1.07it/s]

It seems the training speed has increased (by about x1.6), so what is your problem?

In both cases, training is estimated to take about 4 weeks.

I used the joint model in espnet-tts and was expecting vocoder training to take about 1 week.

[train]:   0%|          | 14/2500000 [00:12<406:46:23,  1.71it/s]

Your first log shows that the estimated required time is 2 weeks+.
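To spell out the arithmetic, here are rough estimates from the it/s figures in the two logs above:

```python
# Rough wall-clock estimates from the it/s figures in the two logs above;
# 2,500,000 is the train_max_steps value from the config.
total_iters = 2_500_000

for label, iters_per_sec in [("1 GPU, batch_size 16 ", 1.07),
                             ("4 GPUs, batch_size 4 ", 1.76)]:
    days = total_iters / iters_per_sec / 86_400  # seconds per day
    print(f"{label}: ~{days:.1f} days")

# 1 GPU, batch_size 16 : ~27.0 days  (roughly the "4 weeks" estimate)
# 4 GPUs, batch_size 4 : ~16.4 days  (about x1.6 faster, i.e. 2 weeks+)
```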

> I used the joint model in espnet-tts and was expecting vocoder training to take about 1 week.

The number of iterations is totally different.

Right.
So you are saying vocoder training requires more steps than joint model training?

I stopped at 1M iterations since the generated voice is of sufficient quality and 2.5M iterations take too long with my limited GPU resources.
If you have enough GPU resources, it is worthwhile to try longer training.

OK, thanks a lot for patiently answering!