neonbjb/DL-Art-School

Injector for Autoregressive decoder

Closed this issue · 4 comments

faad3 commented

Hello! I found the GptVoiceLatentInjector. It looks like it is meant for inference, though it would not be difficult to return losses from it. Still, I wanted to clarify which injector you used for training, because I also found the GptTtsDataset, which returns quantized_mels (as I understand it, mels pre-processed by the dVAE), whereas GptVoiceLatentInjector takes a wav as input.

neonbjb commented

What model are you looking to train?

To save time, here is the full step configuration for the AR model:

steps:        
  gpt_train:
    training: gpt
    loss_log_buffer: 500

    # Generally follows the recipe from the DALLE paper.
    optimizer: adamw_zero
    optimizer_params:
      lr: !!float 1e-4
      weight_decay: !!float 1e-2
      beta1: 0.9
      beta2: 0.96
    clip_grad_eps: 4

    injectors:    
      paired_to_mel:
        type: torch_mel_spectrogram
        mel_norm_file: ../experiments/clips_mel_norms.pth
        in: wav
        out: paired_mel
      paired_cond_to_mel:
        type: for_each
        subtype: torch_mel_spectrogram
        mel_norm_file: ../experiments/clips_mel_norms.pth
        in: conditioning
        out: paired_conditioning_mel
      to_codes:
        type: discrete_token
        in: paired_mel
        out: paired_mel_codes
      paired_fwd_text:
        type: generator
        generator: gpt
        in: [paired_conditioning_mel, padded_text, text_lengths, paired_mel_codes, wav_lengths]
        out: [loss_text_ce, loss_mel_ce, logits]      
    losses:
      text_ce:
        type: direct
        weight: .01
        key: loss_text_ce
      mel_ce:
        type: direct
        weight: 1
        key: loss_mel_ce
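
For orientation, the AR step above roughly corresponds to a training loop like the one below. This is only a minimal sketch: gpt, dvae and mel_spectrogram are hypothetical stand-ins for the modules the DLAS injectors wire together, not the actual DLAS API.

import torch

def ar_train_step(gpt, dvae, mel_spectrogram, optimizer, batch):
    # paired_to_mel / paired_cond_to_mel: waveforms -> normalized mel spectrograms
    paired_mel = mel_spectrogram(batch['wav'])
    paired_conditioning_mel = torch.stack(
        [mel_spectrogram(clip) for clip in batch['conditioning']], dim=1)

    # to_codes: quantize the target mel into discrete tokens with the frozen dVAE
    with torch.no_grad():
        paired_mel_codes = dvae(paired_mel)

    # paired_fwd_text: the GPT returns both cross-entropy losses directly
    loss_text_ce, loss_mel_ce, logits = gpt(
        paired_conditioning_mel, batch['padded_text'], batch['text_lengths'],
        paired_mel_codes, batch['wav_lengths'])

    # losses: text_ce is weighted 0.01, mel_ce is weighted 1
    loss = 0.01 * loss_text_ce + 1.0 * loss_mel_ce
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(gpt.parameters(), 4.0)  # clip_grad_eps: 4
    optimizer.step()
    return loss.item()

# The optimizer roughly matches the adamw_zero settings above:
# optimizer = torch.optim.AdamW(gpt.parameters(), lr=1e-4,
#                               weight_decay=1e-2, betas=(0.9, 0.96))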

And here it is for the diffusion model:


steps:
  generator:
    training: generator
    loss_log_buffer: 2000
    step_outputs: [loss]

    optimizer: adamw
    optimizer_params:
      lr: !!float 1e-4
      weight_decay: 0.001
      beta1: 0.9
      beta2: 0.999
    clip_grad_eps: 1.0

    injectors:
      to_mel:
        type: torch_mel_spectrogram
        mel_norm_file: ../experiments/clips_mel_norms.pth
        in: wav
        out: mel
      resample_wav:
        type: audio_resample
        in: wav
        out: wav_for_vocoder
        input_sample_rate: 22050
        output_sample_rate: 24000
      tacotron_mel:
        type: mel_spectrogram
        mel_fmax: 12000
        sampling_rate: 24000
        n_mel_channels: 100
        # Only normalize the MEL target, because the diffuser specifically cares about it.
        do_normalization: true
        in: wav_for_vocoder
        out: target_mel
      resample_cond:
        type: for_each
        subtype: audio_resample
        input_sample_rate: 22050
        output_sample_rate: 24000
        in: conditioning
        out: conditioning_for_vocoder
      cond_to_mel:
        type: for_each
        subtype: mel_spectrogram
        mel_fmax: 12000
        sampling_rate: 24000
        n_mel_channels: 100
        in: conditioning_for_vocoder
        out: cond_mel
      produce_latents:
        type: gpt_voice_latent
        gpt_path: ../experiments/finetune_gpt_unified_large_kennedy/models/800_gpt_ema.pth
        in: wav
        conditioning_clip: conditioning
        text: padded_text
        text_lengths: text_lengths
        input_lengths: wav_lengths
        out: gpt_latent
      diffusion:
        type: gaussian_diffusion
        in: target_mel
        generator: generator
        beta_schedule:
          schedule_name: linear
          num_diffusion_timesteps: 4000
        diffusion_args:
          model_mean_type: epsilon
          model_var_type: learned_range
          loss_type: mse
        sampler_type: uniform
        model_input_keys:
          aligned_conditioning: gpt_latent
          conditioning_input: cond_mel
          return_code_pred: true
        extra_model_output_keys: [mel_pred]
        out: loss
        out_key_vb_loss: vb_loss
        out_key_x_start: x_start_pred
    losses:
      diffusion_loss:
        after: 500
        type: direct
        weight: 1
        key: loss
      var_loss:
        after: 500
        type: direct
        weight: 1
        key: vb_loss
      mel_surrogate:
        type: pix
        weight: 1
        criterion: l2
        real: target_mel
        fake: mel_pred
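
And, very roughly, the diffusion step maps onto a loop like the one below. Again, this is only a sketch: gpt, diffusion_net, diffuser, mel_24k and the tensor shapes are hypothetical stand-ins for the corresponding DLAS components, and the diffusion math is abbreviated to plain epsilon-prediction MSE rather than the full gaussian_diffusion injector.

import torch
import torchaudio

resample = torchaudio.transforms.Resample(orig_freq=22050, new_freq=24000)

def diffusion_train_step(gpt, diffusion_net, diffuser, mel_24k, optimizer, batch):
    # produce_latents: a frozen, EMA copy of the AR model turns
    # (conditioning, text, wav) into the latent that conditions the diffuser
    with torch.no_grad():
        gpt_latent = gpt(batch['conditioning'], batch['padded_text'],
                         batch['text_lengths'], batch['wav'], batch['wav_lengths'])

    # resample_wav / tacotron_mel: 22.05 kHz audio -> 24 kHz -> 100-bin target mel
    wav_for_vocoder = resample(batch['wav'])
    target_mel = mel_24k(wav_for_vocoder)
    cond_mel = mel_24k(resample(batch['conditioning'][:, 0]))  # one conditioning clip

    # diffusion: sample timesteps on the 4000-step linear schedule, noise the
    # target, and ask the network to predict the noise (model_mean_type: epsilon)
    t = torch.randint(0, 4000, (target_mel.shape[0],), device=target_mel.device)
    noise = torch.randn_like(target_mel)
    x_t = diffuser.q_sample(target_mel, t, noise)
    eps_pred, var_pred, mel_pred = diffusion_net(
        x_t, t, aligned_conditioning=gpt_latent, conditioning_input=cond_mel,
        return_code_pred=True)

    mse = torch.nn.functional.mse_loss(eps_pred, noise)             # "loss"
    vb = diffuser.vb_term(var_pred, target_mel, x_t, t)             # "vb_loss"
    surrogate = torch.nn.functional.mse_loss(mel_pred, target_mel)  # mel_surrogate

    total = mse + vb + surrogate   # all three carry weight 1 in the config
    optimizer.zero_grad()
    total.backward()
    torch.nn.utils.clip_grad_norm_(diffusion_net.parameters(), 1.0)
    optimizer.step()
    return total.item()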

Note that I have not provided (and will not provide) some of the things you would need to make these step configs work, notably the dVAE. There are also some "weird" things, like the transition from 22 kHz to 24 kHz audio, which were driven by my late-stage decision to use a vocoder.
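
For the 22 kHz to 24 kHz hop specifically, a plain torchaudio resampler does the job (assuming torchaudio is available; this is not the audio_resample injector itself):

import torchaudio

wav_22k, sr = torchaudio.load('clip.wav')   # hypothetical 22.05 kHz input clip
resampler = torchaudio.transforms.Resample(orig_freq=22050, new_freq=24000)
wav_for_vocoder = resampler(wav_22k)        # what the resample_wav injector produces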

I strongly recommend that you do not actually attempt to train your models with DLAS. It is a very rough sandbox that I built and maintain for my personal use, and it is not going to be fun to get working for someone else. I would highly recommend doing your training in something better supported, like PyTorch Lightning or FairScale. Hopefully the above configs can help you decipher what the pipeline and loss structure look like.

faad3 commented

wow thanks!

@neonbjb Hi, would you mind sharing the config file for the CLVP model? Thanks a lot.