neonbjb/DL-Art-School

How many steps should I train for? 1 speaker, ~2 hours of data (691 wavs), and training is pretty fast


Here is my config, but I have no idea how many steps are appropriate.
It is saving checkpoints pretty fast, too.
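
For reference, the headline numbers fall straight out of this config and the startup log below: the 691 wavs are split into 552 training elements and 139 validation elements, and with a batch size of 138 one epoch is exactly 4 iterations, so the configured niter of 50,000 works out to 12,500 epochs. A minimal sketch of that arithmetic (plain Python, no DLAS imports; the variable names are mine):

import math

# Values taken from the config and the startup log below.
train_elements = 552   # "Number of training data elements: 552"
val_elements = 139     # "Number of val images in [test1]: 139"  (552 + 139 = 691 wavs)
batch_size = 138       # datasets.train.batch_size
niter = 50_000         # train.niter

iters_per_epoch = math.ceil(train_elements / batch_size)  # 4, matches "iters: 4"
epochs = niter / iters_per_epoch                          # 12,500, matches the log

print(f"{iters_per_epoch} iters/epoch -> {epochs:,.0f} epochs for {niter:,} iters")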

Environment name is set as "DLAS" as per environment.yaml
anaconda3/miniconda3 detected in C:\Users\King\miniconda3
Starting conda environment "DLAS" from C:\Users\King\miniconda3
Latest git hash: 5ab4d9e
Disabled distributed training.

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link
================================================================================
CUDA SETUP: Loading binary C:\Users\King\miniconda3\envs\DLAS\lib\site-packages\bitsandbytes\libbitsandbytes_cuda116.dll...
23-05-01 03:03:10.550 - INFO:   name: test1
  model: extensibletrainer
  scale: 1
  gpu_ids: [0]
  start_step: -1
  checkpointing_enabled: True
  fp16: False
  use_8bit: True
  wandb: False
  use_tb_logger: True
  datasets:[
    train:[
      name: test1
      n_workers: 8
      batch_size: 138
      mode: paired_voice_audio
      path: F:/ozen-toolkit/output/test\train.txt
      fetcher_mode: ['lj']
      phase: train
      max_wav_length: 255995
      max_text_length: 200
      sample_rate: 22050
      load_conditioning: True
      num_conditioning_candidates: 2
      conditioning_length: 44000
      use_bpe_tokenizer: True
      load_aligned_codes: False
      data_type: img
    ]
    val:[
      name: test1
      n_workers: 1
      batch_size: 139
      mode: paired_voice_audio
      path: F:/ozen-toolkit/output/test\valid.txt
      fetcher_mode: ['lj']
      phase: val
      max_wav_length: 255995
      max_text_length: 200
      sample_rate: 22050
      load_conditioning: True
      num_conditioning_candidates: 2
      conditioning_length: 44000
      use_bpe_tokenizer: True
      load_aligned_codes: False
      data_type: img
    ]
  ]
  steps:[
    gpt_train:[
      training: gpt
      loss_log_buffer: 500
      optimizer: adamw
      optimizer_params:[
        lr: 1e-05
        triton: False
        weight_decay: 0.01
        beta1: 0.9
        beta2: 0.96
      ]
      clip_grad_eps: 4
      injectors:[
        paired_to_mel:[
          type: torch_mel_spectrogram
          mel_norm_file: ../experiments/clips_mel_norms.pth
          in: wav
          out: paired_mel
        ]
        paired_cond_to_mel:[
          type: for_each
          subtype: torch_mel_spectrogram
          mel_norm_file: ../experiments/clips_mel_norms.pth
          in: conditioning
          out: paired_conditioning_mel
        ]
        to_codes:[
          type: discrete_token
          in: paired_mel
          out: paired_mel_codes
          dvae_config: ../experiments/train_diffusion_vocoder_22k_level.yml
        ]
        paired_fwd_text:[
          type: generator
          generator: gpt
          in: ['paired_conditioning_mel', 'padded_text', 'text_lengths', 'paired_mel_codes', 'wav_lengths']
          out: ['loss_text_ce', 'loss_mel_ce', 'logits']
        ]
      ]
      losses:[
        text_ce:[
          type: direct
          weight: 0.01
          key: loss_text_ce
        ]
        mel_ce:[
          type: direct
          weight: 1
          key: loss_mel_ce
        ]
      ]
    ]
  ]
  networks:[
    gpt:[
      type: generator
      which_model_G: unified_voice2
      kwargs:[
        layers: 30
        model_dim: 1024
        heads: 16
        max_text_tokens: 402
        max_mel_tokens: 604
        max_conditioning_inputs: 2
        mel_length_compression: 1024
        number_text_tokens: 256
        number_mel_codes: 8194
        start_mel_token: 8192
        stop_mel_token: 8193
        start_text_token: 255
        train_solo_embeddings: False
        use_mel_codes_as_input: True
        checkpointing: True
      ]
    ]
  ]
  path:[
    pretrain_model_gpt: ../experiments/autoregressive.pth
    strict_load: True
    root: F:\DL-Art-School
    experiments_root: F:\DL-Art-School\experiments\test1
    models: F:\DL-Art-School\experiments\test1\models
    training_state: F:\DL-Art-School\experiments\test1\training_state
    log: F:\DL-Art-School\experiments\test1
    val_images: F:\DL-Art-School\experiments\test1\val_images
  ]
  train:[
    niter: 50000
    warmup_iter: -1
    mega_batch_factor: 4
    val_freq: 500
    default_lr_scheme: MultiStepLR
    gen_lr_steps: [400, 800, 1120, 1440]
    lr_gamma: 0.5
    ema_enabled: False
    manual_seed: 1337
  ]
  eval:[
    output_state: gen
    injectors:[
      gen_inj_eval:[
        type: generator
        generator: generator
        in: hq
        out: ['gen', 'codebook_commitment_loss']
      ]
    ]
  ]
  logger:[
    print_freq: 100
    save_checkpoint_freq: 500
    visuals: ['gen', 'mel']
    visual_debug_rate: 500
    is_mel_spectrogram: True
    disable_state_saving: False
  ]
  upgrades:[
    number_of_checkpoints_to_save: 0
    number_of_states_to_save: 0
  ]
  is_train: True
  dist: False
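
One detail worth noting in the train: block above: with default_lr_scheme MultiStepLR, gen_lr_steps [400, 800, 1120, 1440], and lr_gamma 0.5, the learning rate halves at each milestone, so by iteration 1440 it has decayed to 1/16 of the initial 1e-05 (6.25e-07). A rough sketch of the equivalent stock PyTorch scheduler, assuming (as the step counts and the lr_scheduler warning later in the log suggest) it is stepped once per iteration; the dummy parameter and loop are placeholders, not the actual DLAS wiring:

import torch

# Dummy parameter standing in for the gpt network's weights.
params = [torch.nn.Parameter(torch.zeros(1))]
opt = torch.optim.AdamW(params, lr=1e-5, betas=(0.9, 0.96), weight_decay=0.01)

# Mirrors gen_lr_steps / lr_gamma from the config above.
sched = torch.optim.lr_scheduler.MultiStepLR(
    opt, milestones=[400, 800, 1120, 1440], gamma=0.5)

for it in range(1, 1501):
    opt.step()   # the real training step would happen here
    sched.step()
    if it in (400, 800, 1120, 1440):
        print(f"iter {it}: lr = {sched.get_last_lr()[0]:.2e}")
# iter 400: lr = 5.00e-06 ... iter 1440: lr = 6.25e-07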

23-05-01 03:03:10.734 - INFO: Random seed: 1337
23-05-01 03:03:16.202 - INFO: Number of training data elements: 552, iters: 4
23-05-01 03:03:16.203 - INFO: Total epochs needed: 12500 for iters 50,000
23-05-01 03:03:16.206 - INFO: Number of val images in [test1]: 139
C:\Users\King\miniconda3\envs\DLAS\lib\site-packages\transformers\configuration_utils.py:379: UserWarning: Passing `gradient_checkpointing` to a config initialization is deprecated and will be removed in v5 Transformers. Using `model.gradient_checkpointing_enable()` instead, or if you are using the `Trainer` API, pass `gradient_checkpointing=True` in your `TrainingArguments`.
  warnings.warn(
Loading from ../experiments/dvae.pth
23-05-01 03:03:24.183 - INFO: Network gpt structure: DataParallel, with parameters: 421,526,786
23-05-01 03:03:24.183 - INFO: UnifiedVoice(
  (conditioning_encoder): ConditioningEncoder(
    (init): Conv1d(80, 1024, kernel_size=(1,), stride=(1,))
    (attn): Sequential(
      (0): AttentionBlock(
        (norm): GroupNorm32(32, 1024, eps=1e-05, affine=True)
        (qkv): Conv1d(1024, 3072, kernel_size=(1,), stride=(1,))
        (attention): QKVAttentionLegacy()
        (x_proj): Identity()
        (proj_out): Conv1d(1024, 1024, kernel_size=(1,), stride=(1,))
      )
      (1)-(5): 5 more AttentionBlock modules, identical to (0) above
    )
  )
  (text_embedding): Embedding(256, 1024)
  (mel_embedding): Embedding(8194, 1024)
  (gpt): GPT2Model(
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): GPT2Block(
        (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1)-(29): 29 more GPT2Block modules, identical to (0) above
    )
    (ln_f): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  )
  (mel_pos_embedding): LearnedPositionEmbeddings(
    (emb): Embedding(608, 1024)
  )
  (text_pos_embedding): LearnedPositionEmbeddings(
    (emb): Embedding(404, 1024)
  )
  (final_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  (text_head): Linear(in_features=1024, out_features=256, bias=True)
  (mel_head): Linear(in_features=1024, out_features=8194, bias=True)
)
23-05-01 03:03:24.188 - INFO: Loading model for [../experiments/autoregressive.pth]
23-05-01 03:03:25.660 - INFO: Start training from epoch: 0, iter: -1
  0%|                                                                                            | 0/4 [00:00<?, ?it/s]C:\Users\King\miniconda3\envs\DLAS\lib\site-packages\torch\optim\lr_scheduler.py:138: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
23-05-01 03:04:11.737 - INFO: [epoch:  0, iter:       0, lr:(1.000e-05,1.000e-05,)] step: 0.0000e+00 samples: 1.3800e+02 megasamples: 1.3800e-04 iteration_rate: 8.9322e-02 loss_text_ce: 3.8859e+00 loss_mel_ce: 3.5641e+00 loss_gpt_total: 3.6030e+00 grad_scaler_scale: 1.0000e+00 learning_rate_gpt_0: 1.0000e-05 learning_rate_gpt_1: 1.0000e-05 total_samples_loaded: 1.3800e+02 percent_skipped_samples: 0.0000e+00 percent_conditioning_is_self: 1.0000e+00 gpt_conditioning_encoder: 7.4988e+00 gpt_gpt: 4.9755e+00 gpt_heads: 3.7014e+00
23-05-01 03:04:11.737 - INFO: Saving models and training states.
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [01:18<00:00, 19.70s/it]
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [01:15<00:00, 18.81s/it]
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [01:13<00:00, 18.49s/it]
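
On the checkpoints appearing quickly: with logger.save_checkpoint_freq: 500 and only 4 iterations per epoch, a model plus training-state pair is written every 500 iterations (125 epochs), including the initial save visible at iter 0, and at 421,526,786 parameters each fp32 model snapshot alone is about 1.6 GB. A back-of-envelope sketch (assuming 4 bytes per parameter; the training_state file with AdamW's moment buffers will be larger, which I have not measured):

# Back-of-envelope checkpoint sizing from the logged parameter count.
n_params = 421_526_786   # "Network gpt structure: ..., with parameters: 421,526,786"
bytes_per_param = 4      # fp32 assumption
save_freq = 500          # logger.save_checkpoint_freq
iters_per_epoch = 4      # from the startup log

model_gb = n_params * bytes_per_param / 1024**3
print(f"~{model_gb:.2f} GB per model snapshot, "
      f"one every {save_freq} iters ({save_freq // iters_per_epoch} epochs)")
# ~1.57 GB per model snapshot, one every 500 iters (125 epochs)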