archinetai/audio-diffusion-pytorch-trainer

Adjusting batch_size + dataset in base_test.yaml yields noise

Ericxgao opened this issue · 1 comment

I'm trying to train some models off of some music using the trainer repo, with the following yaml config:

# @package _global_

# Test with length 65536, batch size 300, logger sampling_steps [3]

sampling_rate: 48000
length: 65536
channels: 2
log_every_n_steps: 2000

model:
  _target_: main.module_base.Model
  lr: 1e-4
  lr_beta1: 0.95
  lr_beta2: 0.999
  lr_eps: 1e-6
  lr_weight_decay: 1e-3
  ema_beta: 0.9999
  ema_power: 0.7

  model:
    _target_: audio_diffusion_pytorch.AudioDiffusionModel
    in_channels: ${channels}
    channels: 128
    patch_factor: 16
    patch_blocks: 1
    resnet_groups: 8
    kernel_multiplier_downsample: 2
    multipliers: [1, 2, 4, 4, 4, 4, 4]
    factors: [4, 4, 4, 2, 2, 2]
    num_blocks: [2, 2, 2, 2, 2, 2]
    attentions: [0, 0, 0, 1, 1, 1, 1]
    attention_heads: 8
    attention_features: 64
    attention_multiplier: 2
    use_nearest_upsample: False
    use_skip_scale: True
    use_magnitude_channels: True
    diffusion_sigma_distribution:
      _target_: audio_diffusion_pytorch.UniformDistribution

datamodule:
  _target_: main.module_base.Datamodule
  dataset:
    _target_: audio_data_pytorch.YoutubeDataset
    urls:
      - https://www.youtube.com/watch?v=FrMugs5eits
      - https://www.youtube.com/watch?v=orrwpGhLjJo
      - https://www.youtube.com/watch?v=OYmkDEdO5Ek
      - https://www.youtube.com/watch?v=PgDUaKjIGLQ
      - https://www.youtube.com/watch?v=zXncanvMfhg
      - https://www.youtube.com/watch?v=W0EYGtK-DwE
      - https://www.youtube.com/watch?v=ImgwN3u7Af0
      - https://www.youtube.com/watch?v=ohsLkUlCu3I
      - https://www.youtube.com/watch?v=vuV5DuVqDcw
      - https://www.youtube.com/watch?v=kxi_vU-yJLg
      - https://www.youtube.com/watch?v=-JPMd_NiY10
      - https://www.youtube.com/watch?v=pHzf2FkNCIQ
      - https://www.youtube.com/watch?v=mwpxQLeVKuo
      - https://www.youtube.com/watch?v=WYbc32bQozo
      - https://www.youtube.com/watch?v=LEGRJpOo7Ts
      - https://www.youtube.com/watch?v=IiURF2gxUnc
      - https://www.youtube.com/watch?v=43ZYv36QnVw
    root: ${data_dir}
    crop_length: 12 # crop length in seconds
    transforms:
      _target_: audio_data_pytorch.AllTransform
      source_rate: ${sampling_rate}
      target_rate: ${sampling_rate}
      random_crop_size: ${length}
      loudness: -20
  val_split: 0.01
  batch_size: 300
  num_workers: 8
  pin_memory: True

callbacks:
  rich_progress_bar:
    _target_: pytorch_lightning.callbacks.RichProgressBar

  model_checkpoint:
    _target_: pytorch_lightning.callbacks.ModelCheckpoint
    monitor: "valid_loss"   # name of the logged metric which determines when model is improving
    save_top_k: 1           # save k best models (determined by above metric)
    save_last: True         # additionaly always save model from last epoch
    mode: "min"             # can be "max" or "min"
    verbose: False
    dirpath: ${logs_dir}/ckpts/${now:%Y-%m-%d-%H-%M-%S}
    filename: '{epoch:02d}-{valid_loss:.3f}'

  model_summary:
    _target_: pytorch_lightning.callbacks.RichModelSummary
    max_depth: 2

  audio_samples_logger:
    _target_: main.module_base.SampleLogger
    num_items: 4
    channels: ${channels}
    sampling_rate: ${sampling_rate}
    length: ${length}
    sampling_steps: [3]
    use_ema_model: True
    diffusion_sampler:
      _target_: audio_diffusion_pytorch.VSampler
    diffusion_schedule:
      _target_: audio_diffusion_pytorch.LinearSchedule

loggers:
  wandb:
    _target_: pytorch_lightning.loggers.wandb.WandbLogger
    project: ${oc.env:WANDB_PROJECT}
    entity: ${oc.env:WANDB_ENTITY}
    # offline: False  # set True to store all logs only locally
    job_type: "train"
    group: ""
    save_dir: ${logs_dir}

trainer:
  _target_: pytorch_lightning.Trainer
  gpus: 0 # Set `1` to train on GPU, `0` to train on CPU only, and `-1` to train on all GPUs, default `0`
  precision: 32 # Precision used for tensors, default `32`
  accelerator: null # set to `ddp` so GPUs train individually and sync gradients, default `None`
  min_epochs: 0
  max_epochs: -1
  enable_model_summary: False
  log_every_n_steps: 1 # Logs metrics every N batches
  check_val_every_n_epoch: null
  val_check_interval: ${log_every_n_steps}

This is a modification of the base_test.yaml config file, so I don't think anything should be too far off. I trained for about 4400 epochs over 10 hours.
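For context, the timing math in the config works out as follows (plain arithmetic, no library calls):

sampling_rate = 48000
length = 65536        # random_crop_size from the config
crop_length_s = 12    # crop_length from the dataset config

print(length / sampling_rate)         # ~1.365 s of audio per training example
print(crop_length_s * sampling_rate)  # 576000 samples per 12 s crop, so the random crop fits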

My inference script is as follows:

# @title Download Model 
import torch 
from main.module_base import Model
from audio_diffusion_pytorch import AudioDiffusionModel, UniformDistribution

adm = AudioDiffusionModel(
    in_channels=2,
    channels=128,
    patch_factor=16,
    patch_blocks=1,
    resnet_groups=8,
    kernel_multiplier_downsample=2,
    multipliers=[1, 2, 4, 4, 4, 4, 4],
    factors=[4, 4, 4, 2, 2, 2],
    num_blocks=[2, 2, 2, 2, 2, 2],
    attentions=[0, 0, 0, 1, 1, 1, 1],
    attention_heads=8,
    attention_features=64,
    attention_multiplier=2,
    use_nearest_upsample=False,
    use_skip_scale=True,
    use_magnitude_channels=True,
    diffusion_sigma_distribution=UniformDistribution()  # must be an instance, not the class
)

adm = adm.to('cuda')

model = Model.load_from_checkpoint(
    checkpoint_path='/home/fsuser/audio-diffusion-pytorch-trainer/logs/ckpts/2022-10-20-08-43-18/last.ckpt',
    lr=1e-4,
    lr_beta1=0.95,
    lr_beta2=0.999,
    lr_eps=1e-6,
    lr_weight_decay=1e-3,
    ema_beta=0.9999,
    ema_power=0.7,
    model=adm
)
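As an aside, one way to avoid drift between the training YAML and a hand-built inference model is to instantiate the architecture from the same config with Hydra. A minimal sketch, assuming hydra-core and omegaconf are installed and the config above is saved at the hypothetical path exp/base_test.yaml:

from hydra.utils import instantiate
from omegaconf import OmegaConf

cfg = OmegaConf.load("exp/base_test.yaml")  # hypothetical path to the config above
adm = instantiate(cfg.model.model)  # builds the AudioDiffusionModel exactly as configured for training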

from audio_diffusion_pytorch import KarrasSchedule, VSampler
import torchaudio
import math 

sampling_rate = 48000
# @markdown Generation length in seconds (rounded up to the nearest power of two of sampling_rate * length samples)
length = 10 #@param {type: "slider", min: 1, max: 87, step: 1}
length_samples = math.ceil(math.log2(length * sampling_rate))  # exponent of the smallest power of two >= length * sampling_rate
# @markdown Number of samples to generate 
num_samples = 5 #@param {type: "slider", min: 1, max: 16, step: 1}
# @markdown Number of diffusion steps (higher tends to be better but takes longer to generate)
num_steps = 100 #@param {type: "slider", min: 1, max: 200, step: 1}

with torch.no_grad():
    samples = adm.sample(
        noise=torch.randn((num_samples, 2, 2 ** length_samples), device='cuda'),
        num_steps=num_steps,
        sigma_schedule=KarrasSchedule(
            sigma_min=1e-4, 
            sigma_max=10.0,
            rho=7.0
        ),
        sampler=VSampler(),
    )

# Save audio samples
for i, sample in enumerate(samples):
    cpu_sample = sample.cpu()
    torchaudio.save(f'./audio_sample_{i}.wav', cpu_sample, sampling_rate)
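For what it's worth, a quick check on the raw tensors (plain PyTorch, nothing library-specific) can rule out a shape mix-up or clipping before listening:

print(samples.shape)        # expect (num_samples, 2, 2 ** length_samples)
print(samples.abs().max())  # peaks far above 1.0 would clip when written to WAV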

All I get out is this strange buzz: https://soundcloud.com/itsoksami/audio-sample-0/s-QuWAjmeS7OK?si=9a7dbf264ad74915aa872c4043d09196&utm_source=clipboard&utm_medium=text&utm_campaign=social_sharing

Is there anything I'm doing blatantly wrong here?

Originally posted by @Ericxgao in archinetai/audio-diffusion-pytorch#29

I was using the wrong sigma_schedule: LinearSchedule() (the same schedule as in the training config) works.
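For anyone landing here later: the training config pairs VSampler with LinearSchedule, so inference should use the same schedule. A minimal corrected sampling call, reusing adm, num_samples, length_samples, and num_steps from the script above:

from audio_diffusion_pytorch import LinearSchedule, VSampler

with torch.no_grad():
    samples = adm.sample(
        noise=torch.randn((num_samples, 2, 2 ** length_samples), device='cuda'),
        num_steps=num_steps,
        sigma_schedule=LinearSchedule(),  # matches diffusion_schedule in the training config
        sampler=VSampler(),
    )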