duet-svs-diffusion


Unsupervised Singing Voice Separation

Source code of the paper Zero-Shot Duet Singing Voices Separation with Diffusion Models, presented at the SDX Workshop 2023.

Setup

Install requirements

pip install -r requirements.txt

Add environment variables: rename .env.tmp to .env and replace the values with your own (the example values below are random).

DIR_LOGS=/logs
DIR_DATA=/data

# Required if using wandb logger
WANDB_PROJECT=audioproject
WANDB_ENTITY=johndoe
WANDB_API_KEY=a21dzbqlybbzccqla4txa21dzbqlybbzccqla4tx
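
If you want to check that the variables are picked up in Python, here is a minimal sketch assuming the project reads .env files with python-dotenv (an assumption; adapt if the environment is loaded differently):

# Minimal sanity check that the .env variables are visible (assumes python-dotenv).
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
print(os.environ.get("DIR_LOGS"), os.environ.get("DIR_DATA"))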

Training

The config we used for the paper is exp/singing.yaml; you can run it with

python train.py exp=singing

You'll need to download the relevant datasets and resample them to 24 kHz. Then, modify the datamodule section of the config to point to the right path.
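
For example, a hedged sketch of resampling a directory of WAV files to 24 kHz with torchaudio (the paths and the torchaudio dependency are assumptions; adapt to your dataset layout):

# Resample every WAV file under SRC to 24 kHz and mirror the tree under DST.
from pathlib import Path
import torchaudio

SRC = Path("/data/singing_raw")   # hypothetical source directory
DST = Path("/data/singing_24k")   # hypothetical output directory
TARGET_SR = 24000

for wav_path in SRC.rglob("*.wav"):
    waveform, sr = torchaudio.load(str(wav_path))
    if sr != TARGET_SR:
        waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)
    out_path = DST / wav_path.relative_to(SRC)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    torchaudio.save(str(out_path), waveform, TARGET_SR)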

Resume run from a checkpoint

python train.py exp=singing +ckpt=/logs/ckpts/2022-08-17-01-22-18/'last.ckpt'

Evaluation

First, download the MedleyVox dataset. Then, run the following command to evaluate the model on the duet subset of the dataset.

python eval.py logs/runs/XXXX/.hydra/config.yaml logs/ckpts/XXXX/last.ckpt /your/path/to/MedleyVox -T 100 --cond --hop-length 32768 --self-cond --retry 2

Some important arguments:

  1. -T: number of diffusion steps
  2. --cond: use auto-regressive conditioning on the ground truth (teacher forcing). Without this flag, the model generates the full-length audio at once
  3. --self-cond: perform auto-regressive conditioning on the generated audio; only takes effect when used together with --cond
  4. --hop-length: the hop length of the moving window
  5. --window: the size of the moving window. Defaults to the same length as the training data
  6. --retry: number of retries for each auto-regressive step. The algorithm generates retry + 1 candidates and picks the one most similar to the ground truth. Defaults to 0 (see the sketch below)
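
To make the interaction between --window, --hop-length, --cond/--self-cond and --retry concrete, here is a rough sketch of the windowed auto-regressive loop. It is an assumption-heavy illustration, not the actual eval.py logic: sample_segment is a hypothetical stand-in for the conditional diffusion sampler, and SI-SDR is assumed as the similarity measure for picking the best retry candidate.

# Rough sketch (not eval.py): windowed auto-regressive sampling with retries.
import torch

def si_sdr(est, ref, eps=1e-8):
    # Scale-invariant SDR, used here as the similarity score (assumption).
    alpha = (est * ref).sum() / (ref.pow(2).sum() + eps)
    noise = est - alpha * ref
    return 10 * torch.log10((alpha * ref).pow(2).sum() / (noise.pow(2).sum() + eps))

def separate_autoregressive(mixture, reference, sample_segment, window, hop, retry=2):
    # `mixture` and `reference` are 1-D tensors; `sample_segment` is a
    # hypothetical callable that samples one window from the diffusion model,
    # conditioned on the already-generated audio (the --cond --self-cond case).
    out = torch.zeros_like(reference)
    for start in range(0, mixture.shape[-1] - window + 1, hop):
        context = out[start:start + window - hop]          # previously generated audio
        candidates = [sample_segment(mixture[start:start + window], context)
                      for _ in range(retry + 1)]
        # Keep the candidate whose overlap region is closest to the ground truth.
        scores = torch.stack([si_sdr(c[:window - hop],
                                     reference[start:start + window - hop])
                              for c in candidates])
        out[start:start + window] = candidates[int(scores.argmax())]
    return out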

For other arguments, please check out the code.

NMF baseline

This baseline depends on torchnmf.

python eval_nmf.py /your/path/to/MedleyVox/ --thresh 0.08 --division 10 --kernel-size 7
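
For reference, here is a conceptual sketch of the factorisation this baseline builds on, written in plain PyTorch with multiplicative updates. It is not the torchnmf-based eval_nmf.py implementation, only an illustration of NMF on a magnitude spectrogram.

# Conceptual NMF sketch: factor a non-negative magnitude spectrogram V
# (freq x time) into W (freq x rank) @ H (rank x time) with multiplicative updates.
import torch

def nmf(V, rank=32, n_iter=200, eps=1e-8):
    F, T = V.shape
    W = torch.rand(F, rank)
    H = torch.rand(rank, T)
    for _ in range(n_iter):
        H *= (W.t() @ V) / (W.t() @ W @ H + eps)
        W *= (V @ H.t()) / (W @ H @ H.t() + eps)
    return W, H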

Checkpoint/Logs

Our pre-trained singing voice diffusion model can be downloaded here. You can find the training logs and unconditional singing samples generated during training on wandb.

FAQ

How do I load the model once I'm done training?

If you want to load the checkpoint to resume training with the trainer, you can run python train.py exp=my_experiment +ckpt=/logs/ckpts/2022-08-17-01-22-18/'last.ckpt'.

Otherwise if you want to instantiate a model from the checkpoint:

from main.mymodule import Model

model = Model.load_from_checkpoint(
    checkpoint_path='my_checkpoint.ckpt',
    learning_rate=1e-4,
    beta1=0.9,
    beta2=0.99,
    in_channels=1,
    patch_size=16,
    # ...all other hyperparameters used during training
)

To get only the plain PyTorch .pt checkpoint, you can save the internal model weights with torch.save(model.model.state_dict(), 'torchckpt.pt').
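
A short sketch of saving the weights and loading them back later (the model.model attribute follows the snippet above; everything else is standard PyTorch):

import torch

# Save only the inner model's weights as a plain PyTorch checkpoint...
torch.save(model.model.state_dict(), 'torchckpt.pt')

# ...and load them back into a freshly instantiated model later.
state_dict = torch.load('torchckpt.pt', map_location='cpu')
model.model.load_state_dict(state_dict)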

Why is no checkpoint created at the end of the epoch?

If the epoch is shorter than log_every_n_steps, the checkpoint is not saved at the end of the epoch but after the provided number of steps. If you want to checkpoint more frequently, you can add every_n_train_steps to the ModelCheckpoint, e.g.:

model_checkpoint:
    _target_: pytorch_lightning.callbacks.ModelCheckpoint
    monitor: "valid_loss"   # name of the logged metric which determines when model is improving
    save_top_k: 1           # save k best models (determined by above metric)
    save_last: True         # additionally always save model from last epoch
    mode: "min"             # can be "max" or "min"
    verbose: False
    dirpath: ${logs_dir}/ckpts/${now:%Y-%m-%d-%H-%M-%S}
    filename: '{epoch:02d}-{valid_loss:.3f}'
    every_n_train_steps: 10

Note that saving checkpoints this frequently is generally not recommended, since writing the file takes some time.