Repository containing the code and procedures to reproduce the ICASSP 2022 submission "Podcastmix: A dataset for separating music and speech in podcasts". All links to download the dataset and to train, evaluate and separate podcasts are included here. Feel free to use the dataset for any other purpose.
If you don't want to train and evaluate the network, but only use the pretrained models, we have uploaded them to a separate repository, so you don't have to download the dataset and can jump straight to separating your podcasts.
Download the directory structure with the test sets (podcastmix-real-no-reference and podcastmix-real-with-reference):
/bin/bash download_podcastmix.sh
The train set of the dataset is hosted here (~480 GB).
We provide a script to download each of the files quickly, but it requires that you obtain an OAuth2 access token from the Google OAuth 2.0 Playground:
- Go to OAuth 2.0 Playground
- In the “Select the Scope” box, scroll down, expand “Drive API v3”, and select
https://www.googleapis.com/auth/drive.readonly
- Click “Authorize APIs” and then “Exchange authorization code for tokens”. Copy the “Access token”.
- Run the following script using the "Access token" as a parameter:
/bin/bash download_podcastmix_synth_set.sh <Access token>
NOTE: The Access Token expires after 1 hour, so if your connection is not fast enough to download all the compressed files within that time, you will have to refresh the Access Token and re-run the script with the new one. The script will only download the remaining files, not all of them again.
Create a conda environment:
conda env create -f environment.yml
Activate the environment:
conda activate Podcastmix
`[MODEL]` can be `ConvTasNet` or `UNet`.
You can specify the GPUs to train on using `CUDA_VISIBLE_DEVICES`.
NOTE: If you want to train the ConvTasNet using 44100 as `sample_rate`, you will probably need to use at least 2 GPUs due to memory limitations.
CUDA_VISIBLE_DEVICES=0,1 python train.py \
--config_model [MODEL]_model/[MODEL]_config.yml
After each epoch, the system evaluates the model and keeps the 10 best models so far as checkpoints inside `exp/tmp/checkpoints`.
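For reference, this top-k behaviour matches what PyTorch Lightning's `ModelCheckpoint` callback provides; below is a minimal sketch of such a setup, where the monitored metric name and directory are assumptions for illustration and may differ from what `train.py` actually configures:

```python
# Minimal sketch of keeping the 10 best checkpoints with
# PyTorch Lightning's ModelCheckpoint callback.
# The monitor key and dirpath here are assumptions for illustration.
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint = ModelCheckpoint(
    dirpath="exp/tmp/checkpoints",  # where the .ckpt files are written
    monitor="val_loss",             # validation metric used to rank models (assumed)
    mode="min",                     # lower validation loss = better model
    save_top_k=10,                  # keep only the 10 best checkpoints
)
# The callback is then passed to the trainer, e.g.:
# trainer = pl.Trainer(callbacks=[checkpoint])
```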
If you want to resume training from a previously saved checkpoint, you can do so using the `--resume_from` flag:
CUDA_VISIBLE_DEVICES=0,1 python train.py \
--config_model [MODEL]_model/[MODEL]_config.yml \
--resume_from=<path-to-checkpoint-file>
Example:
CUDA_VISIBLE_DEVICES=0,1 python train.py \
--config_model UNet_model/UNet_config.yml \
--resume_from=UNet_model/exp/tmp/checkpoints/epoch\=1-step\=30985.ckpt
CUDA_VISIBLE_DEVICES=0,1 python test.py --target_model [MODEL] --test_dir podcastmix/podcastmix-synth/metadata/test/ --out_dir=separations --exp_dir=[MODEL]_model/exp/tmp/ --n_save_ex=20 --use_gpu=1
This script will separate the podcastmix-real-with-reference set, which consists of the mixes and the ground-truth files for music and speech. It will forward the mixes through the network, estimate the separated sources, and evaluate the objective metrics against the ground truth.
CUDA_VISIBLE_DEVICES=0,1 python test_real.py --target_model [MODEL] --test_dir podcastmix/podcastmix-real-with-reference/metadata --out_dir=separations --exp_dir=[MODEL]_model/exp/tmp/ --n_save_ex=-1 --use_gpu=1
- `--out_dir`: Name of the folder where the separated audios will be saved. The folder will be created inside the `--exp_dir` directory.
- `--n_save_ex`: Number of examples to actually save to the local directory. If -1, all the examples will be saved.
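As an illustration of what such an objective metric looks like, here is a minimal NumPy sketch of SI-SDR, a standard source-separation metric; it is not necessarily the exact metric implementation used by `test_real.py`:

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to isolate the target component
    target = np.dot(estimate, reference) / (np.dot(reference, reference) + eps) * reference
    noise = estimate - target
    return 10 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps))
```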
You can use your previously trained model, or download pre-trained models from the other repository, to separate your own podcasts:
CUDA_VISIBLE_DEVICES=0,1 python forward_podcast.py \
--test_dir=<directory-of-the-podcastmix-real-no-reference-or-your-own-files> --target_model=[MODEL] \
--exp_dir=[MODEL]_model/exp/tmp --out_dir=separations \
--segment=18 --sample_rate=44100 --use_gpu=1
- `--segment`: Size in seconds of the audio segments input to the network and separated. Due to the size of the convolutions, the UNet only supports segments of 2 + 16*i seconds (2, 18, 34, 50, ...); see the snippet after this list. ConvTasNet supports segments of any size.
- `--out_dir`: Name of the folder where the separated audios will be saved. The folder will be created inside the `--exp_dir` directory.
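The 2 + 16*i constraint can be checked with a small helper; this is a hypothetical snippet for illustration, not part of the repository:

```python
def nearest_valid_unet_segment(seconds):
    """Round a requested duration to the nearest valid UNet segment (2 + 16*i)."""
    i = max(0, round((seconds - 2) / 16))
    return 2 + 16 * i

print(nearest_valid_unet_segment(18))  # 18 (already valid)
print(nearest_valid_unet_segment(25))  # 18 (nearest valid segment)
```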